AI-in-the-Loop
Reduce review volume by letting AI score, suggest, and pre-triage work before it reaches a person.
What This Means in M3 Forge
AI-in-the-Loop extends automation beyond generation and extraction. In M3 Forge, AI can also:
- Evaluate outputs with workspace evaluators and guardrails
- Suggest annotations during processor labeling
- Score traces and sessions in observability flows
- Compare prompt variants before promotion
- Route only the hardest exceptions to human reviewers
This pattern gives teams a middle layer between “fully automatic” and “fully manual.”
Where It Shows Up
| Surface | AI-in-the-loop behavior |
|---|---|
| Prompt Compare | Score prompt variants with saved evaluators before rollout |
| Guardrails | Route low-quality outputs to retry, fallback, or review |
| Observability | Run scheduled evaluators against traces and sessions, then persist the scores as annotations |
| Processor Labeling | Suggest annotations to accelerate dataset creation |
| HITL | Deliver smaller, better-prioritized exception queues to people |
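The Guardrails row describes a three-way escalation: retry, fallback, or review. As a rough sketch of how such a decision could be expressed, the function below routes one scored output. Every name here (`GuardrailAction`, `GuardrailPolicy`, `routeOutput`) is hypothetical and used only for illustration; none of it is taken from the M3 Forge API, and the escalation order is an assumption.

```typescript
// Hypothetical types for illustration; not part of the M3 Forge API.
type GuardrailAction = "pass" | "retry" | "fallback" | "review";

interface GuardrailPolicy {
  passThreshold: number;      // scores at or above this value are accepted
  maxRetries: number;         // retries allowed before escalating further
  fallbackAvailable: boolean; // whether a fallback prompt or model is configured
}

// Decide what happens to a single scored output.
// Assumed escalation order: pass, then retry, then fallback, then human review.
function routeOutput(
  score: number,
  attempt: number,
  fallbackUsed: boolean,
  policy: GuardrailPolicy
): GuardrailAction {
  if (score >= policy.passThreshold) return "pass";
  if (attempt < policy.maxRetries) return "retry";
  if (policy.fallbackAvailable && !fallbackUsed) return "fallback";
  return "review";
}

const examplePolicy: GuardrailPolicy = {
  passThreshold: 0.85,
  maxRetries: 1,
  fallbackAvailable: true,
};

// A weak output on its first attempt is retried before anything else happens.
const firstAttempt = routeOutput(0.5, 0, false, examplePolicy);
// Once retries and the fallback are exhausted, the item goes to a reviewer.
const exhausted = routeOutput(0.5, 1, true, examplePolicy);
```

Keeping the decision in one pure function makes the escalation order easy to audit and to test without running a workflow.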
Example: Review Queue Compression
1. Generate or extract the first-pass result. Run the normal workflow, processor, or agent flow.
2. Score the output. Apply evaluators such as faithfulness, relevance, schema checks, or custom code evaluators.
3. Annotate production behavior. Persist evaluator results into trace or session annotations so teams can see quality patterns over time.
4. Send only weak cases to humans. Use threshold failures, missing fields, or policy exceptions to determine which items enter the HITL queue.
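The final step above can be sketched as a small triage function. This is a minimal illustration, not M3 Forge code: the `ScoredItem` shape, the `triage` helper, and the sample data are all hypothetical, and the weakest-first ordering of the human queue is an assumption about what "better-prioritized" might mean.

```typescript
// Hypothetical item shape for illustration; not an M3 Forge type.
interface ScoredItem {
  id: string;
  score: number;            // evaluator score in [0, 1]
  missingFields: string[];  // required fields the first pass left empty
  policyException: boolean; // flagged by a policy rule
}

interface TriageResult {
  autoAccepted: ScoredItem[];
  humanQueue: ScoredItem[];
}

// Split scored items into auto-accepted results and a human review queue.
function triage(items: ScoredItem[], threshold: number): TriageResult {
  const autoAccepted: ScoredItem[] = [];
  const humanQueue: ScoredItem[] = [];
  for (const item of items) {
    const needsHuman =
      item.score < threshold ||
      item.missingFields.length > 0 ||
      item.policyException;
    if (needsHuman) {
      humanQueue.push(item);
    } else {
      autoAccepted.push(item);
    }
  }
  // Assumption: reviewers see the lowest-scoring items first.
  humanQueue.sort((a, b) => a.score - b.score);
  return { autoAccepted, humanQueue };
}

// Made-up sample data, purely for illustration.
const sample: ScoredItem[] = [
  { id: "inv-1", score: 0.95, missingFields: [], policyException: false },
  { id: "inv-2", score: 0.6, missingFields: [], policyException: false },
  { id: "inv-3", score: 0.92, missingFields: ["due_date"], policyException: false },
  { id: "inv-4", score: 0.9, missingFields: [], policyException: true },
  { id: "inv-5", score: 0.88, missingFields: [], policyException: false },
];

const triaged = triage(sample, 0.85);
```

In this sample, two of the five items pass automatically and three enter the human queue, ordered `inv-2`, `inv-4`, `inv-3`.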
Example Evaluator Configuration
{
  "name": "invoice_quality",
  "kind": "ts_code",
  "target": "trace",
  "threshold": 0.85,
  "dimensions": ["schema", "field_completeness", "groundedness"]
}
Good Fits
- Reduce prompt-comparison guesswork before shipping a new prompt version
- Score production traces continuously instead of waiting for manual QA sampling
- Speed up extractor training with AI-generated annotation suggestions
- Keep humans focused on novel errors rather than obvious passes
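For the continuous trace scoring fit, the sketch below shows one way a code evaluator shaped like the `invoice_quality` configuration could turn per-dimension scores into a score annotation. The `EvaluatorConfig` and `ScoreAnnotation` interfaces and the `annotateTrace` function are hypothetical, and combining dimensions with an unweighted mean is an assumption; the source does not say how dimensions are aggregated or what signature a `ts_code` evaluator must have.

```typescript
// Hypothetical shapes for illustration; not the M3 Forge evaluator contract.
interface EvaluatorConfig {
  name: string;
  threshold: number;
  dimensions: string[];
}

interface ScoreAnnotation {
  traceId: string;
  evaluator: string;
  score: number;
  passed: boolean;
  dimensionScores: Record<string, number>;
}

// Mirrors the name, threshold, and dimensions of the example configuration.
const invoiceQuality: EvaluatorConfig = {
  name: "invoice_quality",
  threshold: 0.85,
  dimensions: ["schema", "field_completeness", "groundedness"],
};

// Build an annotation for one trace from its per-dimension scores.
// Assumption: the overall score is the unweighted mean of the dimensions.
function annotateTrace(
  traceId: string,
  dimensionScores: Record<string, number>,
  cfg: EvaluatorConfig
): ScoreAnnotation {
  let total = 0;
  for (const dim of cfg.dimensions) {
    const value = dimensionScores[dim];
    if (value === undefined) {
      throw new Error(`missing score for dimension "${dim}"`);
    }
    total += value;
  }
  const score = total / cfg.dimensions.length;
  return {
    traceId,
    evaluator: cfg.name,
    score,
    passed: score >= cfg.threshold,
    dimensionScores,
  };
}

// Made-up dimension scores for a single trace.
const annotation = annotateTrace(
  "trace-001",
  { schema: 1.0, field_completeness: 0.9, groundedness: 0.8 },
  invoiceQuality
);
```

Storing the per-dimension scores alongside the aggregate keeps the annotation useful for spotting which dimension drives failures over time.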
Human-in-the-Loop vs AI-in-the-Loop
| Capability | Primary actor | Best used for |
|---|---|---|
| Human-in-the-Loop | Reviewer | Approvals, policy calls, business exceptions |
| AI-in-the-Loop | Evaluator or assistant model | Scoring, suggestion, filtering, pre-triage |
The strongest systems use both. AI shrinks the queue. Humans resolve what still needs judgment.
AI-in-the-loop is most valuable when it is observable. Score annotations, evaluator history, and prompt comparisons turn reviewer effort into measurable signal instead of hidden labor.