LLM Scheduler
LLM Scheduler controls how queued model work is dispatched inside a Runtime Fabric group. It lets operators keep a catch-all default lane, add workload-specific pools, and tune fairness without changing gateway membership.
The scheduler is scoped to a fabric group. Executors submit LLM work to a pool, the gateway runtime reads the group’s scheduler config, and the dispatcher selects queued requests from the eligible pools.
When To Use Pools
Use more than the default pool when different workloads should share the same Runtime Fabric group but should not compete as a single queue.
| Pool pattern | Example |
|---|---|
| Interactive | Chat, operator actions, short agent turns |
| Document work | Extractors, vision-heavy prompts, batch document review |
| Backfill | Reprocessing, replay, non-urgent enrichment |
| Experimental | New provider endpoint, model, or routing policy |
The default pool is the catch-all lane. Untargeted requests should always be able to land there.
Scheduler Fields
Policy
drr uses Deficit Round Robin and is the normal policy for pool-aware dispatch. fifo ignores pool weights and processes a single queue in arrival order.
Global Dispatch Limit
The total number of in-flight requests the dispatcher can run across all pools in the fabric group.
Blank in Studio stores 0, which means “use the runtime default.” The runtime default comes from LLM_QUEUE_MAX_BATCH_ITEMS; if it is not set, Marie uses the queue config default.
Use an explicit number only when this fabric group needs a tighter global cap than the gateway runtime default.
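The fallback order described above can be sketched as a small resolver. This is an illustration only: the function name, the `queue_config_default` parameter, and the integer parsing of the environment variable are assumptions, not the runtime's actual code.

```python
import os
from typing import Mapping, Optional


def effective_dispatch_limit(
    configured: int,
    queue_config_default: int,
    env: Optional[Mapping[str, str]] = None,
) -> int:
    """Resolve the global dispatch limit for one fabric group (hypothetical helper)."""
    env = os.environ if env is None else env
    # An explicit positive value stored by Studio wins.
    if configured > 0:
        return configured
    # 0 (blank in Studio) means "use the runtime default".
    raw = env.get("LLM_QUEUE_MAX_BATCH_ITEMS", "").strip()
    if raw:
        return int(raw)  # assumption: the variable holds a plain integer
    # Otherwise fall back to the queue config default, whatever it is.
    return queue_config_default
```

For example, `effective_dispatch_limit(0, 32, env={"LLM_QUEUE_MAX_BATCH_ITEMS": "16"})` resolves to the environment value, while a configured value of 8 is returned unchanged.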
Pool Fields
| Field | Meaning |
|---|---|
| Pool ID | Stable queue lane identifier used by producers and the dispatcher. |
| Display Name | Human-readable label shown in Studio and runtime snapshots. |
| Endpoint URL | Optional OpenAI-compatible backend URL for this pool. Blank uses the gateway runtime backend. |
| Weight | DRR scheduling weight. Higher values receive more dispatch credit when pools compete. |
| Protected Running | Minimum in-flight slots preserved for this pool while it has backlog. Idle protected capacity can be borrowed. |
| Max Running | Hard per-pool in-flight cap. Blank means the pool has no cap beyond the global dispatch limit. |
| Max Burst | Maximum requests the pool may launch during one scheduler visit. Blank lets accumulated DRR credit decide. |
| Sort Order | Display and snapshot ordering. It does not change dispatch share. |
Studio calls the DRR quantum field “Weight” because operators usually need the relative share, not the scheduler-internal name.
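As a reading aid, the pool fields can be modeled as a small record with blank values represented as `None`. The class name, attribute names, and validation rules below are hypothetical; they mirror the table, not the stored schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PoolConfig:
    """Illustrative model of one pool row (names are assumptions)."""
    pool_id: str
    display_name: str = ""
    endpoint_url: Optional[str] = None   # None = use the gateway runtime backend
    weight: int = 1                      # DRR quantum, shown as "Weight" in Studio
    protected_running: int = 0
    max_running: Optional[int] = None    # None = no cap beyond the global limit
    max_burst: Optional[int] = None      # None = accumulated DRR credit decides
    sort_order: int = 0                  # display only, no effect on dispatch

    def validate(self) -> List[str]:
        """Return sanity-check messages (assumed rules, not the product's)."""
        problems = []
        if not self.pool_id:
            problems.append("pool_id is required")
        if self.weight < 1:
            problems.append("weight must be at least 1")
        if self.protected_running < 0:
            problems.append("protected_running cannot be negative")
        if self.max_running is not None and self.max_running < 1:
            problems.append("max_running must be at least 1 when set")
        if self.max_burst is not None and self.max_burst < 1:
            problems.append("max_burst must be at least 1 when set")
        if self.max_running is not None and self.protected_running > self.max_running:
            problems.append("protected_running exceeds max_running")
        return problems
```

`PoolConfig("interactive", weight=4, protected_running=1).validate()` returns an empty list under these assumed rules.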
How Weight Works
Weight is relative. It is not a concurrency limit.
If both pools always have queued work and requests have equal cost:
| Pool | Weight | Approximate dispatch share |
|---|---|---|
| interactive | 4 | 4 requests |
| backfill | 1 | 1 request |
The selection pattern is roughly:
interactive
interactive
interactive
interactive
backfill
If a pool is idle, other pools can use the available global capacity. When the idle pool gets backlog again, DRR credit and protected capacity bring it back into the rotation.
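The 4:1 pattern can be reproduced with a textbook Deficit Round Robin loop. This is a minimal sketch that treats Weight as the per-visit quantum and ignores capacity limits; the function name and data shapes are assumptions, and the real dispatcher may differ in detail.

```python
from collections import deque
from typing import Deque, Dict, List


def drr_order(queues: Dict[str, Deque[int]], weights: Dict[str, int], rounds: int) -> List[str]:
    """Return the pool IDs in dispatch order; each queue holds request costs."""
    deficit = {pool: 0 for pool in queues}
    order: List[str] = []
    for _ in range(rounds):
        for pool, queue in queues.items():
            if not queue:
                deficit[pool] = 0  # idle pools do not bank credit
                continue
            deficit[pool] += weights[pool]  # one quantum per visit
            # Dispatch while the head request fits in the accumulated credit.
            while queue and queue[0] <= deficit[pool]:
                deficit[pool] -= queue.popleft()
                order.append(pool)
            if not queue:
                deficit[pool] = 0
    return order


queues = {"interactive": deque([1] * 8), "backfill": deque([1] * 8)}
weights = {"interactive": 4, "backfill": 1}
print(drr_order(queues, weights, rounds=1))
# ['interactive', 'interactive', 'interactive', 'interactive', 'backfill']
```

With equal-cost requests, each round dispatches four interactive requests for every backfill request, which matches the table above.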
Request Cost
DRR does not treat every request as identical. Each queued request has a cost in scheduler units:
- explicit estimated_cost_units wins when provided
- image requests add cost
- document page count can add cost
- cost is clamped between 1 and 16
A pool must accumulate enough credit to dispatch the request at the head of its queue. This keeps large document or image-heavy requests from consuming the same scheduling share as very small text-only requests.
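The cost rules above can be sketched as a single function. Only the precedence of `estimated_cost_units` and the 1 to 16 clamp come from the text; the per-image and per-page increments (`IMAGE_COST`, `PAGES_PER_UNIT`) are made-up values for illustration, and clamping the explicit value is also an assumption.

```python
from typing import Optional

MIN_COST, MAX_COST = 1, 16   # clamp range stated in the text
IMAGE_COST = 2               # assumption: units added per image
PAGES_PER_UNIT = 4           # assumption: pages that add one unit


def request_cost(
    estimated_cost_units: Optional[int] = None,
    image_count: int = 0,
    page_count: int = 0,
) -> int:
    """Estimate scheduler cost units for one queued request (illustrative)."""
    if estimated_cost_units is not None:
        raw = estimated_cost_units  # an explicit estimate wins
    else:
        raw = 1 + IMAGE_COST * image_count + page_count // PAGES_PER_UNIT
    return max(MIN_COST, min(MAX_COST, int(raw)))
```

Under these assumed increments, a text-only request costs 1 unit, so a pool with weight 1 can dispatch it on its next visit, while a request that clamps to 16 needs 16 visits' worth of credit in the same pool.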
Capacity Fields
Use the capacity fields when weight alone is not enough.
| Goal | Setting |
|---|---|
| Keep a pool from being starved | Set Protected Running above 0 |
| Stop a noisy pool from filling all slots | Set Max Running |
| Avoid one visit launching many small requests | Set Max Burst |
| Keep default behavior | Leave Max Running and Max Burst blank |
Protected capacity is not wasted. If the protected pool has no backlog, other pools may borrow those slots.
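One way to picture how the capacity fields interact is a function that answers "how many requests may this pool launch right now?". The sketch below is an assumed model, not the dispatcher's code: the dictionary keys and the rule that protected slots are held back only for other pools with backlog follow the descriptions in this section.

```python
from typing import Dict, Optional


def launchable(pool: str, pools: Dict[str, Dict[str, Optional[int]]], global_limit: int) -> int:
    """Slots `pool` may fill now. Each pool dict has the hypothetical keys:
    running, backlog, protected, max_running, max_burst (None = blank)."""
    free = global_limit - sum(p["running"] for p in pools.values())
    # Hold back protected slots for OTHER pools only while they have backlog;
    # with no backlog, their protected capacity can be borrowed.
    reserved = sum(
        max(0, p["protected"] - p["running"])
        for name, p in pools.items()
        if name != pool and p["backlog"] > 0
    )
    me = pools[pool]
    allowed = max(0, free - reserved)
    if me["max_running"] is not None:   # hard per-pool cap
        allowed = min(allowed, max(0, me["max_running"] - me["running"]))
    if me["max_burst"] is not None:     # per-visit launch cap
        allowed = min(allowed, me["max_burst"])
    return min(allowed, me["backlog"])


pools = {
    "interactive": dict(running=0, backlog=0, protected=1, max_running=None, max_burst=None),
    "default": dict(running=3, backlog=5, protected=0, max_running=None, max_burst=None),
}
print(launchable("default", pools, global_limit=4))  # 1: borrows the idle protected slot
```

If `interactive` then receives backlog, the same call for `default` returns 0, because the last free slot is held for the protected pool.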
Predefined Pool Profiles
Studio includes an Apply Pool Profile action on the LLM Scheduler page. It previews the selected profile before applying it, then creates or updates only the listed pools and keeps the default pool as the catch-all lane.
Document S/M/L
Use this when producers can classify document work by size before enqueueing.
| Pool | Weight | Protected Running | Max Running | Max Burst | Use when |
|---|---|---|---|---|---|
| document-small | 4 | 1 | 4 | 2 | Short text-first documents should move quickly. |
| document-medium | 2 | 1 | 3 | 1 | Normal extraction and classification work needs steady throughput. |
| document-large | 1 | 0 | 1 | 1 | Large, image-heavy, or expensive documents should be paced. |
Interactive + Backfill
Use this when operator-facing work should stay responsive while background jobs continue to make progress.
| Pool | Weight | Protected Running | Max Running | Max Burst | Use when |
|---|---|---|---|---|---|
| interactive | 4 | 1 | Blank | Blank | Chat, operator actions, or short agent turns need priority. |
| backfill | 1 | 0 | 2 | 1 | Replay, reprocessing, or non-urgent enrichment should keep draining without filling every slot. |
Document Pipeline
Use this when producers know the document-processing stage before enqueueing.
| Pool | Weight | Protected Running | Max Running | Max Burst | Use when |
|---|---|---|---|---|---|
| document-extract | 3 | 1 | 3 | 1 | Extraction gets steady capacity for higher-cost model calls. |
| document-validate | 2 | 1 | 2 | 1 | Correction and retry work should not block extraction. |
| document-enrich | 1 | 0 | 2 | 1 | Optional or downstream prompts should be paced. |
Requests without a size-specific, stage-specific, or workload-specific pool should still flow into default.
The default pool remains useful for unclassified, legacy, manual, or fallback work. It is not a wildcard over every possible pool name: producers should route unmatched work to default before enqueueing.
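Because default is not a wildcard, the fallback has to happen on the producer side. A minimal sketch, assuming the producer knows the set of configured pool IDs; `resolve_pool` and its parameters are hypothetical names, not part of any client API:

```python
from typing import Iterable, Optional


def resolve_pool(
    requested: Optional[str],
    configured_pools: Iterable[str],
    default_pool: str = "default",
) -> str:
    """Pick the pool to enqueue into, falling back to the catch-all lane."""
    if requested and requested in set(configured_pools):
        return requested
    # Unclassified, legacy, or unmatched work goes to the default pool.
    return default_pool


configured = ["default", "document-small", "document-medium", "document-large"]
print(resolve_pool("document-small", configured))  # document-small
print(resolve_pool("document-huge", configured))   # default
```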
Starter Configurations
Single default pool
Use this when all LLM work can share one lane.
| Pool | Weight | Protected Running | Max Running | Max Burst |
|---|---|---|---|---|
| default | 1 | 0 | Blank | Blank |
Interactive plus backfill
Use this when operator-facing work should stay responsive while background jobs continue to make progress.
| Pool | Weight | Protected Running | Max Running | Max Burst |
|---|---|---|---|---|
| interactive | 4 | 1 | Blank | Blank |
| backfill | 1 | 0 | 2 | 1 |
| default | 1 | 0 | Blank | Blank |
Document extraction lane
Use this when document or image-heavy requests have higher scheduler cost.
| Pool | Weight | Protected Running | Max Running | Max Burst |
|---|---|---|---|---|
| document-extract | 2 | 1 | 3 | 1 |
| interactive | 4 | 1 | Blank | Blank |
| default | 1 | 0 | Blank | Blank |
Operational Notes
- Keep the default pool enabled as the catch-all lane.
- Increase weight to change relative share.
- Use protected running for service guarantees.
- Use max running to cap noisy pools.
- Use endpoint URL only when a pool should use a different backend from the gateway default.
- Watch the LLM Dispatch Runtime page to confirm queue depth, in-flight work, latency, and failures.