LLM Scheduler

LLM Scheduler controls how queued model work is dispatched inside a Runtime Fabric group. It lets operators keep a catch-all default lane, add workload-specific pools, and tune fairness without changing gateway membership.

The scheduler is scoped to a fabric group. Executors submit LLM work to a pool, the gateway runtime reads the group’s scheduler config, and the dispatcher selects queued requests from the eligible pools.

When To Use Pools

Add pools beyond the default when different workloads share the same Runtime Fabric group but should not compete in a single queue.

Pool pattern	Example
Interactive	Chat, operator actions, short agent turns
Document work	Extractors, vision-heavy prompts, batch document review
Backfill	Reprocessing, replay, non-urgent enrichment
Experimental	New provider endpoint, model, or routing policy

The default pool is the catch-all lane. Untargeted requests should always be able to land there.

Scheduler Fields

Policy

drr uses Deficit Round Robin and is the normal policy for pool-aware dispatch. fifo ignores pool weights and processes a single queue in arrival order.

Global Dispatch Limit

The global dispatch limit is the total number of in-flight requests the dispatcher can run across all pools in the fabric group.

Leaving the field blank in Studio stores 0, which means “use the runtime default.” The runtime default comes from LLM_QUEUE_MAX_BATCH_ITEMS; if LLM_QUEUE_MAX_BATCH_ITEMS is not set, Marie uses the queue config default.

Use an explicit number only when this fabric group needs a tighter global cap than the gateway runtime default.
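The fallback order described above can be sketched as a small resolver. This is only an illustration: it assumes LLM_QUEUE_MAX_BATCH_ITEMS is read from the process environment, and both `effective_dispatch_limit` and `QUEUE_CONFIG_DEFAULT` (including its value) are hypothetical names, not part of Marie. It also does not model whether the runtime clamps an explicit value that is larger than the runtime default.

```python
import os

# Hypothetical placeholder: the real queue config default is not stated on this page.
QUEUE_CONFIG_DEFAULT = 8


def effective_dispatch_limit(stored_limit, env=None, config_default=QUEUE_CONFIG_DEFAULT):
    """Resolve the global dispatch limit for one fabric group.

    stored_limit is the value Studio saved; 0 means "use the runtime default".
    """
    env = os.environ if env is None else env
    if stored_limit and stored_limit > 0:
        return stored_limit  # explicit per-group cap
    raw = env.get("LLM_QUEUE_MAX_BATCH_ITEMS", "").strip()
    if raw:
        return int(raw)  # runtime default (assumed to be an environment variable)
    return config_default  # queue config default


print(effective_dispatch_limit(0, env={"LLM_QUEUE_MAX_BATCH_ITEMS": "32"}))
```

Passing `env` explicitly keeps the function easy to check without touching the real environment.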

Pool Fields

Field	Meaning
Pool ID	Stable queue lane identifier used by producers and the dispatcher.
Display Name	Human-readable label shown in Studio and runtime snapshots.
Endpoint URL	Optional OpenAI-compatible backend URL for this pool. Blank uses the gateway runtime backend.
Weight	DRR scheduling weight. Higher values receive more dispatch credit when pools compete.
Protected Running	Minimum in-flight slots preserved for this pool while it has backlog. Other pools can borrow protected capacity while this pool is idle.
Max Running	Hard per-pool in-flight cap. Blank means the pool has no cap beyond the global dispatch limit.
Max Burst	Maximum requests the pool may launch during one scheduler visit. Blank lets accumulated DRR credit decide.
Sort Order	Display and snapshot ordering. It does not change dispatch share.

Studio calls the DRR quantum field “Weight” because operators usually need the relative share, not the scheduler-internal name.
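As a reading aid, the pool fields can be pictured as one record per pool. The sketch below is not the stored schema: the `PoolConfig` class, its attribute names, and the validation rules are assumptions chosen to mirror the table, with `None` standing in for a blank field.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PoolConfig:
    pool_id: str
    display_name: str
    endpoint_url: Optional[str] = None  # None = use the gateway runtime backend
    weight: int = 1                     # DRR quantum, shown as "Weight" in Studio
    protected_running: int = 0
    max_running: Optional[int] = None   # None = no cap beyond the global limit
    max_burst: Optional[int] = None     # None = accumulated DRR credit decides
    sort_order: int = 0                 # display only, no effect on dispatch share

    def validate(self) -> List[str]:
        """Return a list of problems; the rules here are illustrative assumptions."""
        errors = []
        if not self.pool_id:
            errors.append("pool_id is required")
        if self.weight < 1:
            errors.append("weight must be at least 1")
        if self.protected_running < 0:
            errors.append("protected_running cannot be negative")
        if self.max_running is not None and self.protected_running > self.max_running:
            errors.append("protected_running exceeds max_running")
        if self.max_burst is not None and self.max_burst < 1:
            errors.append("max_burst must be at least 1")
        return errors


backfill = PoolConfig("backfill", "Backfill", weight=1, max_running=2, max_burst=1)
print(backfill.validate())
```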

How Weight Works

Weight is relative. It is not a concurrency limit.

If both pools always have queued work and requests have equal cost:

Pool	Weight	Approximate dispatch share
`interactive`	4	4 of every 5 dispatches (about 80%)
`backfill`	1	1 of every 5 dispatches (about 20%)

The selection pattern is roughly:


interactive
interactive
interactive
interactive
backfill

If a pool is idle, other pools can use the available global capacity. When the idle pool gets backlog again, DRR credit and protected capacity bring it back into the rotation.
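The 4:1 pattern can be reproduced with a textbook Deficit Round Robin loop. This is a minimal sketch of the general algorithm, not the dispatcher's actual code: `drr_order` is a hypothetical helper, every request costs 1 unit, and resetting an idle pool's credit to zero is the standard DRR convention, assumed here.

```python
from collections import deque


def drr_order(queues, weights, rounds):
    """Return pool ids in dispatch order.

    queues maps pool id -> deque of request costs; weights maps pool id -> quantum.
    """
    deficit = {pool: 0 for pool in queues}
    order = []
    for _ in range(rounds):
        for pool, queue in queues.items():
            if not queue:
                deficit[pool] = 0  # idle pools do not bank credit (textbook DRR)
                continue
            deficit[pool] += weights[pool]  # one visit adds one quantum
            # Dispatch while the head request is affordable.
            while queue and queue[0] <= deficit[pool]:
                deficit[pool] -= queue.popleft()
                order.append(pool)
    return order


queues = {"interactive": deque([1] * 8), "backfill": deque([1] * 2)}
print(drr_order(queues, {"interactive": 4, "backfill": 1}, rounds=2))
```

With equal costs, each round yields four `interactive` dispatches followed by one `backfill` dispatch.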

Request Cost

DRR does not treat every request as identical. Each queued request has a cost in scheduler units:

An explicit estimated_cost_units value wins when provided.
Image requests add cost.
Document page count can add cost.
The resulting cost is clamped between 1 and 16.

A pool must accumulate enough credit to dispatch the request at the head of its queue. This keeps large document or image-heavy requests from consuming the same scheduling share as very small text-only requests.
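The cost rules can be sketched as a single function. Only the explicit override and the 1 to 16 clamp come from the text above; the per-image and per-page weights (`COST_PER_IMAGE`, `PAGES_PER_UNIT`), the base cost of 1, and the choice to clamp an explicit estimate as well are assumptions made for illustration.

```python
MIN_COST = 1
MAX_COST = 16
COST_PER_IMAGE = 2   # hypothetical weight; the real value is not documented here
PAGES_PER_UNIT = 4   # hypothetical: every 4 pages add one scheduler unit


def request_cost(estimated_cost_units=None, image_count=0, page_count=0):
    """Estimate a request's cost in scheduler units."""
    if estimated_cost_units is not None:
        raw = estimated_cost_units  # explicit estimate wins
    else:
        raw = 1 + image_count * COST_PER_IMAGE + page_count // PAGES_PER_UNIT
    # Clamp to the documented range (assumed to apply to explicit values too).
    return max(MIN_COST, min(MAX_COST, int(raw)))


print(request_cost(image_count=2))
print(request_cost(page_count=200))
```

Under these assumed weights, a text-only request costs 1 unit and a 200-page document hits the cap of 16.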

Capacity Fields

Use the capacity fields when weight alone is not enough.

Goal	Setting
Keep a pool from being starved	Set Protected Running above 0
Stop a noisy pool from filling all slots	Set Max Running
Avoid one visit launching many small requests	Set Max Burst
Keep default behavior	Leave Max Running and Max Burst blank

Protected capacity is not wasted. If the protected pool has no backlog, other pools may borrow those slots.
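One way to picture how the capacity fields interact is an admission check run before a pool launches a request. This is a sketch under assumptions, not the dispatcher's logic: `can_launch` and the dictionary keys are hypothetical, and Max Burst is left out because it applies per scheduler visit rather than per slot.

```python
def can_launch(pool, pools, global_limit):
    """Decide whether `pool` may start one more request.

    pools maps pool id -> dict with running, backlog, protected_running,
    and max_running (None means no per-pool cap).
    """
    me = pools[pool]
    total_running = sum(p["running"] for p in pools.values())
    if total_running >= global_limit:
        return False  # global dispatch limit reached
    if me["max_running"] is not None and me["running"] >= me["max_running"]:
        return False  # hard per-pool cap reached
    if me["running"] < me["protected_running"]:
        return True  # still inside this pool's own protected floor
    # Slots owed to other pools that have backlog but sit below their floor.
    reserved = sum(
        max(0, p["protected_running"] - p["running"])
        for name, p in pools.items()
        if name != pool and p["backlog"] > 0
    )
    free = global_limit - total_running
    return free > reserved  # borrow only what is not owed to a waiting pool


pools = {
    "interactive": {"running": 0, "backlog": 0, "protected_running": 1, "max_running": None},
    "backfill": {"running": 3, "backlog": 5, "protected_running": 0, "max_running": None},
}
print(can_launch("backfill", pools, global_limit=4))
```

In the example, `backfill` may borrow the last slot because `interactive` has no backlog; once `interactive` has queued work, that slot is held back for it.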

Predefined Pool Profiles

Studio includes an Apply Pool Profile action on the LLM Scheduler page. It previews the selected profile before applying it, then creates or updates only the listed pools and keeps the default pool as the catch-all lane.

Document S/M/L

Use this when producers can classify document work by size before enqueueing.

Pool	Weight	Protected Running	Max Running	Max Burst	Use when
`document-small`	4	1	4	2	Short text-first documents should move quickly.
`document-medium`	2	1	3	1	Normal extraction and classification work needs steady throughput.
`document-large`	1	0	1	1	Large, image-heavy, or expensive documents should be paced.
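On the producer side, size classification might look like the sketch below. The pool IDs come from the profile; the `document_pool` helper, the page thresholds, and the `image_heavy` flag are hypothetical and would need tuning against real documents.

```python
SMALL_MAX_PAGES = 5     # hypothetical threshold
MEDIUM_MAX_PAGES = 30   # hypothetical threshold


def document_pool(page_count, image_heavy=False):
    """Pick a Document S/M/L pool ID before enqueueing."""
    if image_heavy or page_count > MEDIUM_MAX_PAGES:
        return "document-large"
    if page_count > SMALL_MAX_PAGES:
        return "document-medium"
    return "document-small"


print(document_pool(12))
```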

Interactive + Backfill

Use this when operator-facing work should stay responsive while background jobs continue to make progress.

Pool	Weight	Protected Running	Max Running	Max Burst	Use when
`interactive`	4	1	Blank	Blank	Chat, operator actions, or short agent turns need priority.
`backfill`	1	0	2	1	Replay, reprocessing, or non-urgent enrichment should keep draining without filling every slot.

Document Pipeline

Use this when producers know the document-processing stage before enqueueing.

Pool	Weight	Protected Running	Max Running	Max Burst	Use when
`document-extract`	3	1	3	1	Extraction gets steady capacity for higher-cost model calls.
`document-validate`	2	1	2	1	Correction and retry work should not block extraction.
`document-enrich`	1	0	2	1	Optional or downstream prompts should be paced.

Requests that do not match a size-specific, stage-specific, or workload-specific pool should still be sent to default.

The default pool remains useful for unclassified, legacy, manual, or fallback work. It is not a wildcard over every possible pool name: producers should route unmatched work to default before enqueueing.
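Because default is not a wildcard, the fallback has to happen in the producer. A minimal sketch, assuming the producer knows the set of configured pool IDs; `resolve_pool` is a hypothetical helper, not a Marie API.

```python
DEFAULT_POOL = "default"


def resolve_pool(requested, configured_pools):
    """Return the pool to enqueue into, falling back to the default lane."""
    if requested and requested in configured_pools:
        return requested
    return DEFAULT_POOL  # unclassified, legacy, manual, or unknown pool names


configured = {"default", "interactive", "backfill"}
print(resolve_pool("document-small", configured))
```

Here `document-small` is not configured in the group, so the request is routed to `default` before it is enqueued.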

Starter Configurations

Single default pool

Use this when all LLM work can share one lane.

Pool	Weight	Protected Running	Max Running	Max Burst
`default`	1	0	Blank	Blank

Interactive plus backfill