LLM Scheduler

LLM Scheduler controls how queued model work is dispatched inside a Runtime Fabric group. It lets operators keep a catch-all default lane, add workload-specific pools, and tune fairness without changing gateway membership.

The scheduler is scoped to a fabric group. Executors submit LLM work to a pool, the gateway runtime reads the group’s scheduler config, and the dispatcher selects queued requests from the eligible pools.

When To Use Pools

Add pools beyond the default when different workloads share the same Runtime Fabric group but should not compete in a single queue.

| Pool pattern | Example |
| --- | --- |
| Interactive | Chat, operator actions, short agent turns |
| Document work | Extractors, vision-heavy prompts, batch document review |
| Backfill | Reprocessing, replay, non-urgent enrichment |
| Experimental | New provider endpoint, model, or routing policy |

The default pool is the catch-all lane. Untargeted requests should always be able to land there.

Scheduler Fields

Policy

drr uses Deficit Round Robin and is the normal policy for pool-aware dispatch. fifo ignores pool weights and processes a single queue in arrival order.

Global Dispatch Limit

The total number of in-flight requests the dispatcher can run across all pools in the fabric group.

Leaving this field blank in Studio stores 0, which means “use the runtime default.” The runtime default comes from LLM_QUEUE_MAX_BATCH_ITEMS; if that variable is not set, Marie uses the queue config default.

Use an explicit number only when this fabric group needs a tighter global cap than the gateway runtime default.
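The fallback order described above can be sketched as a small resolver. This is only an illustration: the function name, its parameters, and the rule that an explicit value is used as-is are assumptions, and the actual queue config default is not stated here, so it is passed in by the caller.

```python
import os
from typing import Mapping, Optional


def resolve_global_dispatch_limit(
    configured: int,
    queue_config_default: int,
    env: Optional[Mapping[str, str]] = None,
) -> int:
    """Return the effective global in-flight cap for a fabric group.

    configured: value stored by Studio (0 when the field is left blank).
    queue_config_default: hypothetical stand-in for the queue config default.
    """
    env = os.environ if env is None else env
    if configured > 0:
        # Assumption: an explicit number is used as-is for this group.
        return configured
    raw = env.get("LLM_QUEUE_MAX_BATCH_ITEMS", "").strip()
    if raw:
        # The runtime default comes from the environment when it is set.
        return int(raw)
    # Otherwise fall back to the queue config default.
    return queue_config_default
```

With a blank field and no environment variable, the sketch returns whatever queue config default the caller supplies.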

Pool Fields

| Field | Meaning |
| --- | --- |
| Pool ID | Stable queue lane identifier used by producers and the dispatcher. |
| Display Name | Human-readable label shown in Studio and runtime snapshots. |
| Endpoint URL | Optional OpenAI-compatible backend URL for this pool. Blank uses the gateway runtime backend. |
| Weight | DRR scheduling weight. Higher values receive more dispatch credit when pools compete. |
| Protected Running | Minimum in-flight slots preserved for this pool while it has backlog. Idle protected capacity can be borrowed. |
| Max Running | Hard per-pool in-flight cap. Blank means the pool has no cap beyond the global dispatch limit. |
| Max Burst | Maximum requests the pool may launch during one scheduler visit. Blank lets accumulated DRR credit decide. |
| Sort Order | Display and snapshot ordering. It does not change dispatch share. |

Studio calls the DRR quantum field “Weight” because operators usually need the relative share, not the scheduler-internal name.

How Weight Works

Weight is relative. It is not a concurrency limit.

If both pools always have queued work and requests have equal cost:

| Pool | Weight | Approximate dispatch share |
| --- | --- | --- |
| interactive | 4 | 4 requests |
| backfill | 1 | 1 request |

The selection pattern is roughly:

interactive interactive interactive interactive backfill

If a pool is idle, other pools can use the available global capacity. When the idle pool gets backlog again, DRR credit and protected capacity bring it back into the rotation.
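The 4:1 pattern above falls out of textbook Deficit Round Robin when every request costs one unit. The following is a minimal simulation of that textbook algorithm, not the gateway's dispatcher; `drr_order` and its arguments are hypothetical names, and it ignores the capacity fields entirely.

```python
from collections import deque


def drr_order(weights, queues, rounds):
    """Simulate textbook Deficit Round Robin and return the dispatch order.

    weights: pool id -> credit (quantum) added on each visit
    queues:  pool id -> list of request costs, head of queue first
    rounds:  number of full passes over the pools
    """
    pending = {pool: deque(costs) for pool, costs in queues.items()}
    deficit = {pool: 0 for pool in weights}
    order = []
    for _ in range(rounds):
        for pool, quantum in weights.items():
            queue = pending[pool]
            if not queue:
                deficit[pool] = 0  # an idle pool does not bank credit
                continue
            deficit[pool] += quantum
            # Dispatch while the head request fits in the accumulated credit.
            while queue and queue[0] <= deficit[pool]:
                deficit[pool] -= queue.popleft()
                order.append(pool)
            if not queue:
                deficit[pool] = 0
    return order


weights = {"interactive": 4, "backfill": 1}
queues = {"interactive": [1, 1, 1, 1], "backfill": [1]}
print(" ".join(drr_order(weights, queues, rounds=1)))
# prints: interactive interactive interactive interactive backfill
```

Because credit is reset when a queue empties, an idle pool does not build up a backlog of credit in this sketch; it simply rejoins the rotation on its next visit.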

Request Cost

DRR does not treat every request as identical. Each queued request has a cost in scheduler units:

  • explicit estimated_cost_units wins when provided
  • image requests add cost
  • document page count can add cost
  • cost is clamped between 1 and 16

A pool must accumulate enough credit to dispatch the request at the head of its queue. This keeps large document or image-heavy requests from consuming the same scheduling share as very small text-only requests.
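A cost function following the rules listed above might look like the sketch below. The per-image and per-page increments are hypothetical placeholders (the actual values are not given here), and clamping an explicit estimate is also an assumption; only the precedence of `estimated_cost_units` and the 1 to 16 clamp come from the rules above.

```python
from typing import Optional

MIN_COST = 1
MAX_COST = 16


def estimate_cost_units(
    estimated_cost_units: Optional[int] = None,
    image_count: int = 0,
    page_count: int = 0,
    units_per_image: int = 2,  # hypothetical increment per image
    pages_per_unit: int = 5,   # hypothetical: one extra unit per 5 pages
) -> int:
    """Return a request's cost in scheduler units, clamped to [1, 16]."""
    if estimated_cost_units is not None:
        # An explicit estimate wins over any derived value.
        cost = estimated_cost_units
    else:
        cost = 1 + image_count * units_per_image + page_count // pages_per_unit
    # Assumption: the clamp applies to explicit estimates as well.
    return max(MIN_COST, min(MAX_COST, cost))
```

Under these placeholder increments, a request with two images and ten pages costs 7 units. In textbook DRR, a pool with weight 2 that starts with no credit would need four visits before it could dispatch that request.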

Capacity Fields

Use the capacity fields when weight alone is not enough.

| Goal | Setting |
| --- | --- |
| Keep a pool from being starved | Set Protected Running above 0 |
| Stop a noisy pool from filling all slots | Set Max Running |
| Avoid one visit launching many small requests | Set Max Burst |
| Keep default behavior | Leave Max Running and Max Burst blank |

Protected capacity is not wasted. If the protected pool has no backlog, other pools may borrow those slots.
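One way to picture how Protected Running, Max Running, and borrowing interact is an admission check like the one below. It is a sketch of the described behavior under simplifying assumptions, not the dispatcher's actual logic; `PoolState` and `can_launch` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class PoolState:
    protected_running: int = 0
    max_running: Optional[int] = None  # None models a blank Max Running
    running: int = 0                   # requests currently in flight
    backlog: int = 0                   # requests waiting in the queue


def can_launch(pool_id: str, pools: Dict[str, PoolState], global_limit: int) -> bool:
    """Decide whether pool_id may start one more request right now."""
    pool = pools[pool_id]
    if pool.backlog == 0:
        return False
    if pool.max_running is not None and pool.running >= pool.max_running:
        return False  # hard per-pool cap
    free = global_limit - sum(p.running for p in pools.values())
    if free <= 0:
        return False  # global dispatch limit reached
    if pool.running < pool.protected_running:
        return True   # still inside its own protected share
    # Slots owed to other pools that have backlog but sit below their
    # protected minimum. Idle pools reserve nothing, so their share can be borrowed.
    reserved = sum(
        max(0, p.protected_running - p.running)
        for name, p in pools.items()
        if name != pool_id and p.backlog > 0
    )
    return free > reserved
```

In this model, a backfill pool can take the last free slot while interactive is idle, but loses access to it as soon as interactive has backlog and is below its protected minimum.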

Predefined Pool Profiles

Studio includes an Apply Pool Profile action on the LLM Scheduler page. It previews the selected profile before applying it, then creates or updates only the listed pools and keeps the default pool as the catch-all lane.
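The preview-then-apply behavior can be thought of as a merge that touches only the pools a profile lists. The sketch below illustrates that idea with plain dictionaries; the field keys, function names, and data shape are assumptions for illustration, not Studio's API. The profile values are taken from the Interactive + Backfill profile, with `None` standing in for Blank.

```python
from typing import Dict, List, Optional

Settings = Dict[str, Optional[int]]

INTERACTIVE_BACKFILL: Dict[str, Settings] = {
    "interactive": {"weight": 4, "protected_running": 1, "max_running": None, "max_burst": None},
    "backfill": {"weight": 1, "protected_running": 0, "max_running": 2, "max_burst": 1},
}
DEFAULT_POOL: Settings = {"weight": 1, "protected_running": 0, "max_running": None, "max_burst": None}


def preview_profile(current: Dict[str, Settings], profile: Dict[str, Settings]) -> Dict[str, List[str]]:
    """List which pools would be created or updated, without changing anything."""
    create = sorted(p for p in profile if p not in current)
    update = sorted(
        p for p in profile
        if p in current and any(current[p].get(k) != v for k, v in profile[p].items())
    )
    return {"create": create, "update": update}


def apply_profile(current: Dict[str, Settings], profile: Dict[str, Settings]) -> Dict[str, Settings]:
    """Return a new config with the profile's pools created or updated."""
    merged = {pool_id: dict(settings) for pool_id, settings in current.items()}
    for pool_id, settings in profile.items():
        merged.setdefault(pool_id, {}).update(settings)
    # Pools not listed in the profile are untouched; default stays as the catch-all.
    merged.setdefault("default", dict(DEFAULT_POOL))
    return merged
```

Returning a new dictionary rather than mutating the input keeps the preview and apply steps independent, which mirrors the preview-first flow described above.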

Document S/M/L

Use this when producers can classify document work by size before enqueueing.

| Pool | Weight | Protected Running | Max Running | Max Burst | Use when |
| --- | --- | --- | --- | --- | --- |
| document-small | 4 | 1 | 4 | 2 | Short text-first documents should move quickly. |
| document-medium | 2 | 1 | 3 | 1 | Normal extraction and classification work needs steady throughput. |
| document-large | 1 | 0 | 1 | 1 | Large, image-heavy, or expensive documents should be paced. |
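On the producer side, choosing one of these pools before enqueueing could be as simple as the classifier below. The thresholds are hypothetical and would need tuning against real workloads; only the pool IDs come from the profile.

```python
def document_pool_for(page_count: int, image_count: int = 0) -> str:
    """Pick a Document S/M/L pool ID from rough size signals.

    The page and image thresholds are illustrative assumptions.
    """
    if page_count > 50 or image_count > 10:
        return "document-large"
    if page_count > 10 or image_count > 0:
        return "document-medium"
    return "document-small"
```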

Interactive + Backfill

Use this when operator-facing work should stay responsive while background jobs continue to make progress.

| Pool | Weight | Protected Running | Max Running | Max Burst | Use when |
| --- | --- | --- | --- | --- | --- |
| interactive | 4 | 1 | Blank | Blank | Chat, operator actions, or short agent turns need priority. |
| backfill | 1 | 0 | 2 | 1 | Replay, reprocessing, or non-urgent enrichment should keep draining without filling every slot. |

Document Pipeline

Use this when producers know the document-processing stage before enqueueing.

| Pool | Weight | Protected Running | Max Running | Max Burst | Use when |
| --- | --- | --- | --- | --- | --- |
| document-extract | 3 | 1 | 3 | 1 | Extraction gets steady capacity for higher-cost model calls. |
| document-validate | 2 | 1 | 2 | 1 | Correction and retry work should not block extraction. |
| document-enrich | 1 | 0 | 2 | 1 | Optional or downstream prompts should be paced. |

Requests without a size-specific, stage-specific, or workload-specific pool should still flow into default.

The default pool remains useful for unclassified, legacy, manual, or fallback work. It is not a wildcard over every possible pool name: producers should route unmatched work to default before enqueueing.
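Since default is not a wildcard, the fallback has to happen on the producer side. A minimal sketch of that routing step, assuming the producer knows which pool IDs are configured for the fabric group (`resolve_pool` is a hypothetical helper):

```python
from typing import Iterable, Optional

DEFAULT_POOL_ID = "default"


def resolve_pool(requested: Optional[str], configured_pools: Iterable[str]) -> str:
    """Return the pool to enqueue into, falling back to the default lane."""
    known = set(configured_pools)
    if requested and requested in known:
        return requested
    # Untargeted or unmatched work goes to the catch-all lane.
    return DEFAULT_POOL_ID
```

For example, a request tagged with a pool name that is not configured would be enqueued into default rather than into a lane the dispatcher never visits.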

Starter Configurations

Single default pool

Use this when all LLM work can share one lane.

| Pool | Weight | Protected Running | Max Running | Max Burst |
| --- | --- | --- | --- | --- |
| default | 1 | 0 | Blank | Blank |

Interactive plus backfill

Use this when operator-facing work should stay responsive while background jobs continue to make progress.

| Pool | Weight | Protected Running | Max Running | Max Burst |
| --- | --- | --- | --- | --- |
| interactive | 4 | 1 | Blank | Blank |
| backfill | 1 | 0 | 2 | 1 |
| default | 1 | 0 | Blank | Blank |

Document extraction lane

Use this when document or image-heavy requests have higher scheduler cost.

| Pool | Weight | Protected Running | Max Running | Max Burst |
| --- | --- | --- | --- | --- |
| document-extract | 2 | 1 | 3 | 1 |
| interactive | 4 | 1 | Blank | Blank |
| default | 1 | 0 | Blank | Blank |

Operational Notes

  • Keep the default pool enabled as the catch-all lane.
  • Increase a pool's weight to raise its relative share; weight does not limit concurrency.
  • Use protected running for service guarantees.
  • Use max running to cap noisy pools.
  • Use endpoint URL only when a pool should use a different backend from the gateway default.
  • Watch the LLM Dispatch Runtime page to confirm queue depth, in-flight work, latency, and failures.