Nimbus: Burst-Resilient Hybrid Inference for LLMs
LLM traffic spikes force teams to either overprovision GPUs or miss SLOs. Nimbus predicts time-to-first-token (TTFT) from queue state and uses a budget-aware (knapsack-style) policy to decide which requests to offload to external APIs; a sketch of this decision loop follows the feature list below.
- Queue-state TTFT predictor for admission/offload decisions.
- Cost/SLO-aware routing over a compute budget.
- Deployed in production settings with 25% cost savings under bursty traffic.
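For intuition, here is a minimal sketch of how a queue-state TTFT predictor and a budget-aware (knapsack-style) offload policy can fit together. Everything here is an assumption for illustration, not Nimbus's actual implementation: the feature set, the coefficients, and the names `Request`, `QueueState`, `predict_ttft_ms`, and `choose_offloads` are all made up.

```python
"""Illustrative sketch only: features, coefficients, and function names are
assumptions, not the published Nimbus predictor or policy."""
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    req_id: str
    prompt_tokens: int        # tokens to prefill if served locally
    slo_ttft_ms: float        # per-request TTFT target
    api_cost_usd: float       # cost to serve this request via the external API


@dataclass
class QueueState:
    queued_tokens: int           # prefill tokens waiting ahead of this request
    active_decode_slots: int     # requests currently decoding on the GPU
    gpu_prefill_tok_per_ms: float


def predict_ttft_ms(req: Request, q: QueueState) -> float:
    """Toy queue-state TTFT model: time to drain the queue plus this request's
    own prefill, slowed by contention from in-flight decodes (coefficients invented)."""
    decode_slowdown = 1.0 + 0.05 * q.active_decode_slots
    drain_ms = q.queued_tokens / q.gpu_prefill_tok_per_ms
    own_ms = req.prompt_tokens / q.gpu_prefill_tok_per_ms
    return (drain_ms + own_ms) * decode_slowdown


def choose_offloads(reqs: List[Request], q: QueueState, budget_usd: float) -> List[str]:
    """Budget-aware, knapsack-style selection: among requests predicted to miss
    their TTFT SLO locally, greedily offload those with the largest predicted
    violation per dollar until the API budget is spent."""
    at_risk = [(r, predict_ttft_ms(r, q) - r.slo_ttft_ms) for r in reqs]
    at_risk = [(r, miss) for r, miss in at_risk if miss > 0]
    # Greedy approximation to 0/1 knapsack: rank by "violation avoided per dollar".
    at_risk.sort(key=lambda rm: rm[1] / rm[0].api_cost_usd, reverse=True)
    offloaded, spent = [], 0.0
    for r, _ in at_risk:
        if spent + r.api_cost_usd <= budget_usd:
            offloaded.append(r.req_id)
            spent += r.api_cost_usd
    return offloaded
```

The greedy value-per-cost ordering is one common stand-in for an exact knapsack solve when routing decisions must be made per scheduling tick; whether Nimbus uses a greedy or exact formulation is not specified here.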
