Nimbus: Burst-Resilient Hybrid Inference for LLMs

LLM traffic spikes force teams to either overprovision GPUs or miss SLOs. Nimbus predicts time-to-first-token (TTFT) from queue state and uses a budget-aware (knapsack-style) policy to decide which requests to offload to external APIs.
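
A minimal sketch of what such a budget-aware offload decision could look like, using a greedy value-per-cost heuristic as a knapsack approximation; the names, fields, and scoring rule below are illustrative assumptions, not Nimbus's actual policy:

    from dataclasses import dataclass

    @dataclass
    class Request:
        req_id: str
        predicted_ttft_ms: float  # predicted local TTFT if kept on the GPU
        slo_ms: float             # per-request TTFT SLO
        api_cost: float           # cost of serving this request via an API

    def select_offloads(requests: list[Request], api_budget: float) -> list[Request]:
        # Consider only requests predicted to miss their SLO locally.
        candidates = [r for r in requests if r.predicted_ttft_ms > r.slo_ms]
        # Greedy knapsack proxy: prioritize the largest predicted SLO
        # violation avoided per unit of API spend.
        candidates.sort(key=lambda r: (r.predicted_ttft_ms - r.slo_ms) / r.api_cost,
                        reverse=True)
        offloaded, spent = [], 0.0
        for r in candidates:
            if spent + r.api_cost <= api_budget:
                offloaded.append(r)
                spent += r.api_cost
        return offloaded

A greedy ratio rule like this is a standard knapsack approximation and keeps the per-scheduling-tick decision cheap compared to an exact solver.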

  • Queue-state TTFT predictor for admission/offload decisions (sketched after this list).
  • Cost/SLO-aware routing over a compute budget.
  • Deployed in production settings, yielding 25% cost savings under bursty traffic.
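
As one illustration of the queue-state TTFT estimate referenced above, a simple backlog model could look like the following; the inputs and the default overhead constant are assumptions for exposition, not the deployed predictor:

    def predict_ttft_ms(queued_prefill_tokens: int,
                        running_batch_tokens: int,
                        prefill_tokens_per_ms: float,
                        scheduling_overhead_ms: float = 5.0) -> float:
        # Tokens that must be prefilled before this request's first token,
        # divided by measured prefill throughput, plus a fixed scheduling
        # overhead (illustrative constant).
        backlog = queued_prefill_tokens + running_batch_tokens
        return backlog / prefill_tokens_per_ms + scheduling_overhead_ms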