Adaptive Computation in the LLM Era: A Unified Survey of Routing, Cascades, and Test-Time Scaling

John Cheung

doi:10.33774/coe-2026-6fbrv

Large language model deployments increasingly allocate computation at inference time rather than applying a single fixed model and decoding policy to every input. The resulting design problem is not only which model is best on average, but which computational action should be taken for a particular query, partial answer, reasoning step, or token under a budget. Research on this problem is fragmented across model routing, confidence-gated cascades, selective prediction, test-time scaling, verifier-guided search, speculative decoding, and token- or layer-level architectural adaptivity. This survey unifies these strands as adaptive computation: budgeted sequential decision-making over computational actions on a quality-cost frontier. I provide a structured review protocol, a taxonomy by allocation granularity and decision signal, a formal mapping of routing and cascades to special cases of sequential decision-making, and evaluation conventions for reporting tokens, dollars, FLOPs, latency, and decision overhead. A normalized audit of 15 representative systems and method families indicates that adaptive policies are most credible when the decision signal is substantially cheaper than the action it avoids and is calibrated near the deployment threshold. The audit also shows why many headline savings are not directly comparable: router calls, verifier calls, draft-model FLOPs, rejected samples, price snapshots, and queueing latency are often treated inconsistently. I close with open problems in step-level deferral with guarantees, calibration under distribution shift, effort prediction for reasoning models, routing over models and inference configurations, and inference-compute economics.
