Adaptive Computation in the LLM Era: A Unified Survey of Routing, Cascades, and Test-Time Scaling

21 June 2026, Version 1
This content is an early or alternative research output and has not been peer-reviewed by Cambridge University Press at the time of posting.

Abstract

Large language model deployments increasingly allocate computation at inference time rather than applying a single fixed model and decoding policy to every input. The resulting design problem is not only which model is best on average, but which computational action should be taken for a particular query, partial answer, reasoning step, or token under a budget. Research on this problem is fragmented across model routing, confidence-gated cascades, selective prediction, test-time scaling, verifier-guided search, speculative decoding, and token- or layer-level architectural adaptivity. This survey unifies these strands as adaptive computation: budgeted sequential decision-making over computational actions on a quality-cost frontier. I provide a structured review protocol, a taxonomy by allocation granularity and decision signal, a formal mapping of routing and cascades to special cases of sequential decision-making, and evaluation conventions for reporting tokens, dollars, FLOPs, latency, and decision overhead. A normalized audit of 15 representative systems and method families indicates that adaptive policies are most credible when the decision signal is substantially cheaper than the action it avoids and is calibrated near the deployment threshold. The audit also shows why many headline savings are not directly comparable: router calls, verifier calls, draft-model FLOPs, rejected samples, price snapshots, and queueing latency are often treated inconsistently. I close with open problems in step-level deferral with guarantees, calibration under distribution shift, effort prediction for reasoning models, routing over models and inference configurations, and inference-compute economics.

Keywords

adaptive computation
model routing
LLM cascades
test-time scaling
inference-time compute
selective prediction
speculative decoding
cost accounting
efficiency

Supplementary materials

Coding sheet — supplementary material
The supplementary coding sheet referenced in Section 2 and the audit in Section 9.3 / Appendix A. It operationalizes the review protocol and coding schema in Table 2 and records the per-entry coding behind Table 6.
