Aurora

When RL Meets Adaptive Speculative Training: A Unified Training-Serving System

Together AI · University of Sydney · University of Illinois Urbana-Champaign
Aurora system in action

Aurora in Action: Real-time demonstration of the unified training-serving system

Aurora training-inference framework

Aurora Training-Inference Framework: Online speculative training with asynchronous RL-style updates. A production inference server runs speculative decoding with a fixed target (verifier) and a lightweight draft (speculator), where proposed tokens are accepted or rejected during verification.

1.48× speedup starting from a static speculator
1.33× day-0 speedup from scratch
1.25× over a well-trained static baseline
Day-0 deployment ready

Abstract

Speculative decoding can significantly accelerate LLM serving, but its real-world benefits often erode due to training-serving mismatch and non-stationary traffic. Unlike previous systems that decouple speculator training from inference, we present a unified training-serving system, Aurora, that closes this loop by continuously learning a speculator model directly from live inference traces. Our design integrates an SGLang-based inference server with an asynchronous training server connected via efficient GPU-to-GPU RPC, enabling hot-swapped speculator updates without service interruption.

Crucially, our system supports day-0 deployment: a speculator can be served immediately and adapted quickly on live traffic, improving overall system throughput. This paradigm shift enables us to frame the training-serving loop as an asynchronous reinforcement learning process and to leverage the speculator's rejected tokens for better sampling efficiency. Our experiments show that the unified system achieves a 1.33× speedup in the mixed-data scenario when starting from a randomly initialized speculator, and a 1.48× speedup when starting from a pretrained static speculator. The system also adapts more effectively to distribution shifts in user traffic, delivering a 1.25× speedup over a well-trained but static speculator.

Motivation

Most speculative decoding deployments follow a two-stage pipeline: offline training and separate serving. This separation introduces systematic challenges that limit real-world performance:

1. Training-Serving Mismatch Gap

Offline drafter training optimizes acceptance in a controlled setting, but production speedups depend on deployment stack details—kernel implementations, numeric precision (FP8/FP4), batching, and scheduling. A drafter that appears strong offline may realize materially lower acceptance once integrated into production.

2. Moving-Target Gap from Verifier Drift

Target models are updated frequently for quality, safety, cost, or hardware migration. However, the drafter is typically refreshed on a slower cadence due to retraining cost and pipeline complexity. This creates significant staleness and degrades speculative performance.

3. High Infrastructure Cost

Off-policy distillation pipelines often require collecting large volumes of target-model activations or signals. At production scale, the storage footprint can reach petabyte scale, incurring heavy memory and bandwidth costs and significant operational complexity.

System Architecture

Aurora consists of two primary, decoupled components:

Inference Server

Runs an SGLang speculative decoding engine with a target model and a draft model. For each request, the draft model proposes a sequence of tokens, which are then verified in parallel by the target model. The outcomes—both accepted and rejected tokens—are streamed to a distributed data buffer.
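
As a concrete illustration of what such a trace might look like, here is a minimal sketch; the record fields, `SpecTraceRecord`, and `stream_outcome` are assumptions for exposition, not Aurora's actual schema.

```python
from dataclasses import dataclass
from queue import Queue


@dataclass
class SpecTraceRecord:
    """One verification outcome from the serving path (fields are illustrative)."""
    request_id: str
    context_ids: list[int]     # tokens preceding this speculative step
    draft_ids: list[int]       # tokens proposed by the draft model
    accepted_len: int          # length of the verifier-accepted prefix
    target_logits_ref: str     # handle to the verifier's logits for this span


def stream_outcome(buffer: Queue, record: SpecTraceRecord) -> None:
    # Fire-and-forget push: the serving path never blocks on training I/O.
    buffer.put_nowait(record)
```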

Training Server

Operates asynchronously, fetching batches of training data from the data buffer and performing gradient updates on a copy of the draft model. Once a new, improved speculator is ready, its weights are asynchronously pushed back to the Inference Server via hot-swap, without downtime or service interruption.
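
A minimal sketch of this loop, assuming a shared queue of trace records; `push_weights` stands in for the GPU-to-GPU RPC hot-swap and `compute_speculator_loss` for the objective described under Method, both of which are illustrative names rather than the system's actual API.

```python
import queue


def training_loop(buffer: queue.Queue, draft_model, optimizer,
                  push_weights, batch_size: int = 256,
                  swap_every: int = 100) -> None:
    """Asynchronous speculator training loop (illustrative sketch only)."""
    step = 0
    while True:
        # Block until enough live traces have accumulated in the buffer.
        batch = [buffer.get() for _ in range(batch_size)]
        loss = compute_speculator_loss(draft_model, batch)  # assumed helper
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        step += 1
        if step % swap_every == 0:
            # Ship updated weights to the inference server without pausing it.
            push_weights(draft_model.state_dict())
```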

This decoupled, asynchronous architecture allows the training process to run continuously on dedicated resources while the inference process remains highly responsive and optimized for low-latency serving.

Aurora system architecture

Method

Online Speculator Training as Asynchronous RL

We view online speculative decoding as an asynchronous reinforcement learning system. The draft model acts as the policy π, and the target model together with its verifier implements the environment. Each speculative step forms a short episode: the policy proposes a tree of candidate continuations, the verifier accepts a prefix, and the outcome provides structured feedback. Accepted tokens correspond to positive reward, while rejected tokens provide zero or negative reward, as the sketch below illustrates.
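
A minimal sketch of this per-episode reward assignment; the specific reward values are assumptions for exposition, not Aurora's exact shaping.

```python
def episode_rewards(draft_ids: list[int], accepted_len: int) -> list[float]:
    """Per-token rewards for one speculative episode (illustrative values).

    Accepted prefix tokens earn +1, the first rejected token is penalized,
    and tokens beyond the rejection point were never verified, so they
    receive no signal.
    """
    rewards = [1.0] * accepted_len          # verifier-approved prefix
    if accepted_len < len(draft_ids):
        rewards.append(-1.0)                # the token the verifier rejected
        rewards += [0.0] * (len(draft_ids) - accepted_len - 1)
    return rewards
```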

Learning from Acceptance and Rejection

Verification yields richer supervision than acceptance-only imitation. We train the draft model with two complementary signals:

  • Acceptance loss (imitation): Cross-entropy on accepted tokens, encouraging the draft to reproduce verifier-approved continuations.
  • Rejection loss (counterfactual feedback): Rejected branches specify what the policy should not propose. With Discard Sampling, we apply a KL-based objective that pushes probability mass away from incorrect predictions.

The total loss is a weighted combination:

$$\mathcal{L} \;=\; \mathbb{E}_{x \sim p_{\text{accept}}}\!\left[\mathrm{KL}\!\left(p_{\text{target}} \,\|\, p_{\text{draft}}\right)\right] \;+\; \lambda_{\text{discard}}\, \mathbb{E}_{x \sim p_{\text{discard}}}\!\left[\mathrm{KL}\!\left(p_{\text{target}} \,\|\, p_{\text{draft}}\right)\right]$$
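
A minimal PyTorch rendering of this objective, assuming per-position logits and an acceptance mask have already been extracted from the trace buffer; `lam_discard` and the masking details are assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F


def speculator_loss(draft_logits: torch.Tensor,
                    target_logits: torch.Tensor,
                    accept_mask: torch.Tensor,
                    lam_discard: float = 0.1) -> torch.Tensor:
    """Acceptance + discard loss from the formula above (illustrative sketch).

    draft_logits, target_logits: [N, vocab] logits at supervised positions.
    accept_mask: [N] bool tensor, True where the verifier accepted the token.
    Assumes both accepted and rejected positions are present in the batch.
    """
    log_p_draft = F.log_softmax(draft_logits, dim=-1)
    p_target = F.softmax(target_logits, dim=-1)
    # Per-position KL(p_target || p_draft), summed over the vocabulary.
    kl = F.kl_div(log_p_draft, p_target, reduction="none").sum(dim=-1)
    accept_term = kl[accept_mask].mean()
    discard_term = kl[~accept_mask].mean()
    return accept_term + lam_discard * discard_term
```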

Efficient Tree Attention

We employ a specialized Tree Attention mechanism to efficiently process the complex branching structure of speculative decoding results. By constructing a custom attention mask that respects the causal structure of the speculative tree, we can process all accepted and rejected branches in a single batched forward and backward pass.
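
One plausible construction of such a mask, assuming the tree is flattened into parent pointers; the representation and function name are illustrative, not Aurora's kernel.

```python
import torch


def tree_attention_mask(parents: list[int]) -> torch.Tensor:
    """Build a causal attention mask for a speculative token tree (sketch).

    parents[i] is the index of node i's parent in the flattened tree, or -1
    for a root attached to the shared prefix. Node i may attend to itself
    and to every ancestor, so sibling branches stay isolated while all
    branches share one batched forward/backward pass.
    """
    n = len(parents)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:          # walk up to the root, marking ancestors
            mask[i, j] = True
            j = parents[j]
    return mask
```

For example, `parents = [-1, 0, 0]` encodes two sibling branches off one root: nodes 1 and 2 each attend to node 0 and to themselves, but not to each other. In practice this boolean block would be combined with full causal visibility over the shared prefix before being handed to a masked attention kernel.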

Tree attention mechanism

Tree Attention mechanism for efficient processing of speculative decoding branches

Experimental Results

Day-0 Deployment: Training from Scratch

Aurora enables day-0 serving: a completely untrained speculator can be deployed immediately and become production-ready through online adaptation alone. In mixed traffic scenarios, acceptance length reaches competitive levels rapidly.

From scratch acceptance length

Acceptance length improves rapidly from random initialization

From scratch throughput

Throughput stabilizes at 302.3 tokens/s

Adaptation to Distribution Shift

When requests are grouped by domain to induce abrupt distribution changes, Aurora adapts continuously. The system recovers acceptance length within approximately 10,000 requests after each shift.

Domain shift acceptance

Aurora adapts to domain shifts, recovering performance

Domain shift throughput

Throughput maintains competitiveness despite domain shifts

Training with an Existing Speculator

Starting from a trained speculator, Aurora achieves 1.48× speedup over the static baseline through continuous adaptation.

Trained spec acceptance

Pre-trained speculator under domain shift

Trained spec throughput

1.48× speedup over static speculator

Batch size ablation acceptance

Batch size ablation: acceptance length

Batch size ablation throughput

Batch size ablation: throughput

Coding Domain Performance

Aurora also demonstrates strong performance on coding-specific workloads across different model families.

Coding benchmark Qwen

Coding benchmark results (Qwen3-8B)

Coding benchmark LLaMA

Coding benchmark results (LLaMA 3.1-8B)

Additional results

Coding domain acceptance trend

Why Aurora?

Traditional speculative decoding suffers from training-serving mismatch. Aurora solves this with a unified, continuously adaptive system.

• Closes the Training-Serving Loop: Online data collection and a domain-controlled buffer substantially reduce the distribution shift between training and serving.
• Day-0 Deployment: Deploy a cold-start speculator immediately and begin collecting on-policy serving data, adapting in situ from day one.
• Reduced Memory & Cost: Learning from live traces avoids collecting and storing large volumes of offline target-model signals, cutting storage, bandwidth, and operational overhead.
• Algorithm Agnostic: Interfaces cleanly with a broad family of speculative decoding variants, making the unified training-serving loop a drop-in enhancement.
• Asynchronous Updates: Speculator weights are hot-swapped via efficient GPU-to-GPU RPC, with no service interruption.
• Open Source: Built on the SGLang inference engine, with the full implementation available to the research community.

Key Finding: Online training from scratch can exceed the performance of a carefully pretrained speculator, fundamentally challenging the conventional wisdom that speculative decoding requires extensive offline pretraining.

Conclusion

We presented Aurora, a unified training-serving system that reframes speculative decoding as a joint learning-and-serving problem. By connecting an SGLang-based inference server with an asynchronous training server via GPU-aware RPC, Aurora enables continuous on-policy adaptation of the draft model under live traffic, closing the training-serving mismatch that limits conventional two-stage pipelines.

Our experiments show that simple online fine-tuning captures most attainable gains, that lazy synchronization best balances adaptation speed with serving stability, and that day-0 deployment from scratch is practical—an untrained speculator reaches competitive acceptance rates within thousands of requests, eliminating the offline pretraining bottleneck for onboarding new models.

BibTeX

@inproceedings{aurora2026,
  title={When RL Meets Adaptive Speculative Training: A Unified Training-Serving System},
  author={Bie, Fengxiang and Wang, Junxiong and Li, Jisen and Zhou, Zhongzhu and Xu, Chenfeng and Shao, Zelei and Wang, Yubo and Liu, Yinghui and Wu, Qingyang and May, Avner and Athiwaratkun, Ben and Zhang, Yineng and Song, Shuaiwen Leon and Wu, Xiaoxia},
  booktitle={International Conference on Machine Learning (ICML)},
  year={2026},
  note={Under review},
  url={https://aurora-spec.github.io}
}