Engineering Blog
Achieving 1,200+ Tokens/Sec: Optimising Inference Pipelines
Deep dive into our inference optimisation techniques that deliver consistent sub-100ms latency at scale.
The Inference Challenge
Serving large language models at scale requires balancing three competing objectives:
- Throughput: Tokens generated per second across all requests
- Latency: Time to first token and inter-token delay
- Cost: Compute resources per token
Optimising one often degrades another. This post details how we achieve 1,200+ tokens/sec while maintaining sub-100ms latency.
Architecture Overview
Our inference pipeline consists of:
- Request Router: Intelligent load balancing with latency-aware scheduling
- Batch Aggregator: Dynamic batching with deadline-aware grouping
- Inference Engine: Optimised model execution with continuous batching
- KV Cache Manager: Efficient attention state management
- Response Streamer: Token-by-token delivery with minimal overhead
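To make the flow concrete, here is a minimal sketch of a request passing through these five stages. Every name below (InferenceRequest, aggregator, engine, streamer, current_latency_ms) is an illustrative stand-in, not our actual serving API.

```python
import time
from dataclasses import dataclass, field

# Illustrative stand-ins only: the replica, aggregator, engine and streamer
# interfaces assumed below are sketch-level, not our serving API.
@dataclass
class InferenceRequest:
    request_id: str
    prompt: str
    max_tokens: int
    deadline_ms: float = 100.0                      # latency budget used by the scheduler
    arrival: float = field(default_factory=time.monotonic)

def serve(request: InferenceRequest, replicas, streamer) -> None:
    replica = min(replicas, key=lambda r: r.current_latency_ms)  # 1. latency-aware routing
    batch = replica.aggregator.add(request)                      # 2. deadline-aware batching
    for token in replica.engine.generate(batch):                 # 3/4. continuous batching over the KV cache
        streamer.send(request.request_id, token)                 # 5. stream each token as it is produced
```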
Key Optimisations
1. Continuous Batching
Traditional batching waits for a batch to fill before processing. Continuous batching:
- Processes requests as they arrive
- Adds new requests to in-flight batches
- Removes completed requests immediately
Result: 40% higher throughput with lower average latency.
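The scheduling loop behind this is simple to sketch. The version below assumes an engine whose step() call advances every in-flight request by one token and returns a mapping of request id to new token; the class and attribute names are illustrative, not our implementation.

```python
import collections

# Sketch of a continuous-batching scheduler. Assumes engine.step() advances every
# in-flight request by one token and returns {request id: token}.
class ContinuousBatcher:
    def __init__(self, engine, max_batch_size: int = 32):
        self.engine = engine
        self.max_batch_size = max_batch_size
        self.queue = collections.deque()    # waiting requests
        self.in_flight = {}                 # request id -> request

    def submit(self, request) -> None:
        self.queue.append(request)

    def step(self) -> None:
        # Admit arrivals into the running batch -- no waiting for a batch to fill.
        while self.queue and len(self.in_flight) < self.max_batch_size:
            request = self.queue.popleft()
            self.in_flight[request.id] = request
        if not self.in_flight:
            return
        # One decode iteration for every in-flight request.
        for req_id, token in self.engine.step(list(self.in_flight.values())).items():
            request = self.in_flight[req_id]
            request.output.append(token)
            # Completed requests leave immediately, freeing their slot for the next arrival.
            if token == request.eos_token or len(request.output) >= request.max_tokens:
                del self.in_flight[req_id]
```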
2. PagedAttention for KV Cache
Attention key-value caches grow linearly with sequence length. PagedAttention:
- Allocates memory in fixed-size blocks
- Enables memory sharing across requests
- Reduces memory fragmentation
Result: 2-4× more concurrent requests per GPU.
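A simplified view of the bookkeeping, ignoring the GPU tensors themselves, looks like this; the block size and the class name PagedKVCache are illustrative rather than taken from any specific engine.

```python
# Sketch of block-level KV cache bookkeeping in the spirit of PagedAttention.
# The actual GPU tensors are omitted; block size and class name are illustrative.
class PagedKVCache:
    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))      # indices into a pre-allocated pool
        self.block_tables: dict[str, list[int]] = {}    # request id -> ordered block ids
        self.lengths: dict[str, int] = {}               # request id -> tokens stored

    def reserve(self, request_id: str, num_tokens: int) -> list[int]:
        """Allocate just enough fixed-size blocks to hold num_tokens more tokens."""
        table = self.block_tables.setdefault(request_id, [])
        total = self.lengths.get(request_id, 0) + num_tokens
        needed = -(-total // self.block_size) - len(table)       # ceiling division
        if needed > len(self.free_blocks):
            raise MemoryError("KV cache exhausted; caller should queue or preempt")
        table.extend(self.free_blocks.pop() for _ in range(needed))
        self.lengths[request_id] = total
        return table

    def release(self, request_id: str) -> None:
        """Return a finished request's blocks to the free pool for immediate reuse."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)
```

Because blocks are fixed-size and returned to a shared pool the moment a request finishes, fragmentation stays low and identical blocks can be shared between requests.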
3. Speculative Decoding
For predictable text patterns, speculative decoding:
- Drafts multiple candidate tokens cheaply (e.g. with a smaller draft model)
- Verifies candidates with the full model
- Accepts valid candidates, discards incorrect ones
Result: 2-3× speedup for structured outputs (JSON, code).
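A greedy version of the draft-then-verify loop can be sketched as follows. The draft_model and target_model callables are assumptions for the example, and the acceptance rule shown is simple greedy agreement rather than the full rejection-sampling scheme and batched verification used by production engines.

```python
# Greedy sketch of the draft-then-verify loop. draft_model and target_model are
# assumed callables mapping a list of token ids to the next token id.
def speculative_decode(draft_model, target_model, prompt: list[int],
                       max_new_tokens: int, k: int = 4) -> list[int]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. Draft k candidate tokens cheaply.
        context, draft = list(tokens), []
        for _ in range(k):
            nxt = draft_model(context)
            draft.append(nxt)
            context.append(nxt)
        # 2. Verify candidates with the full model; keep the agreeing prefix.
        accepted = []
        for candidate in draft:
            expected = target_model(tokens + accepted)
            if expected == candidate:
                accepted.append(candidate)       # candidate confirmed
            else:
                accepted.append(expected)        # fall back to the full model's token
                break
        tokens.extend(accepted)
    return tokens[:len(prompt) + max_new_tokens]
```

The speedup comes from the acceptance rate: structured outputs such as JSON or code are highly predictable, so most drafted tokens are accepted and several tokens land per full-model pass.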
4. Quantisation Without Quality Loss
INT8 and FP8 quantisation reduces memory and compute requirements:
- Weight quantisation: 50% memory reduction
- Activation quantisation: Additional compute savings
- Calibration: Maintains output quality within 0.1% of FP16
Result: 2× throughput on the same hardware.
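As a rough, engine-agnostic illustration of weight quantisation with max-abs calibration (symmetric, per output channel), here is a numpy-only sketch; real deployments use engine-specific kernels, and FP8 additionally requires supporting hardware.

```python
import numpy as np

# Rough sketch of symmetric per-channel INT8 weight quantisation with max-abs
# calibration. Numpy-only for illustration; not a production kernel.
def quantize_weights(w: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Quantise FP32/FP16 weights of shape [out, in] to INT8 plus per-channel scales."""
    scales = np.abs(w).max(axis=1, keepdims=True) / 127.0   # calibration: max-abs per output channel
    scales = np.where(scales == 0, 1.0, scales)             # guard against all-zero rows
    q = np.clip(np.round(w / scales), -127, 127).astype(np.int8)
    return q, scales.astype(np.float32)

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scales

# Quick check that calibration keeps the reconstruction error small.
w = np.random.randn(1024, 1024).astype(np.float32)
q, s = quantize_weights(w)
max_err = np.abs(dequantize(q, s) - w).max()   # bounded by half a quantisation step per channel
```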
5. Prefix Caching
Many requests share common prefixes (system prompts, few-shot examples). Prefix caching:
- Pre-computes attention states for common prefixes
- Shares cached states across requests
- Reduces redundant computation
Result: 30-50% latency reduction for repeated patterns.
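Conceptually, the cache keys on the prefix's token ids and stores the prefilled attention state once. The sketch below uses an assumed compute_kv callable in place of the engine's real prefill; the class name and hit/miss counters are illustrative.

```python
# Sketch of a prefix cache: the attention state for a shared prefix is prefilled
# once and reused. compute_kv is an assumed stand-in for the engine's prefill.
class PrefixCache:
    def __init__(self, compute_kv):
        self.compute_kv = compute_kv     # callable: tuple of token ids -> KV state
        self.cache = {}
        self.hits = 0
        self.misses = 0

    def prefill(self, prompt_tokens: list[int], prefix_len: int):
        """Return (cached prefix state, remaining tokens that still need prefill)."""
        key = tuple(prompt_tokens[:prefix_len])      # e.g. system prompt + few-shot examples
        if key in self.cache:
            self.hits += 1
        else:
            self.misses += 1
            self.cache[key] = self.compute_kv(key)   # computed once, shared across requests
        return self.cache[key], prompt_tokens[prefix_len:]
```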
Latency Breakdown
For a typical 100-token generation, the end-to-end latency budget breaks down as follows:
| Stage | Time |
|---|---|
| Request routing | 2ms |
| Batch scheduling | 3ms |
| Prefill (first token) | 25ms |
| Generation (99 tokens) | 65ms |
| Response streaming | 5ms |
| Total | ~100ms |
Monitoring and Tuning
We instrument every stage with:
- P50/P95/P99 latency tracking
- Queue depth monitoring
- Cache hit rates
- Batch size distribution
Automated tuning adjusts:
- Batch timeout thresholds
- Maximum batch sizes
- Memory allocation strategies
- Request routing weights
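Instrumentation of this kind can be as simple as recording per-stage latency samples and reading off percentiles. In production this lives in a metrics system rather than in-process lists; the class name and the simple percentile maths below are illustrative.

```python
import statistics
from collections import defaultdict

# Sketch of per-stage latency instrumentation feeding P50/P95/P99 dashboards.
class StageMetrics:
    def __init__(self):
        self.samples = defaultdict(list)     # stage name -> latency samples in ms

    def record(self, stage: str, latency_ms: float) -> None:
        self.samples[stage].append(latency_ms)

    def percentiles(self, stage: str) -> dict[str, float]:
        # statistics.quantiles needs at least two samples per stage.
        cuts = statistics.quantiles(self.samples[stage], n=100)
        return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

metrics = StageMetrics()
for sample in (24.1, 26.8, 31.5):
    metrics.record("prefill", sample)
print(metrics.percentiles("prefill"))
```

The same signals (queue depth, cache hit rate, batch size distribution) feed the automated tuning loop that adjusts timeouts, batch sizes, and routing weights.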
Hardware Considerations
Optimisations must match hardware capabilities:
Memory Bandwidth
LLM decoding is memory-bandwidth-bound: each generated token requires streaming the model weights and KV cache from memory. We prioritise:
- High-bandwidth memory (HBM)
- Optimal tensor layouts
- Minimal data movement
Compute Precision
Mixed precision execution:
- FP16 for most operations
- INT8 for compatible layers
- FP32 only where numerically necessary
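A minimal PyTorch sketch of this policy, assuming a CUDA device; the tiny model is a placeholder for illustration, and INT8 paths (which need dedicated kernels) are omitted.

```python
import torch

# Mixed-precision sketch: FP16 by default under autocast, FP32 where numerics demand it.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 4096),
).cuda()

x = torch.randn(8, 4096, device="cuda")
with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.float16):
    hidden = model(x)                                  # matmuls run in FP16 under autocast
    # Numerically sensitive ops can opt back into FP32, e.g. the final softmax:
    probs = torch.softmax(hidden.float(), dim=-1)
```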
Scaling Horizontally
Beyond single-node optimisation:
- Model parallelism: Split large models across devices
- Pipeline parallelism: Split layers into stages across devices, overlapping computation with communication
- Replica load balancing: Distribute requests across instances
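Replica selection can be as simple as scoring instances by queue depth and recent tail latency. The weighting below is illustrative, not our production policy; the per-replica stats would come from the monitoring pipeline described above.

```python
from dataclasses import dataclass

# Sketch of latency-aware replica selection with an illustrative scoring rule.
@dataclass
class Replica:
    name: str
    queue_depth: int         # requests currently waiting on this instance
    p95_latency_ms: float    # recent tail latency

def pick_replica(replicas: list[Replica]) -> Replica:
    # Lower score wins: penalise deep queues and high recent tail latency.
    return min(replicas, key=lambda r: r.queue_depth * 10 + r.p95_latency_ms)

replicas = [Replica("gpu-a", 3, 80.0), Replica("gpu-b", 1, 95.0), Replica("gpu-c", 5, 60.0)]
target = pick_replica(replicas)   # -> gpu-b with this weighting
```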
Conclusion
Achieving 1,200+ tokens/sec with consistent sub-100ms latency requires optimisation at every layer—from request routing to memory management. The techniques described here compound: each improvement enables further gains.
For technical discussions about inference optimisation, reach out to info@scx.ai.