
Engineering Blog

Achieving 1,200+ Tokens/Sec: Optimising Inference Pipelines

Deep dive into our inference optimisation techniques that deliver consistent sub-100ms latency at scale.

By SCX.ai Engineering Team · 10 min read

The Inference Challenge

Serving large language models at scale requires balancing three competing objectives:

  • Throughput: Tokens generated per second across all requests
  • Latency: Time to first token and inter-token delay
  • Cost: Compute resources per token

Optimising one often degrades another. This post details how we achieve 1,200+ tokens/sec while maintaining sub-100ms latency.

Architecture Overview

Our inference pipeline consists of:

  1. Request Router: Intelligent load balancing with latency-aware scheduling
  2. Batch Aggregator: Dynamic batching with deadline-aware grouping
  3. Inference Engine: Optimised model execution with continuous batching
  4. KV Cache Manager: Efficient attention state management
  5. Response Streamer: Token-by-token delivery with minimal overhead
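As a concrete illustration of the first stage, here is a minimal sketch of latency-aware routing. The `Replica` class, the EWMA-based wait estimate, and the field names are illustrative assumptions for this post, not our production router.

```python
import random
from dataclasses import dataclass


@dataclass
class Replica:
    """Hypothetical view of one inference replica."""
    name: str
    ewma_latency_ms: float = 50.0   # exponentially weighted moving average
    queue_depth: int = 0


class LatencyAwareRouter:
    """Route each request to the replica with the lowest estimated wait.

    Estimated wait = smoothed per-request latency * (queued requests + 1).
    This is an illustrative heuristic, not the exact production policy.
    """

    def __init__(self, replicas, alpha=0.2):
        self.replicas = replicas
        self.alpha = alpha  # EWMA smoothing factor

    def pick(self) -> Replica:
        return min(self.replicas,
                   key=lambda r: r.ewma_latency_ms * (r.queue_depth + 1))

    def record(self, replica: Replica, observed_ms: float) -> None:
        # Update the moving latency estimate after each completed request.
        replica.ewma_latency_ms = (
            self.alpha * observed_ms
            + (1 - self.alpha) * replica.ewma_latency_ms
        )


# Usage: route a few requests across two replicas.
router = LatencyAwareRouter([Replica("gpu-0"), Replica("gpu-1")])
for _ in range(3):
    target = router.pick()
    target.queue_depth += 1
    # ... dispatch the request; on completion, feed the latency back:
    router.record(target, observed_ms=random.uniform(40, 120))
    target.queue_depth -= 1
```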

Key Optimisations

1. Continuous Batching

Traditional batching waits for a batch to fill before processing. Continuous batching:

  • Processes requests as they arrive
  • Adds new requests to in-flight batches
  • Removes completed requests immediately

Result: 40% higher throughput with lower average latency.
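To make the idea concrete, here is a minimal sketch of a continuous-batching loop. The queue, the `Request` fields, and the one-token-per-step granularity are simplifying assumptions; a real engine interleaves prefill and decode far more carefully.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    generated: int = 0


def decode_one_step(batch):
    """Stand-in for one forward pass that emits one token per active request."""
    for req in batch:
        req.generated += 1


def continuous_batching_loop(incoming: deque, max_batch_size: int = 16):
    batch: list[Request] = []
    while incoming or batch:
        # 1. Admit new requests into the in-flight batch as slots free up.
        while incoming and len(batch) < max_batch_size:
            batch.append(incoming.popleft())
        # 2. Run one decode step for every active request.
        decode_one_step(batch)
        # 3. Retire finished requests immediately, freeing their slots.
        batch = [r for r in batch if r.generated < r.max_new_tokens]


# Usage
queue = deque(Request(f"prompt {i}", max_new_tokens=4) for i in range(40))
continuous_batching_loop(queue)
```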

2. PagedAttention for KV Cache

Attention key-value caches grow linearly with sequence length. PagedAttention:

  • Allocates memory in fixed-size blocks
  • Enables memory sharing across requests
  • Reduces memory fragmentation

Result: 2-4× more concurrent requests per GPU.
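The sketch below shows only the memory-management idea (fixed-size block allocation from a shared free list); it omits the attention kernel itself, and the block size and API are illustrative assumptions rather than a real PagedAttention implementation.

```python
class BlockKVCache:
    """Toy paged KV-cache allocator: fixed-size blocks drawn from a shared pool."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size                    # tokens per block
        self.free_blocks = list(range(num_blocks))      # free list of block ids
        self.block_tables: dict[str, list[int]] = {}    # request -> block ids
        self.token_counts: dict[str, int] = {}          # request -> tokens stored

    def append_token(self, request_id: str) -> int:
        """Reserve space for one more token; allocate a new block on overflow."""
        count = self.token_counts.get(request_id, 0)
        table = self.block_tables.setdefault(request_id, [])
        if count % self.block_size == 0:                # current block is full
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; request must be preempted")
            table.append(self.free_blocks.pop())
        self.token_counts[request_id] = count + 1
        return table[-1]                                # block holding this token

    def release(self, request_id: str) -> None:
        """Return all of a finished request's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.token_counts.pop(request_id, None)


# Usage: two requests share one pool; freed blocks are reused immediately.
cache = BlockKVCache(num_blocks=8, block_size=16)
for _ in range(40):
    cache.append_token("req-A")   # 40 tokens -> 3 blocks
cache.release("req-A")            # blocks return to the free list
for _ in range(20):
    cache.append_token("req-B")   # reuses the blocks req-A released
```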

3. Speculative Decoding

For text where the next tokens are easy to predict, speculative decoding:

  • Drafts several candidate tokens cheaply ahead of the current position
  • Verifies the candidates with the full model in a single parallel pass
  • Accepts the candidates that match and discards the rest

Result: 2-3× speedup for structured outputs (JSON, code).
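Below is a minimal draft-and-verify loop. The `draft_model` and `target_model` callables are placeholders, the greedy prefix-match acceptance rule is a simplification of the probabilistic acceptance test used with sampling, and the per-position verification loop stands in for what is really one batched forward pass.

```python
def speculative_decode(prefix, draft_model, target_model, k=4, max_new_tokens=16):
    """Draft k tokens with a cheap model, then verify them with the full model.

    draft_model(tokens)  -> next token id (cheap to call)
    target_model(tokens) -> next token id (expensive); a real engine scores all
    k positions in one batched forward pass rather than a Python loop.
    """
    tokens = list(prefix)
    while len(tokens) - len(prefix) < max_new_tokens:
        # 1. Draft k candidate tokens cheaply.
        draft, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_model(ctx)
            draft.append(t)
            ctx.append(t)
        # 2. Verify: accept drafted tokens until the full model disagrees.
        accepted, correction = 0, None
        for i in range(k):
            expected = target_model(tokens + draft[:i])
            if expected != draft[i]:
                correction = expected          # take the full model's token instead
                break
            accepted += 1
        tokens.extend(draft[:accepted])
        if correction is not None:
            tokens.append(correction)
        # Each round makes progress: accepted draft tokens and/or one correction.
    return tokens


# Toy usage: draft and target always agree, so every drafted token is accepted.
draft = lambda ctx: len(ctx) % 5
target = lambda ctx: len(ctx) % 5
print(speculative_decode([1, 2, 3], draft, target, k=4, max_new_tokens=8))
```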

4. Quantisation Without Quality Loss

INT8 and FP8 quantisation reduces memory and compute requirements:

  • Weight quantisation: 50% memory reduction
  • Activation quantisation: Additional compute savings
  • Calibration: Maintains output quality within 0.1% of FP16

Result: 2× throughput on the same hardware.
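The sketch below shows symmetric per-channel INT8 weight quantisation with a simple absolute-max calibration. The scaling scheme is generic and the error you get depends on the model and calibration data, so treat this as an illustration rather than our exact recipe.

```python
import numpy as np


def quantize_weights_int8(w: np.ndarray):
    """Symmetric per-output-channel INT8 quantisation (rows = output channels)."""
    # Calibration: pick one scale per row so the largest value maps to 127.
    scales = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scales = np.where(scales == 0, 1.0, scales)          # avoid divide-by-zero
    q = np.clip(np.round(w / scales), -127, 127).astype(np.int8)
    return q, scales


def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scales


# Usage: quantise a random FP32 weight matrix and measure the error.
w = np.random.randn(1024, 1024).astype(np.float32)
q, scales = quantize_weights_int8(w)
w_hat = dequantize(q, scales)
rel_err = np.abs(w - w_hat).mean() / np.abs(w).mean()
print(f"INT8 storage: {q.nbytes / w.nbytes:.0%} of FP32, "
      f"mean relative error {rel_err:.4f}")
```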

5. Prefix Caching

Many requests share common prefixes (system prompts, few-shot examples). Prefix caching:

  • Pre-computes attention states for common prefixes
  • Shares cached states across requests
  • Reduces redundant computation

Result: 30-50% latency reduction for repeated patterns.
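Here is a minimal sketch of a prefix cache keyed by a hash of the shared prompt prefix. The `compute_prefill` callable and the unbounded dict are simplifying assumptions; a production cache bounds its memory and hands out KV state that integrates with the paged blocks described above.

```python
class PrefixCache:
    """Cache precomputed attention (KV) state for shared prompt prefixes."""

    def __init__(self, compute_prefill):
        # compute_prefill(tokens) -> opaque KV state for those tokens (assumed)
        self.compute_prefill = compute_prefill
        self.cache = {}          # prefix hash -> precomputed KV state
        self.hits = 0
        self.misses = 0

    def prefill(self, prompt_tokens, prefix_len):
        """Reuse cached KV state for the first prefix_len tokens if available."""
        key = hash(tuple(prompt_tokens[:prefix_len]))
        if key in self.cache:
            self.hits += 1
            prefix_state = self.cache[key]
        else:
            self.misses += 1
            prefix_state = self.compute_prefill(prompt_tokens[:prefix_len])
            self.cache[key] = prefix_state
        # Only the non-shared suffix still needs a fresh prefill pass
        # (the suffix would attend to the cached prefix state in a real engine).
        suffix_state = self.compute_prefill(prompt_tokens[prefix_len:])
        return prefix_state, suffix_state


# Usage: many requests sharing one system prompt hit the cache after the first.
system_prompt = list(range(200))                       # stand-in token ids
cache = PrefixCache(compute_prefill=lambda toks: ("kv", len(toks)))
for user_turn in ([7, 8], [9], [10, 11, 12]):
    cache.prefill(system_prompt + user_turn, prefix_len=len(system_prompt))
print(f"hits={cache.hits} misses={cache.misses}")      # hits=2 misses=1
```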

Latency Breakdown

For a typical 100-token generation at sub-100ms:

  Stage                    Time
  Request routing          2 ms
  Batch scheduling         3 ms
  Prefill (first token)    25 ms
  Generation (99 tokens)   65 ms
  Response streaming       5 ms
  Total                    sub-100 ms

Monitoring and Tuning

We instrument every stage with:

  • P50/P95/P99 latency tracking
  • Queue depth monitoring
  • Cache hit rates
  • Batch size distribution
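A minimal sketch of per-stage latency percentile tracking over a sliding window; the window size, stage names, and reporting shape are illustrative.

```python
import statistics
from collections import defaultdict, deque


class LatencyTracker:
    """Keep a sliding window of samples per stage and report P50/P95/P99."""

    def __init__(self, window: int = 10_000):
        self.samples = defaultdict(lambda: deque(maxlen=window))

    def observe(self, stage: str, latency_ms: float) -> None:
        self.samples[stage].append(latency_ms)

    def percentiles(self, stage: str):
        data = sorted(self.samples[stage])
        # statistics.quantiles with n=100 yields the 1st..99th percentiles.
        q = statistics.quantiles(data, n=100)
        return {"p50": q[49], "p95": q[94], "p99": q[98]}


# Usage
tracker = LatencyTracker()
for i in range(1000):
    tracker.observe("prefill", 20 + (i % 50) * 0.3)
print(tracker.percentiles("prefill"))
```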

Automated tuning adjusts:

  • Batch timeout thresholds
  • Maximum batch sizes
  • Memory allocation strategies
  • Request routing weights
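As a sketch of the simplest of these knobs, the function below nudges the batch-timeout threshold based on observed P99 latency versus a target: shrink the batching window when over budget, grow it when there is headroom (a longer window yields fuller batches and higher throughput). The target, step size, and bounds are illustrative assumptions.

```python
def tune_batch_timeout(current_ms: float, p99_ms: float,
                       target_p99_ms: float = 95.0, step_ms: float = 0.5,
                       lo: float = 0.5, hi: float = 10.0) -> float:
    """Adjust the batch-aggregation timeout from the latest P99 observation."""
    if p99_ms > target_p99_ms:
        current_ms -= step_ms          # over budget: batch less aggressively
    else:
        current_ms += step_ms          # headroom: trade latency for throughput
    return min(hi, max(lo, current_ms))


# Usage: called periodically from the metrics loop.
timeout = 2.0
for observed_p99 in (90.0, 102.0, 98.0, 88.0):
    timeout = tune_batch_timeout(timeout, observed_p99)
    print(f"p99={observed_p99:.0f}ms -> batch timeout {timeout:.1f}ms")
```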

Hardware Considerations

Optimisations must match hardware capabilities:

Memory Bandwidth

LLM inference is memory-bound. We prioritise:

  • High-bandwidth memory (HBM)
  • Optimal tensor layouts
  • Minimal data movement
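To see why decoding is memory-bound, a rough back-of-the-envelope calculation helps: at batch size 1, every generated token must stream the full set of weights from memory, so tokens/sec is capped near bandwidth divided by model size, and batching is what amortises that cost. The model size and bandwidth figures below are illustrative, not a specific GPU SKU.

```python
def max_tokens_per_sec(model_bytes: float, bandwidth_bytes_per_s: float,
                       batch_size: int = 1) -> float:
    """Rough memory-bandwidth ceiling: each decode step reads all weights once,
    amortised across the batch (KV-cache traffic ignored for simplicity)."""
    steps_per_sec = bandwidth_bytes_per_s / model_bytes
    return steps_per_sec * batch_size


# Illustrative numbers: a 7B-parameter model in FP16 (~14 GB) on ~3 TB/s HBM.
weights = 7e9 * 2            # bytes of weights
hbm = 3e12                   # bytes/sec of memory bandwidth
print(f"batch 1 : ~{max_tokens_per_sec(weights, hbm, 1):.0f} tokens/sec ceiling")
print(f"batch 32: ~{max_tokens_per_sec(weights, hbm, 32):.0f} tokens/sec ceiling")
```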

Compute Precision

Mixed precision execution:

  • FP16 for most operations
  • INT8 for compatible layers
  • FP32 only where numerically necessary
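A minimal PyTorch sketch of mixed-precision execution on a CUDA device; the tiny `Sequential` model is a placeholder, and which layers are kept in higher precision is model-dependent rather than fixed by this snippet.

```python
import torch

# Placeholder model; a real serving stack applies precision choices per layer.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 4096),
).cuda()                                     # weights stored in FP32 here

x = torch.randn(8, 4096, device="cuda")

with torch.inference_mode():
    # autocast dispatches matmuls and other eligible ops in FP16 while keeping
    # numerically sensitive ops (e.g. reductions) in FP32.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        y = model(x)

print(y.dtype)   # torch.float16 for the autocast-eligible output
```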

Scaling Horizontally

Beyond single-node optimisation:

  • Model parallelism: Split large models across devices
  • Pipeline parallelism: Overlap computation and communication
  • Replica load balancing: Distribute requests across instances
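As a toy illustration of model parallelism, the sketch below splits a linear layer's weight matrix column-wise across two "devices" and concatenates the partial outputs. The NumPy shards stand in for per-GPU tensors; a real implementation places each shard on its own device and overlaps the gather with computation.

```python
import numpy as np


def column_parallel_linear(x: np.ndarray, w: np.ndarray, num_shards: int = 2):
    """Split W by output columns across shards; each shard computes a slice of y."""
    shards = np.array_split(w, num_shards, axis=1)      # one piece per "device"
    partial_outputs = [x @ shard for shard in shards]   # computed independently
    return np.concatenate(partial_outputs, axis=1)      # gather of the results


# Check the sharded result matches the single-device computation.
x = np.random.randn(4, 512)
w = np.random.randn(512, 2048)
assert np.allclose(column_parallel_linear(x, w), x @ w)
```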

Conclusion

Achieving 1,200+ tokens/sec with consistent sub-100ms latency requires optimisation at every layer—from request routing to memory management. The techniques described here compound: each improvement enables further gains.

For technical discussions about inference optimisation, reach out to info@scx.ai.

Related Topics

inference optimisation, tokens per second, latency, throughput, LLM serving, batching, KV cache