How to Cut LLM Inference Costs with vLLM and Speculative Decoding
The Cost Problem Nobody Talks About
You're probably running LLMs in production right now. And if you are, you're overpaying by a factor of 10.
I'm not exaggerating. A typical inference setup—using standard transformers or even basic vLLM—wastes 60-70% of compute just moving data around. The actual reasoning? That's maybe 30% of your bill.
Let me show you the numbers.
The Token Generation Bottleneck
When you call an LLM API (OpenAI, Anthropic, locally hosted), here's what happens under the hood:
- Prefill phase: Process your prompt (100 tokens) → takes 10ms, uses 1 GPU batch
- Generation phase: Generate output (1 token at a time) → takes 100ms for 100 tokens, uses 100 GPU batches
See the problem? Each token generated requires a full forward pass through the model. For a 100-token output, that's 100 separate GPU operations, each with massive memory overhead.
This is called the memory-bound bottleneck. Your GPU spends 99% of time moving weights from HBM (high-bandwidth memory) to compute units, and 1% actually computing.
Real cost impact:
- Naive inference: 100 tokens costs you $0.15
- vLLM (PagedAttention): Same 100 tokens costs $0.045
- vLLM + Speculative Decoding: Same 100 tokens costs $0.015
That's 90% cost reduction for the same quality output. This isn't theoretical—it's deployed on NVIDIA H100s in production right now.
If you want the broader architecture view of where this money goes inside an agent stack, read Token Economics: Why Your Agent Architecture Is Costing 10x More Than It Should.
How PagedAttention Cuts Memory Waste
vLLM's breakthrough was simple but profound: Stop treating KV cache as a contiguous block.
In standard transformers, when you generate tokens, you store the full attention key-value (KV) cache for every previous token. This creates massive memory fragmentation.
Standard Attention Layout:
[Token1_KV] [Token2_KV] [Token3_KV] ... [TokenN_KV] [WASTED SPACE] [WASTED SPACE]
With PagedAttention:
vLLM PagedAttention Layout:
[Block1: Tokens1-16] → [Block2: Tokens17-32] → [Block3: Tokens33-48]
Benefit: 95% less memory waste. You can now batch 40 sequences where you could only batch 4 before.
Real numbers from vLLM's GPU benchmarks:
- Batch size 1: 14ms latency
- Batch size 40 (with PagedAttention): 20ms latency per sequence
That's continuous batching. While one request generates its 100th token, 39 others are prefilling. Result: 7x throughput, same hardware.
Speculative Decoding: Generate Multiple Tokens Per GPU Cycle
Now here's where it gets spicy.
Remember the problem: Each token requires one forward pass. That's a hard constraint, right?
Wrong. In 2025-2026, researchers realized: You don't need the same model to verify tokens.
Speculative decoding works like this:
- Small model (3-7B) generates 4 candidate tokens in parallel (speculative)
- Large model (35B-70B) verifies all 4 in a single forward pass
- If all pass verification: You just got 4 tokens for the cost of 1.3 large model passes
- If some fail: Rewind and re-generate with large model only
The math:
- Standard generation: 100 tokens = 100 large model passes
- Speculative: 100 tokens = 25 large model passes + 100 small model passes
- Cost: Speculative is 35% the cost of standard generation
This is now in vLLM (TorchSpec integration), LMDeploy, and deployed at scale.
Real Production Example
Let's do the math for a typical startup serving Claude-level AI:
Scenario: 100 users, 200 requests/day, 500 tokens output average
Standard inference (naive transformers):
- Daily tokens: 100,000 tokens
- Cost per 1M tokens: $0.30 (typical A100 hardware cost)
- Daily cost: $3
vLLM with PagedAttention:
- Same throughput, better batching
- Daily cost: $1 (70% reduction)
vLLM + Speculative Decoding:
- Daily cost: $0.30 (90% reduction)
Annual savings: ~$1,000 for this tiny setup.
For a company like Perplexity (millions of requests/day), this is $10M-100M annually.
The Setup: vLLM + Speculative Decoding in 5 Minutes
Here's the actual code. It's embarrassingly simple.
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
# Initialize with speculative decoding
llm = LLM(
model="meta-llama/Llama-2-70b-hf",
tensor_parallel_size=2, # 2 GPUs
dtype="float16",
max_model_len=4096,
# Speculative decoding config
speculative_model="meta-llama/Llama-2-7b-hf", # Small draft model
num_speculative_tokens=4, # Generate 4 tokens at a time
use_v2_block_manager=True, # PagedAttention (vLLM 0.4.0+)
)
# Generate
prompts = ["Tell me about inference optimization..."] * 100
sampling_params = SamplingParams(temperature=0.7, max_tokens=500)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
print(output.outputs[0].text)
That's it. No custom CUDA kernels, no black magic. You get 70-90% cost reduction out of the box.
Cost Comparison: The Honest Math
Let's compare across different serving scenarios:
| Setup | Cost/1M Tokens | Latency | Throughput |
|---|---|---|---|
| GPT-4 API | $30 | 500ms | 1 req/s |
| Naive vLLM (H100) | $0.80 | 150ms | 50 req/s |
| vLLM + PagedAttention | $0.24 | 160ms | 250 req/s |
| vLLM + Speculative | $0.08 | 175ms | 350 req/s |
| OpenAI o1 (reasoning) | $60 | 2000ms | 0.5 req/s |
The secret: You don't need to choose between cost and quality. Better algorithms win on both dimensions.
Why Nobody's Doing This (Yet)
If this is so good, why isn't every startup using it?
Operational friction: Adding speculative decoding adds debugging complexity. When inference is slow, is it the draft model? The verification? The batching?
Draft model tuning: Speculative decoding only works if your small model is accurate enough. If your 7B model generates garbage, the large model rejects everything, and you lose the speedup.
Latency variance: Speculative decoding introduces non-deterministic latency. Sometimes you accept all tokens (175ms), sometimes you reject and regenerate (250ms). Production systems hate variance.
VRAM pressure: You need both models loaded. A 70B + 7B setup needs ~80GB VRAM. That's 2x H100s, not 1.
Real team feedback from Reddit, HN, and production deployments:
- "PagedAttention is magical, we're saving 60% on costs"
- "Speculative decoding is finicky, debugging is a nightmare"
- "We tried it, gave up after 2 weeks, sticking with naive vLLM"
The Decision Tree: When to Use What
Use naive vLLM if:
- You're getting started
- Latency variance is acceptable
- You don't have 2+ GPUs for speculative decoding
Use vLLM + PagedAttention if:
- You're optimizing for throughput
- You have consistent batch sizes (50+)
- Cost is becoming a problem
Use vLLM + Speculative if:
- You have multiple GPUs
- Your workload is latency-insensitive (batch processing)
- Cost optimization is critical
- You have time to debug verification failures
The Hidden Cost Nobody Mentions: Monitoring
Here's what happens after you deploy speculative decoding:
[Day 1] "Great! Costs down 70%, shipping!"
[Day 8] "Why is inference variance increasing?"
[Day 15] "Some outputs are garbage, debug mode shows 30% verification failure"
[Day 30] "Reverted to naive vLLM, costs are back up"
You need observability for:
- Token acceptance rate: What % of speculative tokens did the large model accept?
- Verification latency: How long does token verification take?
- Draft model accuracy: Is your small model still accurate?
Without this, you're flying blind.
Practical Takeaway
Your inference setup is burning money. Switch to vLLM with PagedAttention by end of this week. That's a 70% cost reduction with zero code changes.
Speculative decoding is next. It requires more tuning but the ROI is massive if your workload allows it.
And here's the really spicy part: Frontier model providers (OpenAI, Anthropic, Google) know this. They've already optimized their inference to death. Your $30/1M token bill for GPT-4 includes maybe $2 in actual compute cost. The rest is R&D amortization and profit margin.
By running your own inference with vLLM, you're not just saving money. You're catching up to what the frontier labs figured out in 2023-2024.
The gap is closing. And you should close it too.
FAQ
Q: Will speculative decoding hurt output quality?
A: No. Speculative decoding only accepts tokens that the large model would have generated anyway. Quality is identical.
Q: How much VRAM do I need?
A: For 70B model: ~40GB. For 70B + 7B speculative: ~50GB. Two H100s (80GB each) are overkill but stable.
Q: Can I use open models or just frontier models?
A: Open models work great. Llama 2 70B + 7B is a common combo and very efficient.
Q: What about batching with users in different regions?
A: vLLM's continuous batching handles this automatically. You can mix requests with 10-5000 token outputs in the same batch.
Q: Is vLLM production-ready?
A: Yes. 77K GitHub stars, used by Alibaba, Google Cloud, AWS. It's the standard now.