Back to all posts

How to Cut LLM Inference Costs with vLLM and Speculative Decoding

Mon Apr 20 2026 • Birat Gautam
LLMsInfrastructureCost OptimizationvLLMProduction ML
Inference OptimizationToken EfficiencyMLOps

Difficulty: Intermediate

The Cost Problem Nobody Talks About

You're probably running LLMs in production right now. And if you are, you're overpaying by a factor of 10.

I'm not exaggerating. A typical inference setup—using standard transformers or even basic vLLM—wastes 60-70% of compute just moving data around. The actual reasoning? That's maybe 30% of your bill.

Let me show you the numbers.

The Token Generation Bottleneck

When you call an LLM API (OpenAI, Anthropic, locally hosted), here's what happens under the hood:

  1. Prefill phase: Process your prompt (100 tokens) → takes 10ms, uses 1 GPU batch
  2. Generation phase: Generate output (1 token at a time) → takes 100ms for 100 tokens, uses 100 GPU batches

See the problem? Each token generated requires a full forward pass through the model. For a 100-token output, that's 100 separate GPU operations, each with massive memory overhead.

This is called the memory-bound bottleneck. Your GPU spends 99% of time moving weights from HBM (high-bandwidth memory) to compute units, and 1% actually computing.

Real cost impact:

That's 90% cost reduction for the same quality output. This isn't theoretical—it's deployed on NVIDIA H100s in production right now.

If you want the broader architecture view of where this money goes inside an agent stack, read Token Economics: Why Your Agent Architecture Is Costing 10x More Than It Should.

How PagedAttention Cuts Memory Waste

vLLM's breakthrough was simple but profound: Stop treating KV cache as a contiguous block.

In standard transformers, when you generate tokens, you store the full attention key-value (KV) cache for every previous token. This creates massive memory fragmentation.

Standard Attention Layout:
[Token1_KV] [Token2_KV] [Token3_KV] ... [TokenN_KV] [WASTED SPACE] [WASTED SPACE]

With PagedAttention:

vLLM PagedAttention Layout:
[Block1: Tokens1-16] → [Block2: Tokens17-32] → [Block3: Tokens33-48]

Benefit: 95% less memory waste. You can now batch 40 sequences where you could only batch 4 before.

Real numbers from vLLM's GPU benchmarks:

That's continuous batching. While one request generates its 100th token, 39 others are prefilling. Result: 7x throughput, same hardware.

Speculative Decoding: Generate Multiple Tokens Per GPU Cycle

Now here's where it gets spicy.

Remember the problem: Each token requires one forward pass. That's a hard constraint, right?

Wrong. In 2025-2026, researchers realized: You don't need the same model to verify tokens.

Speculative decoding works like this:

  1. Small model (3-7B) generates 4 candidate tokens in parallel (speculative)
  2. Large model (35B-70B) verifies all 4 in a single forward pass
  3. If all pass verification: You just got 4 tokens for the cost of 1.3 large model passes
  4. If some fail: Rewind and re-generate with large model only

The math:

This is now in vLLM (TorchSpec integration), LMDeploy, and deployed at scale.

Real Production Example

Let's do the math for a typical startup serving Claude-level AI:

Scenario: 100 users, 200 requests/day, 500 tokens output average

Standard inference (naive transformers):

vLLM with PagedAttention:

vLLM + Speculative Decoding:

Annual savings: ~$1,000 for this tiny setup.

For a company like Perplexity (millions of requests/day), this is $10M-100M annually.

The Setup: vLLM + Speculative Decoding in 5 Minutes

Here's the actual code. It's embarrassingly simple.

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

# Initialize with speculative decoding
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    tensor_parallel_size=2,  # 2 GPUs
    dtype="float16",
    max_model_len=4096,
    # Speculative decoding config
    speculative_model="meta-llama/Llama-2-7b-hf",  # Small draft model
    num_speculative_tokens=4,  # Generate 4 tokens at a time
    use_v2_block_manager=True,  # PagedAttention (vLLM 0.4.0+)
)

# Generate
prompts = ["Tell me about inference optimization..."] * 100
sampling_params = SamplingParams(temperature=0.7, max_tokens=500)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)

That's it. No custom CUDA kernels, no black magic. You get 70-90% cost reduction out of the box.

Cost Comparison: The Honest Math

Let's compare across different serving scenarios:

Setup Cost/1M Tokens Latency Throughput
GPT-4 API $30 500ms 1 req/s
Naive vLLM (H100) $0.80 150ms 50 req/s
vLLM + PagedAttention $0.24 160ms 250 req/s
vLLM + Speculative $0.08 175ms 350 req/s
OpenAI o1 (reasoning) $60 2000ms 0.5 req/s

The secret: You don't need to choose between cost and quality. Better algorithms win on both dimensions.

Why Nobody's Doing This (Yet)

If this is so good, why isn't every startup using it?

  1. Operational friction: Adding speculative decoding adds debugging complexity. When inference is slow, is it the draft model? The verification? The batching?

  2. Draft model tuning: Speculative decoding only works if your small model is accurate enough. If your 7B model generates garbage, the large model rejects everything, and you lose the speedup.

  3. Latency variance: Speculative decoding introduces non-deterministic latency. Sometimes you accept all tokens (175ms), sometimes you reject and regenerate (250ms). Production systems hate variance.

  4. VRAM pressure: You need both models loaded. A 70B + 7B setup needs ~80GB VRAM. That's 2x H100s, not 1.

Real team feedback from Reddit, HN, and production deployments:

The Decision Tree: When to Use What

Use naive vLLM if:

Use vLLM + PagedAttention if:

Use vLLM + Speculative if:

The Hidden Cost Nobody Mentions: Monitoring

Here's what happens after you deploy speculative decoding:

[Day 1] "Great! Costs down 70%, shipping!"
[Day 8] "Why is inference variance increasing?"
[Day 15] "Some outputs are garbage, debug mode shows 30% verification failure"
[Day 30] "Reverted to naive vLLM, costs are back up"

You need observability for:

Without this, you're flying blind.

Practical Takeaway

Your inference setup is burning money. Switch to vLLM with PagedAttention by end of this week. That's a 70% cost reduction with zero code changes.

Speculative decoding is next. It requires more tuning but the ROI is massive if your workload allows it.

And here's the really spicy part: Frontier model providers (OpenAI, Anthropic, Google) know this. They've already optimized their inference to death. Your $30/1M token bill for GPT-4 includes maybe $2 in actual compute cost. The rest is R&D amortization and profit margin.

By running your own inference with vLLM, you're not just saving money. You're catching up to what the frontier labs figured out in 2023-2024.

The gap is closing. And you should close it too.

FAQ

Q: Will speculative decoding hurt output quality?
A: No. Speculative decoding only accepts tokens that the large model would have generated anyway. Quality is identical.

Q: How much VRAM do I need?
A: For 70B model: ~40GB. For 70B + 7B speculative: ~50GB. Two H100s (80GB each) are overkill but stable.

Q: Can I use open models or just frontier models?
A: Open models work great. Llama 2 70B + 7B is a common combo and very efficient.

Q: What about batching with users in different regions?
A: vLLM's continuous batching handles this automatically. You can mix requests with 10-5000 token outputs in the same batch.

Q: Is vLLM production-ready?
A: Yes. 77K GitHub stars, used by Alibaba, Google Cloud, AWS. It's the standard now.