The Latency Trap: Why 99th-Percentile Response Time Matters More Than Average

Thu Apr 16 2026 • Birat Gautam
Agentic AI • Performance • Production Systems • Observability • SRE

Difficulty: Intermediate

Why averages fail in agent systems

An agent that averages 2 seconds per request can still feel broken.

That is not a paradox. It is a distribution problem.

Agentic systems are composed of retrieval, tool calls, branching, retries, and fallback paths. That makes latency heavy-tailed. A few slow requests do not just skew the mean; they define what users remember.

A simple shape of reality

flowchart LR
    A[Fast path\ncache hit + short prompt] --> B[2 to 4 seconds]
    C[Cold path\nretrieval + full reasoning] --> D[6 to 12 seconds]
    E[Failure path\ntimeout + retry + fallback] --> F[20 to 60 seconds]

The mean hides the failure path. p95 and p99 expose it.
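To make that concrete, here is a minimal sketch over a made-up sample of 100 requests whose shape loosely follows the three paths above. The 90/8/2 traffic split, the specific latencies, and the `percentile` helper (a nearest-rank implementation) are illustrative assumptions, not measurements.

```python
from statistics import mean


def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # integer form of ceil(p * n / 100)
    return ordered[rank - 1]


# Hypothetical traffic mix (seconds): 90 fast-path, 8 cold-path, 2 failure-path requests.
latencies_s = [3.0] * 90 + [9.0] * 8 + [40.0] * 2

summary = {
    "mean": mean(latencies_s),
    "p50": percentile(latencies_s, 50),
    "p95": percentile(latencies_s, 95),
    "p99": percentile(latencies_s, 99),
}
print(summary)
```

With this sample the mean is 4.22 seconds, which looks like a slightly slow fast path, while p95 is 9 seconds and p99 is 40 seconds. Only the percentiles reveal that the cold and failure paths exist.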

What the percentiles tell you

If your p99 jumps from 4 seconds to 18 seconds, the product changed even if the average barely moved.

Where the tail comes from

The slowest requests usually come from one of four places.

  1. Cold starts in the model or runtime.
  2. Long retrieval chains across many documents.
  3. Retry storms when a tool or API starts failing.
  4. Overlong prompts that force the model to process too much context.

flowchart TB
    subgraph Request
        A[Intake] --> B[Classify intent]
        B --> C[Retrieve context]
        C --> D[Call model]
        D --> E[Run tool]
        E --> F[Post-process]
    end

    B -. spikes .-> B1[Prompt too long]
    C -. spikes .-> C1[Too many sources]
    D -. spikes .-> D1[Model queueing]
    E -. spikes .-> E1[Slow dependency]

The fix is not “make the model faster.” The fix is to isolate each stage and assign a budget to it.
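Isolating stages starts with measuring them separately. The sketch below is one possible way to do that with a small context manager. `StageTimer` and the stage names are hypothetical, and the `time.sleep` calls stand in for real work.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Collects wall-clock duration per named stage for a single request."""

    def __init__(self) -> None:
        self.stage_times_ms: dict[str, float] = {}

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the time even if the stage raises, so failures show up in the tail.
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.stage_times_ms[name] = self.stage_times_ms.get(name, 0.0) + elapsed_ms


timer = StageTimer()
with timer.stage("context_retrieval"):
    time.sleep(0.02)  # stand-in for real retrieval work
with timer.stage("model_inference"):
    time.sleep(0.03)  # stand-in for a model call

print(timer.stage_times_ms)
```

Accumulating into the same key means a stage that runs twice, for example after a retry, reports its combined cost instead of overwriting the first attempt.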

Stage budgets that actually help

Think in budgets, not vibes.

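# Per-stage and end-to-end latency budgets, in milliseconds.
LATENCY_BUDGET_MS = {
    "intent_classification": 150,
    "context_retrieval": 700,
    "model_inference": 1600,
    "tool_execution": 1200,
    "response_finalization": 250,
    "p95_total": 3000,
    "p99_total": 5000,
}


def check_stage_budget(stage_times_ms: dict[str, int]) -> bool:
    """Return True if every stage and the request total are within budget."""
    # Stages without a configured budget only count toward the total.
    stages_ok = all(
        elapsed <= LATENCY_BUDGET_MS.get(stage, elapsed)
        for stage, elapsed in stage_times_ms.items()
    )
    total = sum(stage_times_ms.values())
    return stages_ok and total <= LATENCY_BUDGET_MS["p95_total"]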

This is useful for two reasons.

First, it makes regressions visible. Second, it tells you where to invest engineering time. A 200 ms improvement in retrieval matters more than a 200 ms win in a stage that only runs on 10 percent of requests.
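To see where the time went on a single slow request, one option is to rank stages by how far they exceeded their budget. This is a sketch, not part of the code above: `stage_overruns` is a hypothetical helper, the budgets repeat the per-stage values from the dictionary above, and the observed timings are invented.

```python
# Per-stage budgets in milliseconds, repeated from the dictionary above (totals omitted).
STAGE_BUDGET_MS = {
    "intent_classification": 150,
    "context_retrieval": 700,
    "model_inference": 1600,
    "tool_execution": 1200,
    "response_finalization": 250,
}


def stage_overruns(stage_times_ms: dict[str, float]) -> list[tuple[str, float]]:
    """Return (stage, ms over budget) for each stage that exceeded its budget, worst first."""
    overruns = [
        (stage, elapsed - STAGE_BUDGET_MS[stage])
        for stage, elapsed in stage_times_ms.items()
        if stage in STAGE_BUDGET_MS and elapsed > STAGE_BUDGET_MS[stage]
    ]
    return sorted(overruns, key=lambda item: item[1], reverse=True)


# Hypothetical timings for one slow request.
observed = {
    "intent_classification": 120,
    "context_retrieval": 1450,
    "model_inference": 1900,
    "tool_execution": 800,
    "response_finalization": 90,
}
print(stage_overruns(observed))
```

For these timings the report lists `context_retrieval` first (750 ms over) and `model_inference` second (300 ms over), which is the order in which to investigate.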

How to reduce p99 without breaking the product

flowchart LR
    A[Input] --> B{Confidence high enough?}
    B -- yes --> C[Return early]
    B -- no --> D[Retrieve more context]
    D --> E{Tool needed?}
    E -- yes --> F[Call tool with timeout]
    E -- no --> G[Generate answer]
    F --> G

The early return matters most. Many systems are slow because they keep doing work after they already know enough to answer safely.
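The two branches that bound the tail in the diagram, returning early and calling tools with a timeout, can be sketched with `asyncio.wait_for`. Everything named here is an assumption for illustration: `slow_tool`, the `answer` function, the 0.9 confidence threshold, and the timeout values. The retrieval step is omitted to keep the sketch short.

```python
import asyncio


async def slow_tool(query: str) -> str:
    """Hypothetical tool that takes far longer than its budget."""
    await asyncio.sleep(0.5)
    return f"tool result for {query}"


async def answer(query: str, confidence: float, tool_timeout_s: float = 0.05) -> str:
    # Early exit: skip further work when the cheap path is already good enough.
    if confidence >= 0.9:  # threshold is an illustrative assumption
        return "answer from fast path"
    try:
        # wait_for cancels the tool call once the timeout elapses.
        tool_output = await asyncio.wait_for(slow_tool(query), timeout=tool_timeout_s)
    except asyncio.TimeoutError:
        # Degrade gracefully instead of letting one slow dependency own the tail.
        return "answer without tool (timed out)"
    return f"answer using {tool_output}"


print(asyncio.run(answer("refund status", confidence=0.95)))
print(asyncio.run(answer("refund status", confidence=0.40)))
```

The first call returns from the fast path without touching the tool. The second gives up on the tool after 50 ms and falls back, so the worst case is bounded by the timeout instead of by the dependency.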

What to alert on

Alert on percentile drift, not just uptime.
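A drift check can be as simple as comparing the current window against a baseline window. The sketch below uses `statistics.quantiles` from the standard library. The `drifted` helper, the 1.5x threshold, and both sample windows are assumptions chosen to mirror the 4-second-to-18-second example earlier in the post.

```python
from statistics import mean, quantiles


def p99(samples_ms: list[float]) -> float:
    """99th percentile: the last of the 99 cut points returned for n=100."""
    return quantiles(samples_ms, n=100)[98]


def drifted(baseline: float, current: float, max_ratio: float = 1.5) -> bool:
    """True when current exceeds baseline by more than max_ratio (assumed threshold)."""
    return current > max_ratio * baseline


# Hypothetical windows: one request in a hundred regresses from 4 s to 18 s.
baseline_ms = [2000.0] * 99 + [4000.0]
current_ms = [2000.0] * 99 + [18000.0]

mean_alert = drifted(mean(baseline_ms), mean(current_ms))
p99_alert = drifted(p99(baseline_ms), p99(current_ms))
print(f"mean alert: {mean_alert}, p99 alert: {p99_alert}")
```

Here the mean moves from 2020 ms to 2160 ms, well under the threshold, so a mean-based alert stays silent while the p99 check fires. In practice the baseline would come from a trailing window in your metrics store, and the ratio would be tuned to your traffic.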

The practical rule

If you only remember one thing, remember this: averages are for summaries, percentiles are for systems.

In agent products, p95 and p99 are the metrics that determine whether users trust the workflow enough to come back.
