Cutting Claude API Costs by 89% with Prompt Caching

Al Amin Ahamed

Senior Engineer

8 min read

Anthropic shipped prompt caching in beta in mid-2024 and made it generally available shortly after. If you're calling Claude with a static system prompt + variable user input — which describes ~every chat application — you're leaving 80-90% of the cost on the table by not using it.

Here's the integration I built for this portfolio's AI chat.

How Prompt Caching Works

Claude's API lets you mark a section of the prompt as cacheable. On the first request, you pay the normal input rate (and a 25% surcharge to write the cache). On subsequent requests within 5 minutes, the cached portion costs 10% of the normal rate.
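Concretely, with Sonnet's $3/M input rate: a 1,000-token cached block costs 1,000 × $3.75/M = $0.00375 to write, then 1,000 × $0.30/M = $0.0003 per read. Every request after the first is a 10× saving on that portion of the prompt.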

The cache key is the byte-for-byte content of the marked section plus everything before it. Change one character and you get a cache miss.

What's Worth Caching

Three things almost always:

  1. System prompt — instructions, persona, formatting rules
  2. RAG context — retrieved documents (if they're stable across turns)
  3. Tool definitions — JSON schemas for function calling (see the sketch after this list)

What's NOT worth caching:

  • The user's message (changes every turn)
  • The conversation history if it grows on every turn (cache misses on each new message)
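
Tool definitions cache the same way as system blocks: per Anthropic's docs, a cache_control marker on the last tool covers every tool before it, since the key is prefix-based. A minimal sketch — the search_posts tool here is a hypothetical example, not something from this site:

'tools' => [
    [
        'name' => 'search_posts',   // hypothetical example tool
        'description' => 'Search blog posts by keyword.',
        'input_schema' => [
            'type' => 'object',
            'properties' => ['query' => ['type' => 'string']],
            'required' => ['query'],
        ],
        // Marking the LAST tool caches all tool definitions before it,
        // because the cache key is prefix-based.
        'cache_control' => ['type' => 'ephemeral'],
    ],
],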

The API Call

$response = Http::withHeaders([
    'x-api-key' => config('services.anthropic.key'),
    'anthropic-version' => '2023-06-01',
    'anthropic-beta' => 'prompt-caching-2024-07-31',
])->post('https://api.anthropic.com/v1/messages', [
    'model' => 'claude-sonnet-4-6',
    'max_tokens' => 1024,
    'system' => [
        [
            'type' => 'text',
            'text' => $this->buildSystemPrompt($context),
            'cache_control' => ['type' => 'ephemeral'],
        ],
    ],
    'messages' => [
        ['role' => 'user', 'content' => $userMessage],
    ],
]);

Two changes from a normal call:

  1. The anthropic-beta header (only needed during the beta; safe to drop once caching is generally available for your account)
  2. The system field becomes an array of blocks, with cache_control on the cacheable parts

The 1024-Token Minimum

The cacheable prefix must be at least 1024 tokens for Sonnet and Opus (2048 for Haiku; check the docs for current limits). Below that threshold the API silently skips caching and you pay the normal input rate.

For a chat app, this means the system prompt + RAG context combined needs to be substantial. My system prompt alone is ~600 tokens. Adding 8 RAG chunks at ~400 tokens each gets us to ~3800 tokens, comfortably above the threshold.
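
A cheap way to catch this in development: estimate the prefix size and warn when it's likely under the limit. This is a sketch assuming $systemPrompt holds the output of buildSystemPrompt; strlen()/4 is a rough English-text approximation, not Claude's actual tokenizer:

// Warn when the marked prefix is probably below the cache minimum,
// since the API will silently skip caching it and charge full rate.
// strlen()/4 is a crude heuristic, not Claude's real tokenizer.
$estimatedTokens = (int) (strlen($systemPrompt) / 4);

if ($estimatedTokens < 1024) {
    Log::warning('Cacheable prefix likely below 1024-token minimum', [
        'estimated_tokens' => $estimatedTokens,
    ]);
}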

Reading the Cache Hit

The API response includes usage stats:

{ "usage": { "input_tokens": 12, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 3800, "output_tokens": 280 } }
  • cache_creation_input_tokens > 0 → cache miss, you wrote the cache
  • cache_read_input_tokens > 0 → cache hit, you paid 10% rate
  • input_tokens → uncached portion (the user message + any unmarked content)

Log this to verify caching is working:

Log::info('Claude API call', [
    'cache_read' => $response->json('usage.cache_read_input_tokens', 0),
    'cache_write' => $response->json('usage.cache_creation_input_tokens', 0),
    'uncached' => $response->json('usage.input_tokens'),
]);

If cache_read is always 0, your cache key is changing between requests. Common culprits: timestamps in the system prompt, randomised greetings, request IDs.
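
For reference, a buildSystemPrompt that keeps the key stable looks deliberately boring: everything in it is deterministic. This is a sketch rather than the post's actual implementation, and the chunks key on $context is an assumption:

private function buildSystemPrompt(array $context): string
{
    // No date(), no rand(), no request IDs anywhere in here:
    // one changed byte in this string is a cache miss.
    $chunks = implode("\n\n", $context['chunks']);

    return "You are the AI assistant for this portfolio site.\n"
        . "Answer only from the context below.\n\n"
        . "Context:\n" . $chunks;
}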

The 5-Minute TTL

The cache lives for 5 minutes after the last hit. For chat applications this is usually fine — users in an active conversation will hit the cache repeatedly. Idle conversations expire and the next message rebuilds the cache.

For background batch processing, the 5-minute TTL is a problem. Solutions:

  • Process related items together so they share a cache window (see the sketch after this list)
  • Use longer-context Claude calls that bundle many items per request
  • Anthropic offers a 1-hour cache tier at extra cost (check the docs for current availability)
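
The first option can be as simple as sorting the queue so items sharing a prompt prefix run back to back. A sketch, where $items, prompt_key, and callClaude are all hypothetical:

// Group queued items by shared prompt prefix so consecutive calls
// land inside the same 5-minute cache window.
$batches = collect($items)->sortBy('prompt_key');

foreach ($batches as $item) {
    $this->callClaude($item);   // consecutive calls reuse the cached prefix
}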

Cache Granularity — Multiple Breakpoints

You can cache up to 4 sections of the prompt independently. Useful when one part is more stable than another:

'system' => [ [ 'type' => 'text', 'text' => $persona, // rarely changes 'cache_control' => ['type' => 'ephemeral'], ], [ 'type' => 'text', 'text' => $ragContext, // changes per topic 'cache_control' => ['type' => 'ephemeral'], ], ],

If only the RAG context changes, the persona block still hits cache. Order blocks from most to least stable: because the key is prefix-based, a change to the persona block would invalidate the RAG block behind it as well.

Cost Numbers

For the portfolio AI chat (claude-sonnet-4-6 pricing):

  • Without caching: 3800 input tokens × $3/M = $0.0114 per request
  • With caching (after first): 3800 cached × $0.30/M + ~50 user × $3/M = $0.0013 per request
  • ~89% reduction

A 100-turn conversation: $1.14 → roughly $0.14 (the first turn pays the 25% cache-write surcharge; the other 99 turns hit the cache).
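
If you log usage anyway, turning it into dollars is one function. The rates below are the Sonnet numbers used above; the $3.75/M write rate is the $3/M input rate plus the 25% surcharge. Output tokens are deliberately left out:

// Per-million-token input rates used in this post (claude-sonnet-4-6).
const INPUT_PER_M = 3.00;          // uncached input
const CACHE_WRITE_PER_M = 3.75;    // input rate + 25% write surcharge
const CACHE_READ_PER_M = 0.30;     // 10% of the input rate

function inputCost(array $usage): float
{
    return ($usage['input_tokens'] ?? 0) * INPUT_PER_M / 1_000_000
        + ($usage['cache_creation_input_tokens'] ?? 0) * CACHE_WRITE_PER_M / 1_000_000
        + ($usage['cache_read_input_tokens'] ?? 0) * CACHE_READ_PER_M / 1_000_000;
}

// The cache-hit response above: 12 uncached + 3800 cached reads
// ≈ $0.00004 + $0.00114 ≈ $0.0012 of input cost per request.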

When Not To Cache

  • One-shot calls where you'll never reuse the prompt
  • Prompts under the minimum token threshold
  • High-cardinality system prompts (e.g. per-user personalization with thousands of variants — each variant pays the write surcharge for a cache that may never be read)

For high-cardinality cases, restructure: put the user-specific bit as the user message, cache the generic bit.
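
Restructured, that looks like one shared system block for everyone, with the per-user material riding in the uncached user turn. $genericInstructions and $userProfile are placeholders:

'system' => [
    [
        'type' => 'text',
        'text' => $genericInstructions,   // identical for every user: one cache entry
        'cache_control' => ['type' => 'ephemeral'],
    ],
],
'messages' => [
    [
        'role' => 'user',
        // Per-user details go in the uncached user turn instead
        'content' => "User profile: {$userProfile}\n\n{$userMessage}",
    ],
],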
