Cutting Claude API Costs by 89% with Prompt Caching

Al Amin Ahamed

Senior Engineer

8 min read

Anthropic shipped prompt caching in beta in mid-2024 and made it generally available shortly after. If you're calling Claude with a static system prompt + variable user input — which describes ~every chat application — you're leaving 80-90% of the cost on the table by not using it.

Here's the integration I built for this portfolio's AI chat.

How Prompt Caching Works

Claude's API lets you mark a section of the prompt as cacheable. On the first request, you pay the normal input rate (and a 25% surcharge to write the cache). On subsequent requests within 5 minutes, the cached portion costs 10% of the normal rate.
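Concretely, with Sonnet's $3/M input rate: a 1,000-token cached block costs 1,000 × $3.75/M = $0.00375 to write, then 1,000 × $0.30/M = $0.0003 per read. Every request after the first is a 10× saving on that portion of the prompt.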

The cache key is the byte-for-byte content of the marked section plus everything before it. Change one character and you get a cache miss.

What's Worth Caching

Three things almost always:

  1. System prompt — instructions, persona, formatting rules
  2. RAG context — retrieved documents (if they're stable across turns)
  3. Tool definitions — JSON schemas for function calling (see the sketch after this list)

What's NOT worth caching:

  • The user's message (changes every turn)
  • The conversation history if it grows on every turn (cache misses on each new message)
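
Tool definitions cache the same way as system blocks: per Anthropic's docs, a cache_control marker on the last tool covers every tool before it, since the key is prefix-based. A minimal sketch — the search_posts tool here is a hypothetical example, not something from this site:

'tools' => [
    [
        'name' => 'search_posts',   // hypothetical example tool
        'description' => 'Search blog posts by keyword.',
        'input_schema' => [
            'type' => 'object',
            'properties' => ['query' => ['type' => 'string']],
            'required' => ['query'],
        ],
        // Marking the LAST tool caches all tool definitions before it,
        // because the cache key is prefix-based.
        'cache_control' => ['type' => 'ephemeral'],
    ],
],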

The API Call

$response = Http::withHeaders([
    'x-api-key' => config('services.anthropic.key'),
    'anthropic-version' => '2023-06-01',
    'anthropic-beta' => 'prompt-caching-2024-07-31',
])->post('https://api.anthropic.com/v1/messages', [
    'model' => 'claude-sonnet-4-6',
    'max_tokens' => 1024,
    'system' => [
        [
            'type' => 'text',
            'text' => $this->buildSystemPrompt($context),
            'cache_control' => ['type' => 'ephemeral'],
        ],
    ],
    'messages' => [
        ['role' => 'user', 'content' => $userMessage],
    ],
]);

Two changes from a normal call:

  1. The anthropic-beta header (only needed during the beta; safe to drop once caching is generally available for your account)
  2. The system field becomes an array of blocks, with cache_control on the cacheable parts

The 1024-Token Minimum

The cacheable prefix must be at least 1024 tokens for Sonnet and Opus (2048 for Haiku; check the docs for current limits). Below that threshold the API silently skips caching and you pay the normal input rate.

For a chat app, this means the system prompt + RAG context combined needs to be substantial. My system prompt alone is ~600 tokens. Adding 8 RAG chunks at ~400 tokens each gets us to ~3800 tokens, comfortably above the threshold.
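
A cheap way to catch this in development: estimate the prefix size and warn when it's likely under the limit. This is a sketch assuming $systemPrompt holds the output of buildSystemPrompt; strlen()/4 is a rough English-text approximation, not Claude's actual tokenizer:

// Warn when the marked prefix is probably below the cache minimum,
// since the API will silently skip caching it and charge full rate.
// strlen()/4 is a crude heuristic, not Claude's real tokenizer.
$estimatedTokens = (int) (strlen($systemPrompt) / 4);

if ($estimatedTokens < 1024) {
    Log::warning('Cacheable prefix likely below 1024-token minimum', [
        'estimated_tokens' => $estimatedTokens,
    ]);
}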

Reading the Cache Hit

The API response includes usage stats:

{ "usage": { "input_tokens": 12, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 3800, "output_tokens": 280 } }
  • cache_creation_input_tokens > 0 → cache miss, you wrote the cache
  • cache_read_input_tokens > 0 → cache hit, you paid 10% rate
  • input_tokens → uncached portion (the user message + any unmarked content)

Log this to verify caching is working:

Log::info('Claude API call', [
    'cache_read' => $response->json('usage.cache_read_input_tokens', 0),
    'cache_write' => $response->json('usage.cache_creation_input_tokens', 0),
    'uncached' => $response->json('usage.input_tokens'),
]);

If cache_read is always 0, your cache key is changing between requests. Common culprits: timestamps in the system prompt, randomised greetings, request IDs.
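
For reference, a buildSystemPrompt that keeps the key stable looks deliberately boring: everything in it is deterministic. This is a sketch rather than the post's actual implementation, and the chunks key on $context is an assumption:

private function buildSystemPrompt(array $context): string
{
    // No date(), no rand(), no request IDs anywhere in here:
    // one changed byte in this string is a cache miss.
    $chunks = implode("\n\n", $context['chunks']);

    return "You are the AI assistant for this portfolio site.\n"
        . "Answer only from the context below.\n\n"
        . "Context:\n" . $chunks;
}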

The 5-Minute TTL

The cache lives for 5 minutes after the last hit. For chat applications this is usually fine — users in an active conversation will hit the cache repeatedly. Idle conversations expire and the next message rebuilds the cache.

For background batch processing, the 5-minute TTL is a problem. Solutions:

  • Process related items together so they share a cache window (see the sketch after this list)
  • Use longer-context Claude calls that bundle many items per request
  • Anthropic offers a 1-hour cache tier at extra cost (check the docs for current availability)
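
The first option can be as simple as sorting the queue so items sharing a prompt prefix run back to back. A sketch, where $items, prompt_key, and callClaude are all hypothetical:

// Group queued items by shared prompt prefix so consecutive calls
// land inside the same 5-minute cache window.
$batches = collect($items)->sortBy('prompt_key');

foreach ($batches as $item) {
    $this->callClaude($item);   // consecutive calls reuse the cached prefix
}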

Cache Granularity — Multiple Breakpoints

You can cache up to 4 sections of the prompt independently. Useful when one part is more stable than another:

'system' => [ [ 'type' => 'text', 'text' => $persona, // rarely changes 'cache_control' => ['type' => 'ephemeral'], ], [ 'type' => 'text', 'text' => $ragContext, // changes per topic 'cache_control' => ['type' => 'ephemeral'], ], ],

If only the RAG context changes, the persona block still hits cache. Order blocks from most to least stable: because the key is prefix-based, a change to the persona block would invalidate the RAG block behind it as well.

Cost Numbers

For the portfolio AI chat (claude-sonnet-4-6 pricing):

  • Without caching: 3800 input tokens × $3/M = $0.0114 per request
  • With caching (after first): 3800 cached × $0.30/M + ~50 user × $3/M = $0.0013 per request
  • ~89% reduction

A 100-turn conversation: $1.14 → roughly $0.14 (the first turn pays the 25% cache-write surcharge; the other 99 turns hit the cache).
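
If you log usage anyway, turning it into dollars is one function. The rates below are the Sonnet numbers used above; the $3.75/M write rate is the $3/M input rate plus the 25% surcharge. Output tokens are deliberately left out:

// Per-million-token input rates used in this post (claude-sonnet-4-6).
const INPUT_PER_M = 3.00;          // uncached input
const CACHE_WRITE_PER_M = 3.75;    // input rate + 25% write surcharge
const CACHE_READ_PER_M = 0.30;     // 10% of the input rate

function inputCost(array $usage): float
{
    return ($usage['input_tokens'] ?? 0) * INPUT_PER_M / 1_000_000
        + ($usage['cache_creation_input_tokens'] ?? 0) * CACHE_WRITE_PER_M / 1_000_000
        + ($usage['cache_read_input_tokens'] ?? 0) * CACHE_READ_PER_M / 1_000_000;
}

// The cache-hit response above: 12 uncached + 3800 cached reads
// ≈ $0.00004 + $0.00114 ≈ $0.0012 of input cost per request.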

When Not To Cache

  • One-shot calls where you'll never reuse the prompt
  • Prompts under the minimum token threshold
  • High-cardinality system prompts (e.g. per-user personalization with thousands of variants — each variant pays the write surcharge for a cache that may never be read)

For high-cardinality cases, restructure: put the user-specific bit as the user message, cache the generic bit.
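
Restructured, that looks like one shared system block for everyone, with the per-user material riding in the uncached user turn. $genericInstructions and $userProfile are placeholders:

'system' => [
    [
        'type' => 'text',
        'text' => $genericInstructions,   // identical for every user: one cache entry
        'cache_control' => ['type' => 'ephemeral'],
    ],
],
'messages' => [
    [
        'role' => 'user',
        // Per-user details go in the uncached user turn instead
        'content' => "User profile: {$userProfile}\n\n{$userMessage}",
    ],
],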
