Skyrocketing API Costs — A Problem Everyone Faces
Last month, I opened my LLM API bill and got a shock: $340 in just 30 days, even though the app didn’t have many users yet. After a few hours digging through the logs, I found two main culprits: over 60% of tokens came from the same system prompt being sent with every single request, and hundreds of small requests were being called individually instead of batched together.
Claude, GPT-4, Gemini — every LLM charges by the token, and sooner or later this will catch up with you. Here are 3 techniques that helped me cut costs by around 55% without any drop in output quality.
3 Main Reasons API Costs Spiral Out of Control
LLM APIs charge by token count — inputs plus outputs. In practice, three things tend to eat up tokens the most:
- Long repeated system prompts: Every request sends the same system prompt, but you’re billed in full for it every single time.
- Piecemeal requests: Calling the API one at a time instead of batching — throwing away the discount that batch APIs offer.
- “Junk” tokens in prompts: Extra whitespace, repeated instructions, unnecessary context — it all shows up on the bill.
Technique 1: Prompt Caching — Only Pay for What Changes
The concept is simple: the provider “remembers” the unchanged portion of your prompt between requests. The first call is billed at the normal rate. Subsequent calls cost only 10% of the original price for the cached portion (with Anthropic). It sounds modest, but with a 2,000-token system prompt called 10,000 times per day, the savings add up fast.
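To put numbers on that claim, here is a back-of-the-envelope sketch using the Sonnet rates quoted later in this post ($3/MTok base input, $3.75/MTok cache write, $0.30/MTok cache read), assuming the cache stays warm all day:

```python
PROMPT_TOKENS = 2_000
CALLS_PER_DAY = 10_000

# Without caching: the full prompt is billed at the base input rate every call
without_cache = PROMPT_TOKENS * CALLS_PER_DAY * 3 / 1_000_000

# With caching: one cache write, then cache reads at 10% of the base rate
# (assumes the 5-minute cache never expires between calls)
with_cache = (
    PROMPT_TOKENS * 3.75 / 1_000_000
    + PROMPT_TOKENS * (CALLS_PER_DAY - 1) * 0.30 / 1_000_000
)

print(f"Without cache: ${without_cache:.2f}/day")  # $60.00/day
print(f"With cache:    ${with_cache:.2f}/day")     # $6.01/day
```

Roughly a 10x reduction on that one prompt's input cost, before counting output tokens. Verify current rates on Anthropic's pricing page before relying on these numbers.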
Anthropic Claude supports caching from Claude 3.5 onward. Just add cache_control to the section you want cached:
```python
import anthropic

client = anthropic.Anthropic()

# Long system prompt (e.g., 2000 tokens) — the part to cache
system_prompt = """
You are a financial analysis expert with 15 years of experience...
[full long instruction content here]
"""

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral"},  # Mark for caching
        }
    ],
    messages=[
        {"role": "user", "content": "Analyze this Q1 financial report..."}
    ],
)

# Check cache hits
usage = response.usage
print(f"Cache read tokens: {usage.cache_read_input_tokens}")
print(f"Cache write tokens: {usage.cache_creation_input_tokens}")
print(f"Regular input tokens: {usage.input_tokens}")
```
Subsequent requests using the same system prompt read that portion from cache instead of reprocessing it. The cache lives for at least 5 minutes and auto-refreshes each time it’s used. For chatbots or AI assistants where many users share a long system prompt, savings can reach 80–90% of input tokens.
When Should You Use Prompt Caching?
- System prompt exceeds 1,024 tokens (Anthropic’s minimum caching threshold)
- Many requests share a long common prefix — for example, entire documentation used as context
- RAG with a fixed knowledge base — cache the document section, only swap out the user query each time
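For the RAG case, one pattern is to keep the knowledge base in a cached system block and swap only the user query. A minimal sketch (the function name and `knowledge_base` are hypothetical stand-ins for your own corpus):

```python
def build_rag_request(knowledge_base: str, user_query: str) -> dict:
    """Build request params with the static document corpus marked for caching."""
    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": 1024,
        "system": [
            {"type": "text", "text": "Answer using only the documents below."},
            {
                "type": "text",
                "text": knowledge_base,
                "cache_control": {"type": "ephemeral"},  # cache the static docs
            },
        ],
        # Only this part changes between requests
        "messages": [{"role": "user", "content": user_query}],
    }

# Usage: client.messages.create(**build_rag_request(docs, "What is our refund policy?"))
```

Because the cached prefix must be byte-identical between calls, keep the knowledge base first and the varying query last.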
Technique 2: Batching — Bundle Requests to Cut Costs
Instead of calling the API 1,000 times for 1,000 items, the Batch API lets you send everything at once and receive results asynchronously later. Anthropic’s Message Batches API accepts up to 10,000 requests per batch at just 50% of the standard per-call price. The only trade-off: no real-time results.
```python
import anthropic
import json

client = anthropic.Anthropic()

# Items to process
items_to_process = [
    {"id": "item_001", "text": "Classify sentiment: 'Amazing product!'"},
    {"id": "item_002", "text": "Classify sentiment: 'Delivery was way too slow.'"},
    {"id": "item_003", "text": "Classify sentiment: 'It was okay, nothing special.'"},
    # ... add hundreds/thousands of items
]

# Build batch requests
requests = [
    {
        "custom_id": item["id"],
        "params": {
            "model": "claude-haiku-4-5-20251001",  # Cheaper model for simple tasks
            "max_tokens": 50,  # Limit output length
            "messages": [{"role": "user", "content": item["text"]}],
        },
    }
    for item in items_to_process
]

# Send batch
batch = client.messages.batches.create(requests=requests)
print(f"Batch ID: {batch.id}")
print(f"Status: {batch.processing_status}")

# Save batch_id to query later (batch may take minutes to hours);
# write one JSON object per line so repeated appends stay parseable
with open("batch_ids.json", "a") as f:
    f.write(json.dumps({"batch_id": batch.id, "count": len(requests)}) + "\n")
```

A separate script retrieves the results once the batch is complete:

```python
import anthropic

client = anthropic.Anthropic()
batch_id = "msgbatch_xxx"  # ID from when the batch was created

batch = client.messages.batches.retrieve(batch_id)
if batch.processing_status == "ended":
    results = {}
    for result in client.messages.batches.results(batch_id):
        if result.result.type == "succeeded":
            results[result.custom_id] = result.result.message.content[0].text
    print(f"Done processing: {len(results)} items")
```
Tips for Using the Batch API
- Use a smaller model for simple tasks (classification, extraction): `claude-haiku` instead of `claude-sonnet`
- Set `max_tokens` close to the actual expected output — don't set 4096 when you only need 50 tokens
- Design your application to be async when real-time isn't needed: nightly processing, data pipelines
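Since a single batch tops out at 10,000 requests, larger jobs need to be split before submission. A minimal chunking helper (a sketch, not part of the SDK):

```python
def chunk_requests(requests: list, batch_size: int = 10_000) -> list:
    """Split a request list into sublists the Batch API will accept."""
    return [requests[i:i + batch_size] for i in range(0, len(requests), batch_size)]

# Usage: for chunk in chunk_requests(all_requests):
#            client.messages.batches.create(requests=chunk)
```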
Technique 3: Cutting Out Unnecessary Tokens
The simplest technique, and the most commonly overlooked. I once reviewed a codebase and found a prompt where over 500 tokens were nothing but whitespace, redundant comments, and repeated instructions. 500 tokens × millions of requests is far from trivial.
3.1 Compress Your System Prompt
```python
# BAD — verbose, lots of extra whitespace (~120 tokens)
system_prompt_bad = """
You are a smart and helpful AI assistant.
You will help users answer their questions.
When responding, you should:
- Answer clearly and completely
- Use easy-to-understand language
- Don't say things you're not sure about
"""

# GOOD — concise, same meaning, saves ~70% tokens (~35 tokens)
system_prompt_good = "Helpful AI assistant. Answer concisely and accurately. Acknowledge when uncertain."
```
3.2 Trim Conversation History
A chatbot with long conversation history is a common token trap. A user chats for two hours, every request sends the full history — by the end of the session, each message costs 10–20x what it did at the start. Keeping only the N most recent messages is enough:
```python
def trim_conversation_history(messages: list, max_messages: int = 10) -> list:
    """Keep only the N most recent messages to reduce token count."""
    if len(messages) <= max_messages:
        return messages
    # Always keep the first message (usually important context)
    return [messages[0]] + messages[-(max_messages - 1):]

def count_tokens_estimate(text: str) -> int:
    """Estimate token count: ~3 characters = 1 token for mixed-language content."""
    return len(text) // 3

# Check before sending (load_conversation_history is your app's own storage layer)
conversation = load_conversation_history(user_id)
total_tokens = sum(count_tokens_estimate(m["content"]) for m in conversation)
if total_tokens > 50_000:  # Warning threshold
    conversation = trim_conversation_history(conversation, max_messages=6)
```
3.3 Match the Model to the Task
I often see people defaulting to the most powerful model for everything. But classifying sentiment in a short piece of text doesn’t need Opus. Here’s a quick reference for choosing the right model for each use case:
- Simple classification, sentiment analysis, extraction: Claude Haiku (~20x cheaper than Opus)
- Content writing, summarization, general Q&A: Claude Sonnet (the quality/price sweet spot)
- Complex reasoning, code review, deep analysis: Claude Opus or the latest Sonnet
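The tiers above can be encoded as a simple routing table so the choice is made once, not per call site. A sketch (the task labels are illustrative, and `claude-opus-4-1` is an assumed model id; substitute the current ids for your account):

```python
MODEL_BY_TASK = {
    "classify": "claude-haiku-4-5-20251001",
    "extract": "claude-haiku-4-5-20251001",
    "summarize": "claude-sonnet-4-6",
    "qa": "claude-sonnet-4-6",
    "reason": "claude-opus-4-1",
    "code_review": "claude-opus-4-1",
}

def pick_model(task: str) -> str:
    # Unknown tasks fall back to Sonnet rather than the priciest model
    return MODEL_BY_TASK.get(task, "claude-sonnet-4-6")
```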
Measure to Know You’re Optimizing in the Right Direction
Optimizing without measuring is flying blind. I track 3 metrics: total tokens per request, cache hit rate, and estimated cost per feature. Add logging from day one — don’t wait until you get a “shocking” bill to start:
```python
def log_api_usage(response) -> dict:
    usage = response.usage
    cache_read = getattr(usage, 'cache_read_input_tokens', 0)
    cache_write = getattr(usage, 'cache_creation_input_tokens', 0)
    total_input = usage.input_tokens + cache_read + cache_write
    cache_hit_rate = cache_read / total_input if total_input > 0 else 0

    # Claude Sonnet 4.6 pricing (verify current rates on the Anthropic website)
    cost = (
        usage.input_tokens * 3 / 1_000_000       # Regular input: $3/MTok
        + cache_write * 3.75 / 1_000_000         # Cache write: $3.75/MTok
        + cache_read * 0.30 / 1_000_000          # Cache read: $0.30/MTok
        + usage.output_tokens * 15 / 1_000_000   # Output: $15/MTok
    )
    print(f"Cost: ${cost:.4f} | Cache hit: {cache_hit_rate:.1%} | Output: {usage.output_tokens} tok")
    return {"cost": cost, "cache_hit_rate": cache_hit_rate}
```
Results and Implementation Priority
After applying all three techniques, my API bill dropped from $340 to ~$150/month — nearly 56% — with no change in output quality, and latency actually improved thanks to batch processing.
If you’re just starting to optimize, here’s the order I recommend:
- Token reduction first: Compress prompts, remove extra whitespace — no code changes needed, you can do it right now
- Add prompt caching: System prompt over 1,024 tokens? Just add `cache_control` and you're done
- Switch to batching: When you have use cases that don't need real-time results — data processing, bulk content generation
Add monitoring from day one. Set cost threshold alerts. Review token usage regularly per feature. These three small habits — once you start scaling — will save more money than any optimization technique alone.
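A threshold alert can be as small as a per-feature daily accumulator fed by the cost figure each request produces (e.g. from `log_api_usage` above). A minimal sketch; `DAILY_BUDGET_USD` and the function name are hypothetical, and a real setup would persist totals and notify a channel instead of returning strings:

```python
from collections import defaultdict
from datetime import date

DAILY_BUDGET_USD = 10.0  # hypothetical per-feature daily budget
daily_cost = defaultdict(float)

def record_cost(feature: str, cost: float) -> list:
    """Add one request's cost to today's per-feature total; return any alerts."""
    key = f"{date.today()}:{feature}"
    daily_cost[key] += cost
    if daily_cost[key] > DAILY_BUDGET_USD:
        return [f"ALERT: {feature} exceeded ${DAILY_BUDGET_USD:.2f}/day"]
    return []
```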