Claude API with Python: A Practical Guide to Building AI Applications from A to Z

Artificial Intelligence tutorial - IT technology blog

Get Claude API Running in Your First 5 Minutes

I spent almost an entire morning digging through the docs of various AI APIs before trying Claude. The verdict: Claude API is the easiest to get started with — a clean SDK, consistent responses, and few unexpected quirks. If you’ve used the OpenAI SDK before, you’ll feel right at home.

Install the library first:

pip install anthropic

Done. Now let’s make an API call:

import anthropic

client = anthropic.Anthropic(api_key="sk-ant-api03-...")

message = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Explain what Docker is in 3 short sentences"}
    ]
)

print(message.content[0].text)

Run it — you’ll see an answer appear within a few seconds. That’s everything you need to have a working AI application. The rest is about making it good.

Get your API key at console.anthropic.com → API Keys → Create Key. Never hardcode the key into your code — use environment variables:

# Add to ~/.bashrc or ~/.zshrc
export ANTHROPIC_API_KEY="sk-ant-api03-..."

Then in Python:

# The SDK reads ANTHROPIC_API_KEY automatically if api_key is not passed
client = anthropic.Anthropic()  # No api_key needed!

Understanding the API Structure to Avoid Common Mistakes

Messages API — Using It Correctly

Claude uses the Messages API, not the Completion API used by older models. A conversation is structured as a list of alternating user and assistant messages:

messages = [
    {"role": "user", "content": "I'm learning Python"},
    {"role": "assistant", "content": "Python is a great choice! Where would you like to start?"},
    {"role": "user", "content": "Can we start with list comprehensions?"}
]

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    messages=messages
)

Rule: the first message must be user, and messages must alternate between user and assistant. Breaking this rule throws an error immediately — I hit this bug while building a chatbot because I forgot to properly trim the history.
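Before a call goes out, a cheap local guard can catch a malformed history early; this validator is my own sketch, not something shipped with the SDK:

```python
def validate_history(messages):
    """Check the user-first, alternating-roles rule before sending."""
    if not messages or messages[0]["role"] != "user":
        raise ValueError("The first message must have role 'user'")
    for prev, curr in zip(messages, messages[1:]):
        if prev["role"] == curr["role"]:
            raise ValueError(f"Two consecutive '{curr['role']}' messages")

validate_history([
    {"role": "user", "content": "I'm learning Python"},
    {"role": "assistant", "content": "Great choice!"},
    {"role": "user", "content": "List comprehensions first?"},
])  # passes silently; a trimming bug would raise here instead
```

Running this right before `client.messages.create` turns a confusing API error into an obvious local traceback.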

System Prompt — Your Main Tool for Controlling Output

The system prompt is not inside messages but is a separate parameter:

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    system="You are a senior DevOps engineer. Keep answers concise, use bullet points. Always mention security implications.",
    messages=[
        {"role": "user", "content": "How do I set up an nginx reverse proxy?"}
    ]
)

Many other APIs stuff the system message at the top of the messages list like a regular message. Claude separates it into an independent parameter — the model processes this instruction with higher priority and is less likely to “forget” your requirements as the conversation grows longer.

Streaming — Essential for Good UX

Nobody wants to stare at a blank screen for 10 seconds before text appears. Streaming solves this:

import anthropic

client = anthropic.Anthropic()

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Write a PostgreSQL database backup script"}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
    print()  # New line after completion

With Flask or FastAPI, you can stream directly to the browser using Server-Sent Events:

from flask import Flask, Response, stream_with_context
import anthropic
import json

app = Flask(__name__)
client = anthropic.Anthropic()

@app.route("/chat")
def chat():
    def generate():
        with client.messages.stream(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            messages=[{"role": "user", "content": "Hello!"}]
        ) as stream:
            for text in stream.text_stream:
                yield f"data: {json.dumps({'text': text})}\n\n"
        yield "data: [DONE]\n\n"
    
    return Response(
        stream_with_context(generate()),
        mimetype="text/event-stream"
    )

Advanced Techniques for Production Applications

Prompt Caching — Cut Costs by 90%

Got a system prompt that’s several thousand tokens long and you’re calling it hundreds of times a day? You’re throwing money out the window. I enabled caching on a project with a ~3,000-token system prompt at around 120 requests/day, and the bill dropped from $2.10 to about $0.40 per day:

# Enable prompt caching for long system prompts
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": """You are a technical assistant for an e-commerce system. 
            [... 5000 words of system documentation ...]""",
            "cache_control": {"type": "ephemeral"}  # Cache it!
        }
    ],
    messages=[{"role": "user", "content": query}]
)

The first request costs slightly more than full price to write the cache. From the second request onward, the cached portion is charged at just 10% of the normal rate, as long as the cache entry is still warm (ephemeral cache entries expire after roughly five minutes of inactivity, refreshed on each hit). With a 3,000-token system prompt and 100 requests per day, that’s roughly $50 in savings per month.
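The arithmetic behind that kind of estimate is easy to sanity-check yourself. The per-million-token price below is a placeholder (look up the current rate for the model you use), and the one-time cache-write surcharge on the first request is ignored:

```python
# Placeholder pricing -- substitute the current rate for your model.
BASE_INPUT_PRICE = 3.00     # assumed $/million input tokens
CACHED_READ_RATE = 0.10     # cached reads billed at ~10% of the base rate

def monthly_cache_savings(prompt_tokens, requests_per_day, days=30):
    """Savings from reading a cached system prompt instead of resending it."""
    tokens = prompt_tokens * requests_per_day * days
    full_cost = tokens / 1_000_000 * BASE_INPUT_PRICE
    cached_cost = full_cost * CACHED_READ_RATE
    return full_cost - cached_cost

print(f"${monthly_cache_savings(3000, 100):.2f}/month")
```

The savings scale linearly with the model’s price tier, which is why the exact dollar figure depends entirely on which model you cache against.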

Automatic Retry with Exponential Backoff

Rate limits and network errors are a fact of life. Don’t let your application crash because of transient failures:

import anthropic
import time

def call_claude_with_retry(messages, max_retries=3, **kwargs):
    client = anthropic.Anthropic()
    
    for attempt in range(max_retries):
        try:
            return client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                messages=messages,
                **kwargs
            )
        except anthropic.RateLimitError:
            if attempt == max_retries - 1:
                raise
            wait_time = 2 ** attempt  # 1s, 2s, 4s
            print(f"Rate limit hit, waiting {wait_time}s...")
            time.sleep(wait_time)
        except anthropic.APIConnectionError:
            if attempt == max_retries - 1:
                raise
            time.sleep(1)

Structured Output with JSON Mode

When you need to parse a response into structured data, don’t use regex — use JSON in the prompt:

import json

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system="Always respond with valid JSON only, no text outside the JSON.",
    messages=[{
        "role": "user",
        "content": """Analyze the following sentence and return JSON:
        'PostgreSQL server version 14.5 crashed with OOM error at 3am on February 15th'
        
        Format: {"component": str, "version": str, "error_type": str, "time": str}"""
    }]
)

data = json.loads(response.content[0].text)
print(data["error_type"])  # "OOM"
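Even with that system instruction, the model occasionally wraps its JSON in markdown code fences, which breaks a bare json.loads. A small defensive parser (my own helper, not an SDK feature) handles both cases:

```python
import json
import re

def parse_json_reply(text: str):
    """Parse a model reply that may or may not wrap its JSON in ``` fences."""
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL)
    if match:
        text = match.group(1)
    return json.loads(text)

print(parse_json_reply('```json\n{"error_type": "OOM"}\n```'))  # {'error_type': 'OOM'}
```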

Practical Tips from Real-World Project Experience

Getting the API call working is just the first step. The more mentally taxing part — and the one easily ignored until the bill arrives — is managing costs and keeping output quality consistent. Here’s what I’ve learned, mostly the hard way:

  • Choose the right model for the task: Haiku for simple tasks (classify, extract), Sonnet for medium tasks (summarize, translate, code review), Opus for complex tasks (analysis, reasoning). Using Opus for everything costs 15x more than Haiku.
  • Set max_tokens close to your actual needs: If a response only needs 200 tokens, don’t set 4096. You’re only billed for tokens actually generated, but a generous cap lets a verbose answer run long, adding latency and cost; a tight cap keeps responses bounded.
  • Log token usage: The response object returns message.usage with input_tokens and output_tokens. Log these to track costs and spot any prompts consuming abnormally high token counts.
  • Test with claude-haiku-4-5-20251001 first: While developing, use Haiku to iterate quickly and cheaply. Only switch to Sonnet/Opus when you need to test real quality.
  • Limit conversation history: Keep only the most recent 10-20 messages instead of the entire history. The context window is large, but costs accumulate quickly.

# Efficient conversation history management pattern
def trim_history(messages, max_pairs=10):
    """Keep only the most recent max_pairs user/assistant pairs"""
    if len(messages) > max_pairs * 2:
        return messages[-(max_pairs * 2):]
    return messages
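
A quick check that the trimming preserves the alternation rule: with an even-length, user-first history, slicing off whole pairs keeps the first remaining message a user message. (Trim complete pairs only; slicing an odd-length list could leave an assistant message first.)

```python
# trim_history() as above, repeated so this snippet runs standalone
def trim_history(messages, max_pairs=10):
    if len(messages) > max_pairs * 2:
        return messages[-(max_pairs * 2):]
    return messages

history = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"msg {i}"}
    for i in range(30)  # 15 complete user/assistant pairs
]
trimmed = trim_history(history)
print(len(trimmed), trimmed[0]["role"])  # 20 user
```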

# Log usage to track costs
response = client.messages.create(...)
print(f"Input: {response.usage.input_tokens} tokens | Output: {response.usage.output_tokens} tokens")

One more mistake I often see beginners make: committing an API key to GitHub in source code. Anthropic has an automated scanning system that will immediately revoke the key if detected. Use python-dotenv with a .env file added to .gitignore.

# .env
ANTHROPIC_API_KEY=sk-ant-api03-...

# .gitignore
.env
*.env

# main.py
from dotenv import load_dotenv
load_dotenv()

client = anthropic.Anthropic()  # Reads automatically from .env
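A fail-fast check at startup gives a much clearer message than an authentication error on the first request. The helper below is my own convention, not part of the SDK; call it once right after load_dotenv():

```python
import os

def require_api_key():
    """Abort early with a readable message if the key is missing."""
    if not os.environ.get("ANTHROPIC_API_KEY"):
        raise SystemExit("ANTHROPIC_API_KEY is not set. Check your .env or shell profile.")
```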

The quick start above is enough to get a prototype running in a few hours. From there, expand in order of priority: streaming first (improves UX immediately), then caching when costs start to climb, and finally retry and error handling when deploying to production. Claude API is quite stable — the codebase I wrote 8 months ago still runs fine, with no breaking changes requiring major migrations.
