Claude API with Python: A Practical Guide to Building AI Applications from A to Z

Artificial Intelligence tutorial - IT technology blog

Get Claude API Running in Your First 5 Minutes

I spent almost an entire morning digging through the docs of various AI APIs before trying Claude. The verdict: Claude API is the easiest to get started with — a clean SDK, consistent responses, and few unexpected quirks. If you’ve used the OpenAI SDK before, you’ll feel right at home.

Install the library first:

pip install anthropic

Done. Now let’s make an API call:

import anthropic

client = anthropic.Anthropic(api_key="sk-ant-api03-...")

message = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Explain what Docker is in 3 short sentences"}
    ]
)

print(message.content[0].text)

Run it — you’ll see an answer appear within a few seconds. That’s everything you need to have a working AI application. The rest is about making it good.

Get your API key at console.anthropic.com → API Keys → Create Key. Never hardcode the key into your code — use environment variables:

# Add to ~/.bashrc or ~/.zshrc
export ANTHROPIC_API_KEY="sk-ant-api03-..."

Then in Python:

# The SDK reads ANTHROPIC_API_KEY automatically if api_key is not passed
client = anthropic.Anthropic()  # No api_key needed!

Understanding the API Structure to Avoid Common Mistakes

Messages API — Using It Correctly

Claude uses the Messages API, not the Completion API used by older models. A conversation is structured as a list of alternating user and assistant messages:

messages = [
    {"role": "user", "content": "I'm learning Python"},
    {"role": "assistant", "content": "Python is a great choice! Where would you like to start?"},
    {"role": "user", "content": "Can we start with list comprehensions?"}
]

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    messages=messages
)

Rule: the first message must be user, and messages must alternate between user and assistant. Breaking this rule throws an error immediately — I hit this bug while building a chatbot because I forgot to properly trim the history.
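Before a call goes out, a cheap local guard can catch a malformed history early; this validator is my own sketch, not something shipped with the SDK:

```python
def validate_history(messages):
    """Check the user-first, alternating-roles rule before sending."""
    if not messages or messages[0]["role"] != "user":
        raise ValueError("The first message must have role 'user'")
    for prev, curr in zip(messages, messages[1:]):
        if prev["role"] == curr["role"]:
            raise ValueError(f"Two consecutive '{curr['role']}' messages")

validate_history([
    {"role": "user", "content": "I'm learning Python"},
    {"role": "assistant", "content": "Great choice!"},
    {"role": "user", "content": "List comprehensions first?"},
])  # passes silently; a trimming bug would raise here instead
```

Running this right before `client.messages.create` turns a confusing API error into an obvious local traceback.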

System Prompt — Your Main Tool for Controlling Output

The system prompt is not inside messages but is a separate parameter:

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    system="You are a senior DevOps engineer. Keep answers concise, use bullet points. Always mention security implications.",
    messages=[
        {"role": "user", "content": "How do I set up an nginx reverse proxy?"}
    ]
)

Many other APIs stuff the system message at the top of the messages list like a regular message. Claude separates it into an independent parameter — the model processes this instruction with higher priority and is less likely to “forget” your requirements as the conversation grows longer.

Streaming — Essential for Good UX

Nobody wants to stare at a blank screen for 10 seconds before text appears. Streaming solves this:

import anthropic

client = anthropic.Anthropic()

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Write a PostgreSQL database backup script"}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
    print()  # New line after completion

With Flask or FastAPI, you can stream directly to the browser using Server-Sent Events:

from flask import Flask, Response, stream_with_context
import anthropic
import json

app = Flask(__name__)
client = anthropic.Anthropic()

@app.route("/chat")
def chat():
    def generate():
        with client.messages.stream(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            messages=[{"role": "user", "content": "Hello!"}]
        ) as stream:
            for text in stream.text_stream:
                yield f"data: {json.dumps({'text': text})}\n\n"
        yield "data: [DONE]\n\n"
    
    return Response(
        stream_with_context(generate()),
        mimetype="text/event-stream"
    )

Advanced Techniques for Production Applications

Prompt Caching — Cut Costs by 90%

Got a system prompt that’s several thousand tokens long and you’re calling it hundreds of times a day? You’re throwing money out the window. I enabled caching on a project with a ~3,000-token system prompt at around 120 requests/day, and the bill dropped from $2.10 to about $0.40 per day:

# Enable prompt caching for long system prompts
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": """You are a technical assistant for an e-commerce system. 
            [... 5000 words of system documentation ...]""",
            "cache_control": {"type": "ephemeral"}  # Cache it!
        }
    ],
    messages=[{"role": "user", "content": query}]
)

The first request costs slightly more than full price to write the cache. From the second request onward, the cached portion is charged at just 10% of the normal rate, as long as the cache entry is still warm (ephemeral cache entries expire after roughly five minutes of inactivity, refreshed on each hit). With a 3,000-token system prompt and 100 requests per day, that’s roughly $50 in savings per month.
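The arithmetic behind that kind of estimate is easy to sanity-check yourself. The per-million-token price below is a placeholder (look up the current rate for the model you use), and the one-time cache-write surcharge on the first request is ignored:

```python
# Placeholder pricing -- substitute the current rate for your model.
BASE_INPUT_PRICE = 3.00     # assumed $/million input tokens
CACHED_READ_RATE = 0.10     # cached reads billed at ~10% of the base rate

def monthly_cache_savings(prompt_tokens, requests_per_day, days=30):
    """Savings from reading a cached system prompt instead of resending it."""
    tokens = prompt_tokens * requests_per_day * days
    full_cost = tokens / 1_000_000 * BASE_INPUT_PRICE
    cached_cost = full_cost * CACHED_READ_RATE
    return full_cost - cached_cost

print(f"${monthly_cache_savings(3000, 100):.2f}/month")
```

The savings scale linearly with the model’s price tier, which is why the exact dollar figure depends entirely on which model you cache against.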

Automatic Retry with Exponential Backoff

Rate limits and network errors are a fact of life. Don’t let your application crash because of transient failures:

import anthropic
import time

def call_claude_with_retry(messages, max_retries=3, **kwargs):
    client = anthropic.Anthropic()
    
    for attempt in range(max_retries):
        try:
            return client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                messages=messages,
                **kwargs
            )
        except anthropic.RateLimitError:
            if attempt == max_retries - 1:
                raise
            wait_time = 2 ** attempt  # 1s, 2s, 4s
            print(f"Rate limit hit, waiting {wait_time}s...")
            time.sleep(wait_time)
        except anthropic.APIConnectionError:
            if attempt == max_retries - 1:
                raise
            time.sleep(1)

Structured Output with JSON Mode

When you need to parse a response into structured data, don’t use regex — use JSON in the prompt:

import json

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system="Always respond with valid JSON only, no text outside the JSON.",
    messages=[{
        "role": "user",
        "content": """Analyze the following sentence and return JSON:
        'PostgreSQL server version 14.5 crashed with OOM error at 3am on February 15th'
        
        Format: {"component": str, "version": str, "error_type": str, "time": str}"""
    }]
)

data = json.loads(response.content[0].text)
print(data["error_type"])  # "OOM"
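Even with that system instruction, the model occasionally wraps its JSON in markdown code fences, which breaks a bare json.loads. A small defensive parser (my own helper, not an SDK feature) handles both cases:

```python
import json
import re

def parse_json_reply(text: str):
    """Parse a model reply that may or may not wrap its JSON in ``` fences."""
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL)
    if match:
        text = match.group(1)
    return json.loads(text)

print(parse_json_reply('```json\n{"error_type": "OOM"}\n```'))  # {'error_type': 'OOM'}
```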

Practical Tips from Real-World Project Experience

Getting the API call working is just the first step. The more mentally taxing part — and the one easily ignored until the bill arrives — is managing costs and keeping output quality consistent. Here’s what I’ve learned, mostly the hard way:

  • Choose the right model for the task: Haiku for simple tasks (classify, extract), Sonnet for medium tasks (summarize, translate, code review), Opus for complex tasks (analysis, reasoning). Using Opus for everything costs 15x more than Haiku.
  • Set max_tokens close to your actual needs: If a response only needs 200 tokens, don’t set 4096. You’re only billed for tokens actually generated, but a generous cap lets a verbose answer run long, adding latency and cost; a tight cap keeps responses bounded.
  • Log token usage: The response object returns message.usage with input_tokens and output_tokens. Log these to track costs and spot any prompts consuming abnormally high token counts.
  • Test with claude-haiku-4-5-20251001 first: While developing, use Haiku to iterate quickly and cheaply. Only switch to Sonnet/Opus when you need to test real quality.
  • Limit conversation history: Keep only the most recent 10-20 messages instead of the entire history. The context window is large, but costs accumulate quickly.

# Efficient conversation history management pattern
def trim_history(messages, max_pairs=10):
    """Keep only the most recent max_pairs user/assistant pairs"""
    if len(messages) > max_pairs * 2:
        return messages[-(max_pairs * 2):]
    return messages
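
A quick check that the trimming preserves the alternation rule: with an even-length, user-first history, slicing off whole pairs keeps the first remaining message a user message. (Trim complete pairs only; slicing an odd-length list could leave an assistant message first.)

```python
# trim_history() as above, repeated so this snippet runs standalone
def trim_history(messages, max_pairs=10):
    if len(messages) > max_pairs * 2:
        return messages[-(max_pairs * 2):]
    return messages

history = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"msg {i}"}
    for i in range(30)  # 15 complete user/assistant pairs
]
trimmed = trim_history(history)
print(len(trimmed), trimmed[0]["role"])  # 20 user
```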

# Log usage to track costs
response = client.messages.create(...)
print(f"Input: {response.usage.input_tokens} tokens | Output: {response.usage.output_tokens} tokens")

One more mistake I often see beginners make: committing an API key to GitHub in source code. Anthropic has an automated scanning system that will immediately revoke the key if detected. Use python-dotenv with a .env file added to .gitignore.

# .env
ANTHROPIC_API_KEY=sk-ant-api03-...

# .gitignore
.env
*.env

# main.py
from dotenv import load_dotenv
load_dotenv()

client = anthropic.Anthropic()  # Reads automatically from .env
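A fail-fast check at startup gives a much clearer message than an authentication error on the first request. The helper below is my own convention, not part of the SDK; call it once right after load_dotenv():

```python
import os

def require_api_key():
    """Abort early with a readable message if the key is missing."""
    if not os.environ.get("ANTHROPIC_API_KEY"):
        raise SystemExit("ANTHROPIC_API_KEY is not set. Check your .env or shell profile.")
```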

The quick start above is enough to get a prototype running in a few hours. From there, expand in order of priority: streaming first (improves UX immediately), then caching when costs start to climb, and finally retry and error handling when deploying to production. Claude API is quite stable — the codebase I wrote 8 months ago still runs fine, with no breaking changes requiring major migrations.
