Google Gemini API with Python: A Complete Guide from Basics to Advanced

Artificial Intelligence tutorial - IT technology blog

I started using the Gemini API in late 2023, mostly to benchmark it against GPT-4. What impressed me wasn’t the text quality — it was the native multimodal support right out of the box: send images, PDFs, even videos and ask questions directly. While other APIs charge extra for vision capabilities, Gemini bundles everything into a single endpoint.

Got a project that needs image analysis, PDF reading, or mixed-content processing? Gemini is worth trying — especially since its free tier is more generous than most competitors.

Why Gemini and When Should You Use It?

The practical question: when should you pick Gemini over OpenAI or other providers?

The problem many developers hit is that costs scale up fast. GPT-4o runs about $5/1M input tokens. Gemini 2.0 Flash is free up to 15 requests/minute with a 1M token context window — incredibly generous for experimental projects or small startups.

Gemini’s specific strengths:

  • Massive context window: Gemini 1.5 Pro supports 1 million tokens — roughly 700,000 English words. Send entire log files without chunking.
  • Native multimodal: Text, image, audio, video, and PDF in a single API call
  • Google Search grounding: An exclusive feature that lets responses be verified against real search results
  • Actually usable free tier: 1,500 requests/day with Gemini 2.0 Flash, no credit card required

Setup and Getting Your API Key

Getting an API key from Google AI Studio

Go to aistudio.google.com, sign in with your Google account, then select “Get API key” → “Create API key”. The key will look like AIzaSy...

Note: API keys from AI Studio are for direct Google AI access only — not through Vertex. Need an SLA for a larger project? Migrating to Vertex AI is the next step — it’s more complex to set up, but gives you much better SLA guarantees and control.

Installing the library and storing credentials

pip install google-genai

Store your API key as an environment variable — never hardcode it in your source code:

export GEMINI_API_KEY="AIzaSy..."

Or create a .env file and use python-dotenv:

# .env
GEMINI_API_KEY=AIzaSy...
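The .env file still has to be loaded at startup. A minimal sketch using python-dotenv, falling back to plain shell variables if the package isn't installed:

```python
import os

try:
    from dotenv import load_dotenv  # pip install python-dotenv
    load_dotenv()  # reads key=value pairs from .env into os.environ
except ImportError:
    pass  # fall back to variables already exported in the shell

api_key = os.environ.get("GEMINI_API_KEY")  # None if not configured
```

Pass api_key to genai.Client(api_key=...) as shown in the next section.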

Configuration and Usage in Detail

Your first request

import os
from google import genai

client = genai.Client(api_key=os.environ.get("GEMINI_API_KEY"))

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="Explain Docker Compose in simple terms"
)

print(response.text)

The result is available immediately via response.text — no extra unpacking needed. Compared to the anthropic or openai SDKs, the syntax is noticeably cleaner.

Multi-turn chat — maintaining conversation context

chat = client.chats.create(model="gemini-2.0-flash")

# Turn 1
response = chat.send_message("I'm learning Python, where should I start?")
print("AI:", response.text)

# Turn 2 — the model remembers context from turn 1
response = chat.send_message("Give me an example of a list comprehension")
print("AI:", response.text)

Image processing — Multimodal

This is the feature I use most. Analyzing error screenshots, reading system architecture diagrams, extracting text from screen captures — all in a single API call:

from pathlib import Path

image_data = Path("screenshot.png").read_bytes()

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[
        {
            "inline_data": {
                "mime_type": "image/png",
                "data": image_data
            }
        },
        "What is this error? How do I fix it?"
    ]
)

print(response.text)

Structured Output — Enforcing a JSON schema

Instead of manually parsing text (and endlessly handling edge cases), Gemini supports enforcing a JSON schema via response_schema. Extremely useful when building automated pipelines:

from google.genai import types
import json

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="List the 3 most commonly used Linux commands for sysadmins",
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema={
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "command": {"type": "string"},
                    "description": {"type": "string"},
                    "example": {"type": "string"}
                }
            }
        }
    )
)

data = json.loads(response.text)
for item in data:
    print(f"$ {item['command']}: {item['description']}")

Generation config and Safety settings

By default, Gemini’s safety filters are fairly strict — sometimes blocking perfectly legitimate technical content like penetration testing guides or security research. Here’s how to adjust them:

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=prompt,
    config=types.GenerateContentConfig(
        temperature=0.7,        # 0.0 = deterministic, 2.0 = creative
        top_p=0.95,
        max_output_tokens=8192,
        safety_settings=[
            types.SafetySetting(
                category="HARM_CATEGORY_DANGEROUS_CONTENT",
                threshold="BLOCK_ONLY_HIGH"
            )
        ]
    )
)

Streaming responses

Waiting for a long response to finish before displaying anything makes for a poor user experience. Streaming solves this — users see text appear incrementally, just like ChatGPT:

for chunk in client.models.generate_content_stream(
    model="gemini-2.0-flash",
    contents="Write a guide to installing Nginx on Ubuntu"
):
    print(chunk.text, end="", flush=True)

Monitoring and Observability

Tracking token usage

Every Gemini response includes metadata about tokens consumed — important for keeping costs in check even on the free tier:

usage = response.usage_metadata
print(f"Input tokens:  {usage.prompt_token_count}")
print(f"Output tokens: {usage.candidates_token_count}")
print(f"Total tokens:  {usage.total_token_count}")

Error handling with retry and exponential backoff

In production, catching the right exception types makes debugging significantly faster:

from google.genai import errors
import time

def safe_generate(client, prompt, retries=3):
    for attempt in range(retries):
        try:
            response = client.models.generate_content(
                model="gemini-2.0-flash",
                contents=prompt
            )
            return response.text

        except errors.ClientError as e:
            if "RATE_LIMIT_EXCEEDED" in str(e):
                wait = 2 ** attempt  # exponential backoff
                print(f"Rate limited, waiting {wait}s...")
                time.sleep(wait)
            elif "INVALID_API_KEY" in str(e):
                raise  # this error can't be fixed by retrying
            else:
                raise

        except errors.ServerError as e:
            print(f"Gemini server error (attempt {attempt+1}): {e}")
            time.sleep(2)

    return None

Quotas and rate limits to know

The free tier for Gemini 2.0 Flash gives you 15 RPM, 1,500 RPD, and 1M TPM. More than enough for personal projects or prototypes. To check your current quota usage, go to console.cloud.google.com → APIs & Services → Generative Language API → Quotas.
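Rather than waiting for 429s and backing off, you can stay under the 15 RPM ceiling proactively. This sliding-window throttle is my own sketch, not part of the SDK:

```python
import time
from collections import deque

class RpmLimiter:
    """Sliding-window throttle: blocks until a request slot is free."""

    def __init__(self, rpm: int = 15):
        self.rpm = rpm
        self.calls: deque = deque()  # timestamps of recent requests

    def wait(self) -> None:
        now = time.monotonic()
        # Discard timestamps that fell out of the 60-second window
        while self.calls and now - self.calls[0] >= 60:
            self.calls.popleft()
        if len(self.calls) >= self.rpm:
            # Sleep until the oldest request ages out of the window
            time.sleep(60 - (now - self.calls[0]))
            self.calls.popleft()
        self.calls.append(time.monotonic())
```

Call limiter.wait() before each generate_content call; combined with the retry helper above, this covers both sides of the rate-limit problem.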

When you need to scale up, Vertex AI Gemini lets you increase quotas on demand — but you’ll need to set up a Google Cloud project and billing account, which is considerably more involved than AI Studio.
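Conveniently, the same google-genai SDK talks to Vertex AI — the switch is mostly a Client constructor change. A configuration sketch (the project ID and region below are placeholders):

```python
from google import genai

# Uses Application Default Credentials
# (run `gcloud auth application-default login` first)
client = genai.Client(
    vertexai=True,
    project="my-gcp-project",  # placeholder: your GCP project ID
    location="us-central1",    # placeholder: a supported Vertex AI region
)
```

The rest of your code — generate_content calls, chat sessions, streaming — stays the same.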

I’ve used the Gemini API for an internal log analysis tool — feeding in log chunks and asking “anything unusual here?” Gemini 1.5 Pro’s 1M token context window is exactly why I chose it over other models: no need to split logs into chunks and merge results, just send the whole file and you’re done. For use cases that involve processing large volumes of text, that’s a hard advantage to ignore.
