Extracting Data from LLMs: Stop Manual JSON Parsing, Use Instructor Instead

Artificial Intelligence tutorial - IT technology blog

The Nightmare Named “Invalid JSON”

If you’ve ever called the OpenAI API to extract structured data, this scenario is surely familiar: you ask the AI to return JSON, but it generously adds a preamble: “Here is your result:”. Or worse, a closing brace is missing at the end of the string. Either way, json.loads() throws a JSONDecodeError instantly.

In the past, I often had to use Regex to “clean up” the results or write a page-long System Prompt just to beg the AI not to talk too much. However, after 6 months of putting Instructor into production, I’ve completely abandoned that painful approach. The parsing error rate dropped from 10-15% to almost zero.

Instructor is more than just a library. It’s a new mindset: Using Pydantic to define schemas and forcing LLMs to comply absolutely. If the AI returns something wrong? Instructor automatically takes the error log, throws it back at the AI, and says: “Wrong, fix it!”.

Installation in 30 Seconds

Instructor acts as a smart wrapper around the original OpenAI or Anthropic clients. You just need to install the main package along with Pydantic v2:

pip install -U instructor openai pydantic

This library provides excellent support for the tool_use (function calling) mechanism, making data extraction more stable than ever.

Real-world Implementation: From Schema to Clean Data

The workflow with Instructor is condensed into 3 steps: Define the Schema, Wrap the Client, and Call the API. Let’s see how I handle a messy chat snippet below.

1. Defining the Data “Template”

Instead of describing the structure in words, you define it in pure Python. This gives your editor and type checker full type-hint support.

from pydantic import BaseModel, Field
from typing import List, Optional

class UserInfo(BaseModel):
    name: str = Field(..., description="Full name")
    age: int = Field(..., description="Age")
    email: Optional[str] = Field(None, description="Email if available")
    skills: List[str] = Field(default_factory=list, description="Skills")
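Even before touching an API, you can see the mechanism Instructor builds on: Pydantic itself validates and coerces raw data against this schema. A minimal sketch, where the raw dict simply stands in for an LLM response:

```python
from typing import List, Optional
from pydantic import BaseModel, Field

class UserInfo(BaseModel):
    name: str = Field(..., description="Full name")
    age: int = Field(..., description="Age")
    email: Optional[str] = Field(None, description="Email if available")
    skills: List[str] = Field(default_factory=list, description="Skills")

# A stand-in for what the LLM might return
raw = {"name": "Nguyen Van A", "age": "28", "skills": ["Python", "JS"]}

# Pydantic coerces "28" to the int 28 and fills in the optional email as None
user = UserInfo.model_validate(raw)
```

If `raw` doesn't fit the schema, `model_validate` raises a `ValidationError` — that error is the raw material Instructor works with.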

2. Wrapping the Client and Extracting

Instead of using the old patch() function, the latest version of Instructor recommends using from_openai(). This approach is more transparent and easier to debug.

import instructor
from openai import OpenAI

# Initialize the smart client
client = instructor.from_openai(OpenAI(api_key="YOUR_KEY"))

text_input = "Hello, I'm Nguyen Van A, 28 years old. I know Python and JS. Email: [email protected]"

# Extract data directly into the object
user = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=UserInfo,
    messages=[{"role": "user", "content": text_input}],
    max_retries=3 # Automatically force AI to fix errors up to 3 times
)

print(f"{user.name} ({user.age} years old) is proficient in: {', '.join(user.skills)}")

At this point, user is a Pydantic instance. Your IDE can autocomplete its attributes, so there’s no worrying about typos in key names as there is with a plain dictionary.
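Because `user` is a regular Pydantic model, serialization also comes for free. A small illustration, with a locally constructed instance standing in for the API result:

```python
from typing import List, Optional
from pydantic import BaseModel, Field

class UserInfo(BaseModel):
    name: str
    age: int
    email: Optional[str] = None
    skills: List[str] = Field(default_factory=list)

# Locally built instance standing in for the object returned by Instructor
user = UserInfo(name="Nguyen Van A", age=28, skills=["Python", "JS"])

data = user.model_dump()           # plain dict, ready for a database or queue
json_str = user.model_dump_json()  # JSON string that is guaranteed to be valid
```

No regex cleanup, no `json.loads()` in sight: the JSON you emit is generated from a validated object, not scraped out of a chat reply.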

Why is the Auto-retry Mechanism Important?

In a production environment, LLMs sometimes confuse data types—for example, returning age as the string “twenty-eight”. With Instructor, you don’t need to worry. When Pydantic reports a validation error, the library automatically sends a detailed error message back to the AI. The AI then looks at that error to adjust its output in the next retry.
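You can reproduce the failure that triggers a retry with Pydantic alone. The validation error text below is the kind of feedback Instructor sends back to the model (a sketch with a simulated bad response):

```python
from pydantic import BaseModel, ValidationError

class UserInfo(BaseModel):
    name: str
    age: int

# Simulating a bad LLM response: age as a spelled-out word
try:
    UserInfo.model_validate({"name": "Nguyen Van A", "age": "twenty-eight"})
except ValidationError as e:
    # Instructor relays a report like this to the model before the next attempt
    error_report = str(e)
```

The report names the offending field and the expected type, which is usually enough context for the model to correct itself on the next pass.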

Battle-tested Tips for Production

After a long period of application, I’ve distilled 3 golden rules to optimize performance and cost:

  • Prioritize Small Models: You don’t need GPT-4o for simple extraction tasks. gpt-4o-mini or claude-3-haiku combined with Instructor provide equivalent accuracy but are up to 20 times cheaper.
  • Validators are the Ultimate Weapon: Use field_validator to control business logic. For example: If the extracted age is a negative number, force the AI to redo it immediately.
  • Monitor Retry Counts: If a task frequently requires a 3rd retry, it’s a sign that your Prompt or Schema is too complex.
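One lightweight way to follow the third rule is to wrap the extraction call and log how many attempts it took. The helper below is purely illustrative (it is not part of Instructor's API), and the flaky function stands in for an LLM call that fails validation twice before succeeding:

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("extraction")

def extract_with_monitoring(call, max_retries=3):
    # Hypothetical wrapper: retry `call` on validation failure, report attempts used
    for attempt in range(1, max_retries + 1):
        try:
            return call(), attempt
        except ValueError:
            if attempt == max_retries:
                raise

attempts = {"n": 0}

def flaky_extract():
    # Stand-in for an LLM call that fails validation on the first two tries
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ValueError("validation failed")
    return {"name": "Nguyen Van A", "age": 28}

result, used = extract_with_monitoring(flaky_extract)
if used == 3:
    log.warning("Extraction hit the retry ceiling (%d attempts); simplify the prompt or schema", used)
```

A warning that fires regularly on the same task is your cue to split the schema or tighten the prompt rather than pay for retries.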

See how I block junk data using a Validator:

from pydantic import BaseModel, field_validator

class UserInfo(BaseModel):
    name: str
    age: int

    @field_validator('age')
    @classmethod
    def validate_age(cls, v):
        if v < 0 or v > 120:
            raise ValueError("Age must be between 0 and 120")
        return v

When the LLM returns age: -1, Instructor tells it: “Hey, age cannot be negative, please check again”. This is the self-healing capability that traditional JSON parsing methods simply cannot provide.
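You can watch this feedback loop in miniature with Pydantic alone: the validator's message is exactly the kind of text Instructor relays to the model, shown here with a simulated bad response:

```python
from pydantic import BaseModel, ValidationError, field_validator

class UserInfo(BaseModel):
    name: str
    age: int

    @field_validator('age')
    @classmethod
    def validate_age(cls, v):
        if v < 0 or v > 120:
            raise ValueError("Age must be between 0 and 120")
        return v

# Simulating the LLM returning a negative age
try:
    UserInfo.model_validate({"name": "Nguyen Van A", "age": -1})
except ValidationError as e:
    # This text is what the model sees before its next attempt
    feedback = str(e)
```

The business rule lives in your code, not in the prompt, so the model can only satisfy you by producing data that actually passes it.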

Switching to Instructor is like upgrading from manual labor to an automated assembly line. It lets you focus on application logic instead of cleaning up after the AI. If your project is struggling with inconsistent LLM data, try integrating it today.
