Build Your Own Text-to-SQL System with Vanna.ai: Let AI Handle 80% of Mundane Reporting Tasks

Artificial Intelligence tutorial - IT technology blog
Artificial Intelligence tutorial - IT technology blog

The Nightmare of “Hey, can you pull this report for me?”

If you’re a Data Engineer or Backend Developer, you’re likely familiar with the scene: you’re deep in code when a message pops up from Marketing. Questions like “What was last week’s revenue in the North?” aren’t hard, but they eat up a lot of small chunks of time. After six months of running Vanna.ai in production, I’ve found it to be a true lifesaver.

This system helps reduce minor query tickets for the data team by up to 70%. Instead of writing dozens of SQL queries every day, I train the AI to understand the database structure so it can serve users directly. In this article, we’ll dive into how to set up a complete Text-to-SQL system using Python and Vanna.ai.

Why Choose Vanna.ai Over Pure Prompt Engineering?

Many might wonder: “Why not just copy the schema into ChatGPT and have it write SQL?” In reality, with corporate databases exceeding 200 tables and thousands of columns, dumping the entire schema into a prompt is impossible. You’ll quickly hit token limits and face serious data security risks.

Vanna.ai solves this using a specialized RAG (Retrieval-Augmented Generation) approach. Your actual data stays put. Vanna only stores metadata like DDL, column descriptions, and sample SQL queries as vectors. When a question is asked, it searches for the most relevant structures to provide context for the LLM. The result? Real-world accuracy increases from 60% to over 92% after thorough training.

3 Core Layers That Keep Vanna.ai Running Smoothly

To implement it effectively, you need to understand how Vanna categorizes management:

  • Infrastructure: Where you choose the “brain” (GPT-4, Claude 3, Ollama) and the vector store (ChromaDB, Pinecone).
  • Knowledge Base: The repository containing DDL statements, business documentation, and verified “Golden” SQL queries.
  • Execution: The direct bridge to the DB to execute SQL commands and return results as tables or visual charts.

Getting Started: Building the Text-to-SQL System

We’ll demo this using SQLite. These steps apply similarly to PostgreSQL, MySQL, or SQL Server in a production environment.

1. Environment Setup

You should use Python 3.9 or higher and create a virtual environment to avoid library conflicts.

pip install vanna

2. Configuring Vanna

Vanna is flexible, allowing you to run it entirely locally for absolute security. In this example, I’m using OpenAI as the language processing brain and ChromaDB for vector storage.

import vanna as vn
from vanna.openai import OpenAI_Chat
from vanna.chromadb import ChromaDB_VectorStore

class MyVanna(ChromaDB_VectorStore, OpenAI_Chat):
    def __init__(self, config=None):
        ChromaDB_VectorStore.__init__(self, config=config)
        OpenAI_Chat.__init__(self, config=config)

vn = MyVanna(config={'api_key': 'YOUR_OPENAI_API_KEY', 'model': 'gpt-4'})

3. Connecting to the Database

Testing the connection with the database file my_business.sqlite.

vn.connect_to_sqlite('my_business.sqlite')

4. Knowledge Training – The Critical Phase

AI is only as smart as its understanding of your DB’s “rules.” Never feed it data haphazardly. I usually split training into three layers to optimize accuracy:

Layer 1: Feeding the table structure (DDL). Helps the AI know which table contains what information.

# Fetch table structures
ddl = vn.run_sql("SELECT sql FROM sqlite_master WHERE type='table';")
for table_ddl in ddl['sql']:
    vn.train(ddl=table_ddl)

Layer 2: Business Documentation. The AI will be confused if status_id = 1 means “Paid” but you don’t explain it.

vn.train(documentation="status_id column: 1 means Success, 0 means Canceled.")

Layer 3: Sample SQL (Golden SQL). This is the most effective way to teach AI through real-world examples.

vn.train(sql="SELECT customer_name, SUM(total_amount) FROM orders GROUP BY 1 ORDER BY 2 DESC LIMIT 10;")

5. Running the Test

Now it’s time to enjoy the results by asking questions in natural language.

question = "Who is the customer with the most purchases this month?"
sql = vn.generate_sql(question)
print(f"Generated SQL: {sql}")

df = vn.run_sql(sql)
print(df)

Hard-Won Lessons After 6 Months in Production

Deploying AI in production is much harder than running a local demo. Here are the key takeaways:

  • Meaningful Naming: If your DB is full of table_1 and col_a, the AI will definitely guess wrong. Write detailed documentation for columns with ambiguous names.
  • Prioritize Security: Never use the root user. Create a Read-Only user and only grant access to necessary tables, staying away from sensitive info like passwords or tokens.
  • Feedback Loop: Every week, I check the questions the AI answered incorrectly. I fix the SQL and feed it back into vn.train(). The system gets noticeably smarter in just 2-3 weeks.
  • Leverage Visualization: Vanna integrates Plotly seamlessly. Asking the AI to generate charts directly from query results usually leaves managers very impressed.

Conclusion

Querying data using natural language is no longer science fiction. With Vanna.ai, building a Text-to-SQL system has become more practical and accessible than ever.

AI can’t replace Data Engineers for complex financial reports yet. However, it’s more than capable of handling 80% of mundane daily query requests. Good luck freeing up your workflow—no more “Hey, can you check this for me” in the middle of your lunch break!

Share: