Generating Millions of Logically Consistent Database Records: The Power of Combining LLM and SDV

Artificial Intelligence tutorial - IT technology blog
Artificial Intelligence tutorial - IT technology blog

The 2 AM Incident and the Clean Data Problem

Picture this: it’s 2 AM and your phone won’t stop buzzing. The production system has hit a bizarrely elusive bug — one that only surfaces when users perform overlapping transactions within a short time window. You scramble to your laptop, ready to dump the database locally to debug. Then you freeze.

GDPR compliance policies and company regulations strictly prohibit copying real data to personal machines. Exposing even a single row of personally identifiable information (PII) could end your career with a devastating legal penalty. You stare at the empty local database, completely stuck.

How do you get 1 million records that feel “real” while preserving foreign key relationships to reproduce the bug? Faker produces data that’s too disconnected. Calling ChatGPT directly is too slow and expensive. That’s when I found the formula: use an LLM to understand the context and the SDV (Synthetic Data Vault) library to execute. I applied this approach to an e-wallet system with over 50 linked tables — the results were impressive.

Comparing Synthetic Data Generation Approaches

Before diving into code, let’s look at the 3 most common methods today to understand why the LLM + SDV combo comes out on top.

1. Faker or Mockaroo Libraries

  • Pros: Fast, easy to deploy for individual tables.
  • Cons: No relational awareness. Faker might generate an order with a delivery date before the order date. It has absolutely no understanding of your business logic.

2. Direct Prompting with LLMs (ChatGPT/Claude)

  • Pros: Highly intelligent — understands complex constraints from plain descriptions.
  • Cons: Token limits are a major bottleneck. You can’t ask ChatGPT to generate 10GB of CSV data without interruptions or paying hundreds of dollars in API fees.

3. The SDV Library (Synthetic Data Vault)

  • Pros: A dedicated Python tool. It learns statistical distributions from a small data sample and scales it up while preserving the original characteristics.
  • Cons: Configuring the Metadata (defining table relationships) is painful for complex database schemas.

Why LLM + SDV Is the Perfect Pair

SDV’s biggest barrier is writing the Metadata file. With hundreds of tables, manual definition is error-prone. This is exactly where LLMs shine. I use an LLM to read the SQL schema, then ask it to generate the SDV configuration file. SDV then handles the “grunt work” of producing massive volumes of data.

This approach guarantees three core properties: Correct business logic — Preserved foreign key relationships — Large-scale data volume.

Step-by-Step Implementation Guide

Step 1: Set Up the Environment

Use a virtual environment to avoid library conflicts. SDV has significant mathematical dependencies, including PyTorch and Scikit-learn.

pip install sdv pandas

Step 2: Use an LLM to Generate Metadata

Say you have two tables: users and transactions. Don’t manually type out the JSON. Paste your SQL schema into Claude 3.5 or GPT-4 with the following prompt:

“Based on this SQL schema, generate a Metadata file following the SDV 1.0 standard. Clearly define the Primary Key, Foreign Key, and sdtype data type for each column.”

You’ll receive a configuration structure like this:

metadata_dict = {
    "tables": {
        "users": {
            "primary_key": "user_id",
            "columns": {
                "user_id": {"sdtype": "id"},
                "email": {"sdtype": "email"},
                "age": {"sdtype": "numerical", "computer_representation": "Int64"}
            }
        },
        "transactions": {
            "primary_key": "tx_id",
            "columns": {
                "tx_id": {"sdtype": "id"},
                "user_id": {"sdtype": "id"},
                "amount": {"sdtype": "numerical", "computer_representation": "Float"}
            }
        }
    },
    "relationships": [
        {
            "parent_table_name": "users",
            "child_table_name": "transactions",
            "parent_primary_key": "user_id",
            "child_foreign_key": "user_id"
        }
    ]
}

Step 3: Train the Model

All you need is a sample data file of around 100 rows. SDV learns the patterns from this file to understand the data distribution.

from sdv.multi_table import HMAVSynthesizer
from sdv.metadata import MultiTableMetadata
import pandas as pd

# Initialize metadata from the LLM-generated configuration
metadata = MultiTableMetadata.from_dict(metadata_dict)

# Load sample data (PII already scrubbed)
data = {
    'users': pd.read_csv('users_sample.csv'),
    'transactions': pd.read_csv('transactions_sample.csv')
}

# Train the model using the HMAV algorithm
synthesizer = HMAVSynthesizer(metadata)
synthesizer.fit(data)
synthesizer.save('finance_model.pkl')

Step 4: Generate Data at Scale

With a single command, you can produce data hundreds of times larger than your original sample.

# Generate data at 100x the sample size
synthetic_data = synthesizer.scale(scale=100) 

# Export results for loading into the database
synthetic_data['users'].to_csv('synthetic_users.csv', index=False)
synthetic_data['transactions'].to_csv('synthetic_transactions.csv', index=False)

Real-World Experience: Pitfalls to Avoid

From hands-on project work, I’ve distilled 3 key lessons to keep your data from turning to garbage:

  1. Temporal constraints: SDV doesn’t automatically know that end_date must come after start_date. You need to use SDV’s Constraints feature. Ask your LLM to extract these rules from your SRS documentation and wire them into the code.
  2. Accuracy control: Always run synthesizer.get_score() after generation. If the score is below 0.7 (70%), the synthetic data has drifted too far from reality. In that case, improve the quality of your initial 100-row sample.
  3. Custom formats: If you need company-domain emails (like @vinadata.com), configure the sdtype as a custom formatter instead of using the default.

Wrapping Up

Synthetic data generation isn’t just about filling a database to make it look good. It’s a powerful tool for load testing, logic verification, and protecting your reputation against strict data security regulations. Instead of spending 3 days writing manual scripts, the LLM + SDV combo helped me build a quality dataset in under 2 hours.

Don’t wait for a midnight incident to start looking for solutions. Try building a Synthetic Data pipeline today — it’ll make debugging a whole lot less stressful.

Share: