PostgreSQL Anonymizer: The Ultimate Trick to Mask Production Data for Staging Without PII Leaks – ITFROMZERO

Table of Contents

The Fear of Losing Your Career When Moving Production Data to Staging/Dev

Every time a DBA or Lead Dev hears the request “Can I get a Prod data dump for an urgent debug?”, they get a nervous twitch. The problem is that real data is full of personally identifiable information (PII) like emails, phone numbers, or credit card details.

Copying a several hundred GB database to a personal machine is like putting one foot into a data breach risk. However, using only mock data leaves the dev team helpless against weird logic cases that only occur in real data. After 6 months of integrating PostgreSQL Anonymizer (anon) into our CI/CD pipeline, I’ve found it to be the best balance between security and practicality.

Three Common Data Anonymization Methods (And Why You Should Choose an Extension)

Before diving into the technical details, let’s look back at traditional methods to see the difference.

1. Running Manual SQL Scripts

You dump the data and then run an update_masking.sql file like UPDATE users SET email = '[email protected]'.

Weakness: Extremely slow on tables with over 10 million records because UPDATE commands cause high IO and database bloat. It’s also very easy to miss new fields as the schema constantly changes.

2. Using External ETL Tools

Using Airflow or Python scripts to pull data, process it intermediately, and then push it to Staging.

Weakness: Requires additional intermediate server costs. Configuring connections between environments also makes the system bulky and difficult to maintain.

3. PostgreSQL Anonymizer (Extension)

This is an extension that runs directly inside Postgres. It allows you to define masking rules right on the schema as part of the database configuration.

Strength: Declare once, use forever. It supports built-in mock data generation functions and achieves impressive speeds when combined with anon.dump().

Why PostgreSQL Anonymizer is the “One”?

Here are 3 reasons why I decided to stick with this tool:

Declarative Declaration: You simply use the MASKED WITH command to specify the columns to hide. When looking at the schema, you immediately know which columns are protected without digging through old scripts.
Dynamic Masking Mechanism: You can create a specific dev_user. When this user queries, Postgres automatically returns anonymized data. Meanwhile, the admin user still sees the real data for auditing.
Format Preserving: Fake emails still have the @ domain, and phone numbers still have 10 digits. This prevents the code from crashing due to data validation errors.

If you need to quickly process a few CSV files containing user lists into JSON for script testing, try the converter at toolcraft.app/en/tools/data/csv-to-json. It runs 100% in the browser, so there’s no risk of leaking data to a server—perfect for handling sensitive files quickly.

Practical Installation and Configuration Guide

Step 1: Install the Extension

With Docker, use the image registry.gitlab.com/dalibo/postgresql_anonymizer. On Ubuntu, install the package corresponding to your Postgres version:

# Install for Postgres 15
sudo apt-get install postgresql-15-anonymizer

Open the postgresql.conf file and add anon to the startup libraries:

shared_preload_libraries = 'anon'

Don’t forget to restart Postgres for the changes to take effect.

Step 2: Initialize in the Database

Run the following commands to activate the anonymization features:

CREATE EXTENSION IF NOT EXISTS anon CASCADE;
SELECT anon.init();

Step 3: Set Up Anonymization Rules

Suppose you need to protect the users table. Try these common rules:

-- Replace real names with random first names
SECURITY LABEL FOR anon ON COLUMN users.full_name
  IS 'MASKED WITH FUNCTION anon.fake_first_name()';

-- Mask email, keeping only the first and last 2 characters
SECURITY LABEL FOR anon ON COLUMN users.email
  IS 'MASKED WITH FUNCTION anon.partial(email,2,$$******$$,2)';

-- Generate a fake 10-digit phone number
SECURITY LABEL FOR anon ON COLUMN users.phone
  IS 'MASKED WITH FUNCTION anon.random_string(10)';

How to Dump Clean Data to the Staging Environment

Instead of using standard pg_dump, we will use the extension’s specialized tool to export “cleaned” data.

# Export masked data directly to a SQL file
pg_dump_anon -h localhost -U postgres my_prod_db > dump_anonymized.sql

At this point, pg_dump_anon will automatically scan the SECURITY LABEL and replace sensitive data with fake values. The dump_anonymized.sql file is now perfectly safe to share on Slack or send to the Dev team.

Lessons Learned After 6 Months of Implementation

I’ve gathered a few notes to save you some time:

Performance: For tables with tens of millions of rows, the dumping process is about 2-3 times slower than usual. You should schedule it to run during off-peak hours like 2-3 AM.
Determinism: If you want ID 123 in Prod to always have the same fake name in Staging for easier log tracking, use anon.hash() instead of random fake_name() functions.
Avoid Masking Foreign Keys: Never touch ID columns used for joining tables. If you mask these columns incorrectly, your database will turn into a mess and become unusable.

Conclusion

PostgreSQL Anonymizer makes the data synchronization process much more professional. Instead of living in fear every time you hand over data, you only need to define the rules once. If your project handles real user data, implement this now to protect both your customers and your own career.