PostgreSQL Dynamic Data Masking: Stop Leaking Customer Data in Dev/Test

Database tutorial - IT technology blog
Database tutorial - IT technology blog

Context: The Nightmare of Customer Data Leaks

As a data administrator, you’ve likely faced a frustrating situation: the Dev team needs real production data to fix a “tough” logic bug, or the Test team needs diverse data for automation. The problem is, the database contains sensitive information like phone numbers, emails, or account balances.

Moving that raw data to a personal machine or a Staging environment is a security gamble. If a developer’s laptop is lost or a backup is accidentally left on an unsecured cloud, all customer information could be sold instantly.

I once worked on a Fintech project with a database over 500GB. Every time we synced data to the Lab, the team spent the whole morning writing UPDATE scripts to overwrite it with dummy data. This manual approach caused transaction logs to skyrocket, bloated the database by dozens of GBs, and required hours of waiting for scripts to finish on tables with tens of millions of records.

Dynamic Data Masking (DDM) is the solution to this problem. Instead of modifying physical data, DDM acts like a “filter.” The real data remains untouched on the disk, but depending on permissions, users will only see masked strings or completely fake data.

Installing postgresql_anonymizer

PostgreSQL does not support this feature natively like SQL Server Enterprise. However, the postgresql_anonymizer extension (often called anon) has become the gold standard for handling masking today.

1. Quick Deployment with Docker

The “instant” way to test it is using the official Docker image with the extension pre-installed:

docker run -d --name pg_anon -e POSTGRES_PASSWORD=mysecretpassword -p 5432:5432 registry.gitlab.com/dalibo/postgresql_anonymizer

2. Installing on a Linux Server

If you are running Ubuntu or Debian, you can install it directly from the PGDG repository:

# Install the extension for Postgres 15
sudo apt-get install postgresql-15-anonymizer

Next, you need to configure Postgres to automatically load this extension at startup by editing the postgresql.conf file:

shared_preload_libraries = 'anon'

After restarting Postgres, activate this tool in your database:

CREATE EXTENSION anon CASCADE;
SELECT anon.init();

Configuring Masking: Labeling to Hide Data

The mechanism of anon is very intuitive: You apply a Security Label to a column, and that column will be hidden according to the rules you choose.

Step 1: Set up a User for the Dev Team

First, create a separate user for the Test environment. This user will have limited permissions to see real data.

CREATE ROLE dev_user LOGIN PASSWORD 'password123';
GRANT SELECT ON ALL TABLES IN SCHEMA public TO dev_user;

Step 2: Apply Masking Rules

Suppose the customers table needs to be secured. We will apply the three most common techniques:

  1. Partial Masking: Reveal only the first few characters of an email so it’s still identifiable but the actual address is hidden.
  2. Faking: Replace real names with random names (e.g., “Nguyễn Văn A” becomes “John Doe”).
  3. Destructive Masking: Insert random characters into the middle of a phone number.
-- Mask email, keeping the first 2 and last 2 characters
SECURITY LABEL FOR anon ON COLUMN customers.email
IS 'MASKED WITH FUNCTION anon.partial(email,2,$$******$$,2)';

-- Replace name with random fake data
SECURITY LABEL FOR anon ON COLUMN customers.full_name
IS 'MASKED WITH FUNCTION anon.fake_first_name()';

-- Blur phone number
SECURITY LABEL FOR anon ON COLUMN customers.phone
IS 'MASKED WITH FUNCTION anon.partial(phone,3,$$****$$,2)';

Step 3: Activate Anonymization Mode

Finally, you need to instruct Postgres to apply the masking rules to the dev_user:

SELECT anon.start_masking();
SECURITY LABEL FOR anon ON ROLE dev_user IS 'MASKED';

Verifying the Results

Compare the difference when querying the same row of data using two different accounts.

Admin (Sees everything):

-- Result: "Trần Văn B", "[email protected]", "0912345678"

Dev_user (Obfuscated data):

SET ROLE dev_user;
SELECT full_name, email, phone FROM customers LIMIT 1;
-- Result: "Robert", "tr******om", "091****78"

A few months ago, I applied this method to an ERP system. Instead of spending a whole day running obfuscation scripts, it took me exactly 15 minutes to configure the rules. Although the query performance for masked users dropped by about 5-10% due to CPU overhead from anonymization functions, the trade-off was absolute peace of mind.

Critical Lessons from Real-World Operation:

  • Performance Issues: Masking occurs at the time of SELECT. If you query millions of rows without a WHERE or LIMIT clause, server CPU usage will spike to process the fake_* functions.
  • UNIQUE Constraints: Be careful when masking columns with Unique indexes. If a fake function accidentally generates two identical values, Join statements might return results that deviate from business logic.
  • Safe Backup Feature: When using pg_dump with a masked user, the resulting dump will only contain fake data. This is an excellent way to send databases to outsourced partners without worrying about leaking trade secrets.

Data security doesn’t have to be a bottleneck that slows down progress. With Dynamic Data Masking, the Dev team still has high-quality data to work with, and you can sleep soundly every night.

Share: