The Problem: Stop Sending Raw Customer Data to OpenAI
Six months ago, I deployed a RAG chatbot for a large bank. The biggest challenge wasn’t prompt engineering or model selection; it was data privacy. The client asked a very practical question: ‘How can we be sure that credit card numbers, phone numbers, or home addresses aren’t pushed directly to OpenAI’s or Claude’s servers?’
For personal projects, the risk might be low. But in a corporate environment, leaking PII (Personally Identifiable Information) is a massive legal headache. Regulations like GDPR or Vietnam’s Decree 13/2023/ND-CP carry heavy penalties. After testing several options, I chose Microsoft Presidio. This toolkit provides a highly stable way to ‘cleanse’ data before it leaves your internal servers.
How Does Presidio Work?
Think of Microsoft Presidio as a ‘water filter’ for text data. You pour in raw text, it scans for names, emails, ID numbers… then replaces them with dummy labels or encrypts them. This system consists of two main components:
- Presidio Analyzer: Acts as the ‘detective’. It combines Regex, NLP models (Spacy, Transformers), and checksum logic (like the Luhn algorithm for credit card numbers) to track down sensitive entities.
- Presidio Anonymizer: Acts as the ‘editor’. Based on the Analyzer’s results, it performs actions such as: Replace, Redact, Hash, or Encrypt.
The biggest plus is its ability to understand context. Presidio doesn’t just rely on dry Regex. It can distinguish when ‘Washington’ is a person’s name versus a location, significantly reducing the False Positive rate.
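The checksum logic mentioned above is easy to see in isolation. Here is a minimal, self-contained sketch of the Luhn algorithm (plain Python, no Presidio required) of the kind the Analyzer's credit-card recognizer relies on:

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    digits = [int(d) for d in number if d.isdigit()]
    total = 0
    # Walk from the rightmost digit; double every second digit,
    # subtracting 9 when the doubled value exceeds 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(luhn_valid("4111111111111111"))  # well-known Luhn-valid test number -> True
print(luhn_valid("4111111111111112"))  # one digit off -> False
```

A 16-digit match that fails this check can be demoted or discarded, which is exactly how a checksum cuts down the false positives a bare regex would produce.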
Getting Started with Implementation
I’m using Python 3.10 for this project. Installing Presidio along with the Spacy language model only takes a few minutes:
```shell
pip install presidio-analyzer presidio-anonymizer spacy
python -m spacy download en_core_web_lg
```
1. Data Analysis with the Analyzer
The code below will help you ‘scan’ a user’s chat message for sensitive content:
```python
from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()

text_to_analyze = "Hello admin, I am Nguyen Van A, my phone number is 0901234567. I want to ask about an order sent to 123 Le Loi Street."

# Specify the entities to find
results = analyzer.analyze(
    text=text_to_analyze,
    entities=["PERSON", "PHONE_NUMBER", "LOCATION"],
    language="en",
)

for res in results:
    print(res)
```
In practice, spaCy’s en_core_web_lg model identifies Vietnamese names quite well even with language='en'. However, to push accuracy for Vietnamese above 95%, you will need the further customization covered in the next steps.
2. Data Anonymization with the Anonymizer
Once the Analyzer ‘detective’ has reported its findings, we proceed to ‘mask’ the data before sending it to the AI API.
```python
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

anonymizer = AnonymizerEngine()

# Configuration: replace names, mask phone numbers, and redact locations
operators = {
    "PERSON": OperatorConfig("replace", {"new_value": "[NAME]"}),
    "PHONE_NUMBER": OperatorConfig(
        "mask",
        {"type": "mask", "masking_char": "*", "chars_to_mask": 6, "from_end": True},
    ),
    "LOCATION": OperatorConfig("redact", {}),
}

anonymized_result = anonymizer.anonymize(
    text=text_to_analyze,
    analyzer_results=results,
    operators=operators,
)

print(anonymized_result.text)
```
The result: “Hello admin, I am [NAME], my phone number is 0901******. I want to ask about an order sent to .” (the redacted location simply disappears). This string is now safe enough to send to GPT-4 without worrying about privacy policy violations.
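For intuition, the mask operator’s effect on the phone number can be reproduced with plain string slicing (a simplified sketch mirroring the config above, not Presidio’s actual implementation):

```python
def mask_from_end(value: str, masking_char: str = "*", chars_to_mask: int = 6) -> str:
    """Replace the last `chars_to_mask` characters, like the mask operator config."""
    if chars_to_mask >= len(value):
        return masking_char * len(value)
    return value[:-chars_to_mask] + masking_char * chars_to_mask

print(mask_from_end("0901234567"))  # -> 0901******
```

Masking from the end keeps the carrier prefix visible, which is often enough for support staff to recognize the number without exposing it.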
3. Customizing for Local Data (Citizen ID, License Plates)
Presidio does not support Vietnam’s 12-digit Citizen ID (CCCD) by default. You need to define a Custom Recognizer using Regex like this:
```python
from presidio_analyzer import Pattern, PatternRecognizer

# Define pattern for CCCD (12 consecutive digits)
cccd_pattern = Pattern(name="cccd_pattern", regex=r"\b\d{12}\b", score=0.5)

cccd_recognizer = PatternRecognizer(
    supported_entity="VN_CCCD",
    patterns=[cccd_pattern],
    context=["id number", "cccd", "identification", "identity card"],
)

analyzer.registry.add_recognizer(cccd_recognizer)
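Before wiring this into production, it’s worth sanity-checking the regex itself with plain `re` (no Presidio needed). Note that `\b\d{12}\b` also matches any other 12-digit run, which is why the modest 0.5 score plus the context words matters:

```python
import re

CCCD_REGEX = re.compile(r"\b\d{12}\b")

samples = {
    "My CCCD is 079123456789, please verify": True,   # exactly 12 digits -> match
    "Order code 12345678901 arrived": False,          # 11 digits -> no match
    "Tracking 1234567890123 shipped": False,          # 13 digits -> no match
}
for text, expected in samples.items():
    assert bool(CCCD_REGEX.search(text)) == expected
print("regex behaves as expected")
```

The context list lets Presidio boost the confidence score when words like “cccd” appear near the match, so a random 12-digit order code is less likely to be flagged.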
Lessons from Production
After six months of operation, I’ve gathered three key lessons to keep the system from bottlenecking:
- Processing Speed: Don’t use Transformer models (BERT/RoBERTa) if your application requires real-time responses. The spaCy lg model combined with regex gives a latency of only about 50-100 ms, roughly 10 times faster than Transformers while maintaining high accuracy.
- Controlling False Positives: Don’t use the default score_threshold. Adjust it to around 0.35 after testing with sample data to balance missed entities against over-identification.
- De-anonymization Mechanism: Sometimes the AI needs to respond using the customer’s actual name. Store the mapping between the original and anonymized values in Redis, keyed by session. When the AI returns a result containing [NAME], map it back before responding to the user.
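The de-anonymization step can be sketched with an in-memory dict standing in for Redis (the names `session_store` and `deanonymize` are illustrative, not Presidio APIs):

```python
# Per-session mapping from placeholder back to the original value.
# In production this would live in Redis, keyed by session ID, with a TTL.
session_store = {
    "session-42": {"[NAME]": "Nguyen Van A", "[PHONE]": "0901234567"},
}

def deanonymize(text: str, session_id: str) -> str:
    """Restore original values in the LLM's reply before showing it to the user."""
    mapping = session_store.get(session_id, {})
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text

reply = "Hi [NAME], we will call you back on [PHONE] shortly."
print(deanonymize(reply, "session-42"))
# -> Hi Nguyen Van A, we will call you back on 0901234567 shortly.
```

In practice you would generate unique placeholders per entity occurrence (e.g. [NAME_1], [NAME_2]) so that two people mentioned in the same session don’t collapse into one mapping entry.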
Conclusion
Data security is not something you should ‘leave up to’ giants like OpenAI or Microsoft. Proactively filtering PII with Presidio helps you build trust with customers, especially in the FinTech and Healthcare sectors. With just a few lines of code, you’ve created a solid protective shield for your AI system.

