Git Filter-Repo: Permanently Remove Sensitive Data and Large Files from Repository History

Git tutorial - IT technology blog

2 AM and the repo just leaked a secret

I still remember that night. A new developer on the team committed a .env file straight to main, complete with the production database password and AWS secret key. Even though the file was deleted in the next commit, everyone knows — Git never truly forgets. Anyone who clones the repo can run git log -p and see that secret sitting right there in history, fully exposed.

GitHub then sent a warning email: “We found a secret in your repository history.” AWS had already automatically disabled the key before I could do anything. That’s when I learned how to use git-filter-repo — and why you should know this tool before you ever need it.

Why git-filter-repo

There are several ways to rewrite Git history, but each has its own problems:

  • git filter-branch: Built into Git but painfully slow on large repos, error-prone, and Git itself has long recommended against it.
  • BFG Repo Cleaner: Fast and easy to use, but requires Java and lacks flexibility for complex cases.
  • git-filter-repo: Written in Python, officially recommended by the Git project as the replacement for filter-branch, and 10–50x faster in practice.

Concretely: on a repo with 2GB of history, filter-branch took 40 minutes. git-filter-repo finished in under 3 minutes. No further justification needed.

Installing git-filter-repo

Check your Git version first — you need 2.22.0 or later:

git --version
# git version 2.43.0

Install via pip (Python 3.6+):

# System-wide install
pip3 install git-filter-repo

# Or inside a virtualenv
pip install git-filter-repo

# Verify the installation
git filter-repo --version
# 2.45.0

On macOS with Homebrew:

brew install git-filter-repo

On Debian/Ubuntu:

sudo apt install git-filter-repo

The golden rule before you start: never work directly on the original repo. Clone a separate copy, make a backup, then proceed:

# Mirror clone for backup
git clone --mirror https://github.com/yourorg/your-repo.git repo-backup.git

# Fresh clone to work with
git clone https://github.com/yourorg/your-repo.git repo-clean
cd repo-clean

Removing sensitive files and specific data

Completely remove a file from every commit

This covers 90% of cases — removing a .env file, config/secrets.yml, or anything else that should never have been in the repo:

# Remove .env from the entire history
git filter-repo --path .env --invert-paths

# Remove multiple files at once
git filter-repo --path .env --path config/secrets.yml --path credentials.json --invert-paths

# Remove by pattern (all .pem files)
git filter-repo --path-glob '*.pem' --invert-paths

# Remove an entire directory
git filter-repo --path secrets/ --invert-paths

The --invert-paths flag means “remove these paths, keep everything else” — the opposite of the default behavior.

Replace sensitive content inside files (without deleting the file)

Need to keep the file but scrub the secret values inside it? Create a map file with the strings to replace:

# Create a file with strings to replace
cat > expressions.txt << 'EOF'
literal:sk-ant-api03-AbCdEf123456789==>***REMOVED***
literal:AKIAIOSFODNN7EXAMPLE==>***AWS_KEY_REMOVED***
EOF

# Apply it
git filter-repo --replace-text expressions.txt

The format is old_string==>new_string, with regex support as well:

cat > expressions.txt << 'EOF'
regex:password=\S+==>password=***REMOVED***
regex:api_key:\s*['"]\S+['"]==>api_key: "***REMOVED***"
EOF

git filter-repo --replace-text expressions.txt
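A handy shortcut: any line in the file that has no ==> arrow gets replaced with the default ***REMOVED***, so a bare list of known leaked strings (reusing the same hypothetical key from above) is enough:

```
literal:sk-ant-api03-AbCdEf123456789
regex:password=\S+
```

This keeps the expressions file short when you don't care what the replacement text looks like.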

Remove large files

On my team, someone once accidentally committed the entire node_modules directory along with a 500MB demo video. The repo ballooned from 50MB to 600MB, and cloning it took half a morning. Here’s how to fix it:

# Analyze repo history and find large files
git filter-repo --analyze
# Results are written to .git/filter-repo/analysis/

# View the largest files
cat .git/filter-repo/analysis/path-all-sizes.txt | sort -rn | head -20

The output looks like this:

=== All paths by reverse size ===
Format: size, packed size, date deleted, path name
 524288000   498234112 2024-03-15 assets/demo-video.mp4
 145234567   138234089 2023-11-20 node_modules.tar.gz
  45678901    43211234 2024-01-08 dist/bundle.min.js

# Remove a specific file by path
git filter-repo --path assets/demo-video.mp4 --invert-paths

# Remove all files larger than 10MB
git filter-repo --strip-blobs-bigger-than 10M
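If git-filter-repo isn't installed yet, plain git can produce a similar ranking for the analysis step. This sketch lists every blob reachable from any ref, largest first; the awk and sort column numbers assume the exact --batch-check format shown:

```shell
# List every blob in history by size, largest first.
# %(rest) carries the path that rev-list attached to each object.
git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  awk '$1 == "blob"' |
  sort -k3 -rn |
  head -20
```

Unlike --analyze, this only looks at reachable objects, but it needs nothing beyond git itself.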

Verifying the results and pushing to remote

Confirm the data has been removed

# Check that the file no longer appears in history
git log --all --full-history -- .env
# No output = successfully removed

# Search for a string across the entire history
git log --all -p | grep -i "sk-ant-api"
# No output = secret has been purged

# Check repo size after removal
git count-objects -vH
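Piping the full git log -p through grep works, but on a big history running git grep over every commit is faster and reports exactly where each match lives (the key prefix here is the same hypothetical one from earlier):

```shell
# Search every reachable commit for the string; each hit prints as
# <commit>:<path>:<matching line>. No output means it is gone.
git grep "sk-ant-api" $(git rev-list --all)
```

On very large histories the expanded rev-list can exceed the shell's argument-length limit; batch it through xargs in that case.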

Clean up and garbage collect

git-filter-repo runs cleanup automatically, but to be safe:

# Force expire all reflogs
git reflog expire --expire=now --all

# Aggressive garbage collect
git gc --prune=now --aggressive

# Check size before and after
git count-objects -vH

Force push to remote

There’s no way around this step — rewriting history means you have to force push:

# Re-add the remote (git-filter-repo removes it automatically to prevent accidental pushes)
git remote add origin https://github.com/yourorg/your-repo.git

# Force push all branches
git push origin --force --all

# Force push all tags
git push origin --force --tags

After the force push, everyone on the team needs to reset — their local copies of the old repo are no longer usable:

# Each team member needs to run
git fetch --all
git reset --hard origin/main

# Or the simpler approach: delete the old repo and re-clone
rm -rf old-repo/
git clone https://github.com/yourorg/your-repo.git

Revoke and rotate all exposed secrets

Removing the data from Git history is not the finish line. If the repo was ever public, or anyone cloned it before you could act — treat those secrets as fully compromised. No exceptions:

  • Revoke and regenerate API keys and passwords immediately
  • Rotate AWS credentials and database passwords
  • Review access logs to check whether the keys were used by anyone
  • Enable GitHub secret scanning to get early warnings going forward

Prevention for next time

After that night, I immediately set up a pre-commit hook to check for secrets before every commit. I use gitleaks — lightweight, no extra runtime required:

# Install gitleaks
brew install gitleaks  # macOS
# or
wget https://github.com/gitleaks/gitleaks/releases/latest/download/gitleaks_linux_x64.tar.gz
tar -xzf gitleaks_linux_x64.tar.gz gitleaks
sudo mv gitleaks /usr/local/bin/

# Scan the current repo
gitleaks detect --source . --verbose

# Add to pre-commit hook
cat > .git/hooks/pre-commit << 'EOF'
#!/bin/bash
gitleaks protect --staged --verbose
if [ $? -ne 0 ]; then
    echo "Gitleaks detected a secret! Commit blocked."
    exit 1
fi
EOF
chmod +x .git/hooks/pre-commit

Finally, add .env to .gitignore. It sounds obvious, but it’s the most commonly overlooked step:

echo '.env' >> .gitignore
echo '*.pem' >> .gitignore
echo 'credentials.json' >> .gitignore
git add .gitignore
git commit -m "chore: ignore sensitive files"
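To double-check that a rule actually matches before the next commit, git check-ignore shows which pattern in which file catches a given path:

```shell
# Exit code 0 means the path is ignored; -v also prints the
# source file, line number, and pattern that matched.
git check-ignore -v .env
```

Running it against a path that should be ignored but isn't (exit code 1, no output) catches typos in .gitignore before they bite.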

Since rolling out the pre-commit hook across our team of eight, we haven’t had a single secret leak incident — and I sleep a lot better at night.
