Lightning-Fast PDF Processing with Python and PyMuPDF: From Simple Scripts to Production

Python tutorial - IT technology blog

Handling PDFs with Code: Why PyMuPDF?

Processing PDFs via code is often a nightmare if you choose the wrong tool. I used to struggle with PyPDF2 and PDFMiner. The result? When the project needed to handle 5,000 invoices a day, the system started to “choke” and crashed constantly due to high RAM consumption.

After 6 months of operating a real-world warehouse system, I can confirm: PyMuPDF (traditionally imported as fitz) is the best choice I have found. Because it is built on MuPDF’s C engine, it can run 5 to 10 times faster than pure-Python libraries. Reading, editing, deleting pages, drawing graphics: it handles all of it.

My script started at just 200 lines and has now grown to over 2,000 lines to handle all sorts of edge cases. Here are the most practical lessons I’ve distilled for you.

Installation and Reading Content: Done in 5 Minutes

Installation is a single pip command. There’s a slight quirk: the package name is pymupdf, but most code imports it as fitz. The name is a “historical legacy” of the library; recent releases also accept import pymupdf, and the fitz name used throughout this post still works.

pip install pymupdf

Try this basic code snippet to read content:

import fitz  # Import PyMuPDF

# Open file with a single line of code
doc = fitz.open("document.pdf")

# Access the first page (index starts at 0)
page = doc[0]
text = page.get_text()

print(f"Page 1 content:\n{text}")
doc.close()

Looks smooth, doesn’t it? But in reality, projects are never that simple.

Real-world Techniques: PDF Extraction and Merging

1. Extracting Structured Text (Blocks)

Plain get_text() sometimes returns text in a jumbled order, because a PDF stores text in the order it was drawn, not the order it is read. My solution is the "blocks" option, which returns the page block by block along with each block’s coordinates. Pass sort=True to get the blocks in top-to-bottom reading order.

import fitz

with fitz.open("report.pdf") as doc:
    for page in doc:
        # Each block is (x0, y0, x1, y1, text, block_no, block_type)
        # sort=True orders blocks top to bottom, then left to right
        blocks = page.get_text("blocks", sort=True)
        for b in blocks:
            print(f"Position: {b[:4]} | Content: {b[4]}")
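
If a page mixes text with images, or you want your own ordering rule, you can post-process the block tuples yourself. Here is a minimal sketch, assuming the seven-field tuple layout (x0, y0, x1, y1, text, block_no, block_type) that get_text("blocks") returns, where block_type 0 means text. The helper names and the one-point rounding tolerance are my own choices, not part of PyMuPDF.

```python
def text_blocks_in_reading_order(blocks):
    """Keep text blocks only, sorted top to bottom, then left to right.

    Each block is (x0, y0, x1, y1, text, block_no, block_type).
    block_type 0 is text, 1 is an image placeholder.
    """
    text_blocks = [b for b in blocks if b[6] == 0]
    # Round y0 so blocks that start on almost the same line sort by x0.
    return sorted(text_blocks, key=lambda b: (round(b[1]), b[0]))


def print_page_blocks(pdf_path):
    """Print every text block of every page (hypothetical helper)."""
    import fitz  # PyMuPDF, imported here so the helper above has no dependency

    with fitz.open(pdf_path) as doc:
        for page in doc:
            for b in text_blocks_in_reading_order(page.get_text("blocks")):
                print(f"Page {page.number + 1} | {b[:4]} | {b[4].strip()}")
```

Rounding is a crude tolerance. For multi-column layouts you would probably group blocks by column first and sort within each column.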

2. Merging Multiple PDF Files

When building a tool to aggregate end-of-month invoices, I really fell in love with the insert_pdf feature. It copies a whole document, or just a page range from file A, into file B without losing page formatting; links and annotations are copied along by default.

def merge_pdfs(file_list, output_name):
    result = fitz.open()
    for file in file_list:
        with fitz.open(file) as m_file:
            result.insert_pdf(m_file)
    result.save(output_name)
    result.close()

merge_pdfs(["part1.pdf", "part2.pdf"], "final_report.pdf")
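
To pick only certain pages, insert_pdf accepts from_page and to_page, both zero-based and inclusive. The sketch below is one way to wire that up. Note that parse_page_range and merge_page_ranges are hypothetical helpers, and the "3-5" page-spec format is something I invented for this illustration.

```python
def parse_page_range(spec):
    """Turn a 1-based inclusive range like "3-5" (or "7") into 0-based (first, last)."""
    first, _, last = spec.partition("-")
    start = int(first)
    end = int(last) if last else start
    if start < 1 or end < start:
        raise ValueError(f"invalid page range: {spec!r}")
    return start - 1, end - 1


def merge_page_ranges(jobs, output_name):
    """Append the requested range of each source file, in order.

    jobs is a list of (path, page_spec) pairs, e.g. [("a.pdf", "1-2")].
    """
    import fitz  # PyMuPDF, imported here so parse_page_range stays dependency-free

    with fitz.open() as result:  # fitz.open() with no arguments creates an empty PDF
        for path, spec in jobs:
            first, last = parse_page_range(spec)
            with fitz.open(path) as src:
                result.insert_pdf(src, from_page=first, to_page=last)
        result.save(output_name)
```

A call such as merge_page_ranges([("part1.pdf", "1-2"), ("part2.pdf", "4")], "summary.pdf") would take the first two pages of one file and the fourth page of the other.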

Adding Watermarks and Professional Security

“Can you insert a faint CONFIDENTIAL watermark on all pages?” If your boss asks that, don’t worry. With PyMuPDF, this only takes a few lines of code instead of hours of manual editing.

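import fitz

def add_watermark(input_pdf, output_pdf, text):
    with fitz.open(input_pdf) as doc:
        for page in doc:
            # Start at mid-height so the slanted text stays on the page
            anchor = fitz.Point(100, page.rect.height / 2)
            # rotate= only accepts multiples of 90, so a 45-degree slant
            # needs morph=(pivot point, rotation matrix) instead
            page.insert_text(anchor, text,
                             fontsize=60,
                             color=(0.8, 0.8, 0.8),
                             fill_opacity=0.3,
                             morph=(anchor, fitz.Matrix(45)))
        doc.save(output_pdf)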

Want to insert a company logo? Just change insert_text to insert_image and define the desired coordinate area (Rect).
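
Here is a rough sketch of the logo case. PyMuPDF measures from the top-left corner of the page in points, so a bottom-right box is computed by subtracting from the page width and height. The names bottom_right_box and add_logo, the 120 x 40 box, and the 36-point margin are all assumptions for illustration.

```python
def bottom_right_box(page_width, page_height, box_width, box_height, margin=36):
    """Return (x0, y0, x1, y1) for a box in the bottom-right corner.

    The origin is the top-left corner and y grows downward.
    Units are points (72 per inch).
    """
    x1 = page_width - margin
    y1 = page_height - margin
    return (x1 - box_width, y1 - box_height, x1, y1)


def add_logo(input_pdf, output_pdf, logo_path, box_width=120, box_height=40):
    """Stamp the image at logo_path into the bottom-right corner of every page."""
    import fitz  # PyMuPDF, imported here so bottom_right_box stays dependency-free

    with fitz.open(input_pdf) as doc:
        for page in doc:
            box = bottom_right_box(page.rect.width, page.rect.height,
                                   box_width, box_height)
            # insert_image takes the target rectangle first, then the image source
            page.insert_image(fitz.Rect(*box), filename=logo_path)
        doc.save(output_pdf)
```

Computing the box per page matters when a file mixes page sizes, which scanned invoices often do.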

Hard-earned Lessons for Production

After many “painful” encounters with corrupted PDF files, here are the key points you must memorize:

  • Memory Management: Never forget doc.close(). When processing thousands of large files, failing to close them will drain the server’s RAM in minutes. It’s best to use with fitz.open(...) as doc:.
  • Output Compression: PDF files often bloat after editing. Use doc.save(filename, garbage=4, deflate=True) to drop unused objects and compress streams. The savings depend on the file, but reductions of 30-50% are possible.
  • Handling Font Errors: If extraction returns strange characters (tofu blocks), the fonts in the PDF usually lack a usable mapping from glyphs to Unicode, so the text layer cannot be decoded reliably. In that case, render the page as an image and run OCR (for example Tesseract) instead of reading the text layer.
  • Coordinate System: Always remember that (0, 0) is the top-left corner of the page and y grows downward. To place something in the bottom-right corner, subtract the margin and the object’s size from page.rect.width and page.rect.height.
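
For the font problem, one way to decide automatically which pages to send to OCR is to measure how much of the extracted text is undecodable. This is only a heuristic sketch. It assumes that unmappable glyphs show up as U+FFFD or as private-use, unassigned, or control code points, and the names bad_char_ratio and pages_needing_ocr plus the 0.1 threshold are my own.

```python
import unicodedata

# Unicode categories that rarely appear in real text:
# Co = private use, Cn = unassigned, Cc = control
SUSPICIOUS_CATEGORIES = {"Co", "Cn", "Cc"}


def bad_char_ratio(text):
    """Share of non-whitespace characters that look undecodable (0.0 to 1.0)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    bad = sum(
        1 for c in chars
        if c == "\ufffd" or unicodedata.category(c) in SUSPICIOUS_CATEGORIES
    )
    return bad / len(chars)


def pages_needing_ocr(pdf_path, threshold=0.1):
    """Return zero-based page numbers whose text layer looks garbled."""
    import fitz  # PyMuPDF, imported here so bad_char_ratio stays dependency-free

    with fitz.open(pdf_path) as doc:
        return [page.number for page in doc
                if bad_char_ratio(page.get_text()) > threshold]
```

Pages with no text at all score 0.0 here, so pure scans need a separate check, such as testing whether get_text() returns an empty string.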

In summary, PyMuPDF is a true “beast”: fast, powerful, and extremely stable. If you are building a document automation system, start using it today. Good luck with your project!
