Cleaning PDF/Office Data for RAG with Unstructured.io: Turning ‘Trash’ into Gold – ITFROMZERO

Table of Contents

Why Raw Data is a RAG “Nightmare”

After six months of working on real-world RAG (Retrieval-Augmented Generation) projects, I’ve learned one key lesson: 80% of a chatbot’s performance depends on data quality. No matter how powerful GPT-4 or Claude 3.5 are, they will struggle if the input is garbage. Imagine throwing a 50-page financial report filled with complex tables into standard text-reading libraries. The result is often a mess of scattered characters and lost context.

In practice, Unstructured.io has been a lifesaver. Instead of just reading raw text, it breaks down documents into Elements like Title, NarrativeText, Table, or List. This preserves the document’s logical structure. Instead of blind chunking by length, you can map titles to metadata, making future retrieval significantly more accurate.

Compared to PyPDF2 or default LangChain Loaders, Unstructured excels at handling Word (.docx), PowerPoint (.pptx), and images thanks to integrated OCR. It turns difficult formats into machine-understandable data.

Installation and Setup

You have two options: use their API or run the open-source library locally. For testing with non-sensitive data, the API is the fastest route. However, to fully master the pipeline, I recommend a local installation.

Unstructured has several system dependencies for OCR and file format processing. On Ubuntu/Debian, run the following command:

# Install necessary system libraries
sudo apt-get update
sudo apt-get install -y libmagic-dev poppler-utils tesseract-ocr libreoffice pandoc

Next, install the Python library. I usually use the [all-docs] version for full support from PDF to Office:

pip install "unstructured[all-docs]" langchain-unstructured

Advanced Extraction Pipeline Configuration

The heart of this library is the partition function. It automatically detects the file type and applies the appropriate extraction strategy. For production use, you should fine-tune the parameters instead of using defaults.

1. Layout-Preserving Extraction

I use the code below for processing complex technical documents. The hi_res strategy uses an AI model to recognize the page layout:

from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="technical_report.pdf",
    strategy="hi_res",           # Use AI for layout detection (Title, Table, Figure)
    infer_table_structure=True,  # Extract table structure to HTML
    chunking_strategy="by_title", # Group text by the nearest title
    max_characters=1200,         # Limit chunk length to optimize context window
    combine_text_under_n_chars=250 # Avoid tiny, fragmented chunks
)

2. Data Cleaning

Extracted data often contains strange characters or line break issues. Unstructured provides built-in cleaners to tidy up this mess with just a few lines of code:

from unstructured.cleaners.core import clean, group_broken_paragraphs

for element in elements:
    # Remove extra whitespace, fix bullet points and incorrect line breaks
    element.text = clean(element.text, extra_whitespace=True, dashes=True, bullets=True)
    element.text = group_broken_paragraphs(element.text)

3. Table Processing – The Key to Accuracy

My advice: never leave tables as plain text. LLMs struggle to understand row-column relationships. Instead, extract tables as HTML. This structure helps GPT-4 understand table data up to 40% better than raw text.

tables = [el for el in elements if el.category == "Table"]
if tables:
    # Get table content as HTML for the Prompt
    table_html = tables[0].metadata.text_as_html
    print(f"Table extracted: {table_html[:100]}...")

Measurement and Optimization

When integrating into an automated system, quality control is mandatory. Pay attention to these two key metrics:

Latency: The hi_res strategy can take 10-15 seconds per PDF page because it runs a vision model. If speed is a priority, use fast for files that already have a text layer.
Metadata Accuracy: Check the output JSON file. Unstructured provides page_number and text coordinates. This information is invaluable for “Source Citation” features, allowing users to click and view the exact location in the original file.

from unstructured.staging.base import elements_to_json
elements_to_json(elements, filename="debug_output.json")

After implementing Unstructured in production, I noticed a 60% reduction in data preprocessing code. Retrieval accuracy increased by about 35% for complex documents. It’s a must-have tool if you want to build a truly professional RAG system.