Haystack 2.0: Building Document Processing Pipelines with Document Indexing, Hybrid Retrieval, and Smart Q&A – ITFROMZERO

The Problem with Typical RAG Pipelines

I once built an internal Q&A system with LangChain. It worked, but when I needed to add a new processing step in the middle of the pipeline, things got messy: callbacks were nested inside callbacks, and I spent an entire afternoon just tracing where the data was flowing.

Haystack (by deepset) solves exactly that pain point: the pipeline is a directed graph in which every component is connected explicitly. Need to add a preprocessing step? Just insert the component. Want to swap the embedding model? Change one line. No need to touch the rest of the logic.

This article dives straight into 3 practical sections: indexing documents correctly, hybrid retrieval for better accuracy, and a production-ready Q&A pipeline.

Quick Start — A Working Q&A Pipeline in 5 Minutes

Installation

pip install haystack-ai openai

The Simplest Q&A Pipeline

from haystack import Document, Pipeline
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.components.generators import OpenAIGenerator
from haystack.components.builders import PromptBuilder
from haystack.document_stores.in_memory import InMemoryDocumentStore

# 1. Document store + indexing
store = InMemoryDocumentStore()
store.write_documents([
    Document(content="Haystack is a Python framework for building AI document processing pipelines."),
    Document(content="BM25 is a keyword-based search algorithm, well-suited for technical documents."),
    Document(content="Embedding retrieval uses vector similarity to find semantically related documents."),
])

# 2. Build pipeline
template = """
Based on the following documents, answer the question:

{% for doc in documents %}
{{ doc.content }}
{% endfor %}

Question: {{ question }}
"""

pipe = Pipeline()
pipe.add_component("retriever", InMemoryBM25Retriever(document_store=store))
pipe.add_component("prompt", PromptBuilder(template=template))
# OpenAIGenerator reads the API key from the OPENAI_API_KEY environment variable by default
pipe.add_component("llm", OpenAIGenerator(model="gpt-4o-mini"))

pipe.connect("retriever", "prompt.documents")
pipe.connect("prompt", "llm")

# 3. Run
result = pipe.run({
    "retriever": {"query": "What is Haystack?"},
    "prompt": {"question": "What is Haystack?"}
})
print(result["llm"]["replies"][0])

This structure is much clearer than nested chains: each component receives input, processes it, and returns output, and the pipeline wires them together. That makes it easy to debug and to test each step individually.

Document Indexing — Getting It Right from the Start

Most RAG systems produce poor results not because the LLM is bad, but because the indexing is wrong. The usual culprits are chunks that are too large, missing metadata, and text that was never cleaned after conversion from PDF.

Complete Indexing Pipeline for PDFs

from haystack import Pipeline
from haystack.components.converters import PyPDFToDocument  # requires: pip install pypdf
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.embedders import OpenAIDocumentEmbedder
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from pathlib import Path

store = InMemoryDocumentStore()

indexing_pipe = Pipeline()
indexing_pipe.add_component("converter", PyPDFToDocument())
indexing_pipe.add_component("cleaner", DocumentCleaner(
    remove_empty_lines=True,
    remove_extra_whitespaces=True,
))
indexing_pipe.add_component("splitter", DocumentSplitter(
    split_by="sentence",
    split_length=5,   # 5 sentences per chunk
    split_overlap=1,  # 1 sentence overlap to preserve context at boundaries
))
indexing_pipe.add_component("embedder", OpenAIDocumentEmbedder(
    model="text-embedding-3-small"
))
indexing_pipe.add_component("writer", DocumentWriter(document_store=store))

indexing_pipe.connect("converter", "cleaner")
indexing_pipe.connect("cleaner", "splitter")
indexing_pipe.connect("splitter", "embedder")
indexing_pipe.connect("embedder", "writer")

# Index all PDFs in the directory
pdf_files = list(Path("./docs").glob("*.pdf"))
indexing_pipe.run({"converter": {"sources": pdf_files}})
print(f"Indexed {store.count_documents()} chunks")

For technical documents I often use split_by="sentence" with split_length=5: smaller chunks tend to make retrieval more precise, and a 1-sentence overlap keeps context from being lost at chunk boundaries.
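To make the effect of the overlap concrete, here is a minimal plain-Python sketch of sentence windowing. It is not Haystack's implementation: the regex sentence splitter and the helper names split_sentences and chunk_sentences are assumptions made for illustration, and only the parameter names mirror DocumentSplitter.

```python
import re


def split_sentences(text):
    # Naive splitter (assumption): break on whitespace that follows ., ! or ?
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_sentences(sentences, split_length=5, split_overlap=1):
    # Slide a window of split_length sentences, stepping so that
    # split_overlap sentences are repeated at the start of the next chunk.
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + split_length]))
        if start + split_length >= len(sentences):
            break  # the last window already reached the end of the text
    return chunks


text = " ".join(f"Sentence {i}." for i in range(1, 12))  # 11 toy sentences
chunks = chunk_sentences(split_sentences(text))
for chunk in chunks:
    print(chunk)
```

With 11 sentences this yields three chunks, and the last sentence of each chunk is repeated as the first sentence of the next one, which is what keeps a thought that straddles a boundary retrievable from either side.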

Adding Metadata for Later Filtering

from haystack import Document

docs = [
    Document(
        content="Technical documentation content about Docker...",
        meta={
            "source": "docker-guide.pdf",
            "category": "devops",
            "language": "vi",
        }
    )
]
store.write_documents(docs)

# Retrieval with category filter
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
retriever = InMemoryBM25Retriever(document_store=store)
results = retriever.run(
    query="Docker configuration",
    filters={"field": "meta.category", "operator": "==", "value": "devops"}
)
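Filters can also be nested with logical operators. The sketch below is a plain-Python evaluator for that dictionary shape, meant only to show how a combined condition is read. The helper names get_field and matches, the sample documents, and the handling of missing fields are assumptions for illustration; the real evaluation happens inside the document store.

```python
import operator

# Comparison operators covered by this sketch (a subset).
COMPARATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "in": lambda value, allowed: value in allowed,
}


def get_field(doc, dotted_path):
    # Resolve a dotted path such as "meta.category" against nested dicts.
    current = doc
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def matches(doc, flt):
    op = flt["operator"]
    if op == "AND":
        return all(matches(doc, cond) for cond in flt["conditions"])
    if op == "OR":
        return any(matches(doc, cond) for cond in flt["conditions"])
    value = get_field(doc, flt["field"])
    if value is None:
        return False  # assumption: a missing field never matches
    return COMPARATORS[op](value, flt["value"])


# Hypothetical documents, represented as plain dicts.
sample_docs = [
    {"content": "Docker guide", "meta": {"category": "devops", "language": "vi"}},
    {"content": "Kubernetes notes", "meta": {"category": "devops", "language": "en"}},
    {"content": "Onboarding", "meta": {"category": "hr", "language": "vi"}},
]

combined = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.category", "operator": "==", "value": "devops"},
        {"field": "meta.language", "operator": "==", "value": "vi"},
    ],
}

selected = [d["content"] for d in sample_docs if matches(d, combined)]
print(selected)
```

Only the first document satisfies both conditions. This is why consistent metadata at indexing time matters: a filter can only narrow on fields that were actually written.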

Hybrid Retrieval — Combining BM25 and Embedding

BM25 excels with exact keywords: proper names, error codes, specific commands. Embedding retrieval excels with questions phrased differently but with the same meaning. Hybrid combines both — this is why retrieval quality improves noticeably in practice.
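To see why keyword scoring behaves this way, here is a small plain-Python sketch of Okapi BM25. It is an illustration under assumptions (whitespace tokenization, k1=1.5, b=0.75, made-up sample texts), not the scoring code used by InMemoryBM25Retriever.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    # Tokenize by lowercasing and splitting on whitespace (assumption).
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # a term that is absent contributes nothing
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


sample_texts = [
    "Container exits with error code 137 when memory runs out",
    "Containers may stop unexpectedly if resources are exhausted",
    "Use docker logs to inspect output from a container",
]
scores = bm25_scores("error code 137", sample_texts)
print(scores)
```

The first text scores above zero because it contains the exact query terms. The second text describes the same situation in different words and scores zero, which is the gap embedding retrieval is meant to cover.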

from haystack import Pipeline
from haystack.components.retrievers.in_memory import (
    InMemoryBM25Retriever,
    InMemoryEmbeddingRetriever
)
from haystack.components.embedders import OpenAITextEmbedder
from haystack.components.joiners import DocumentJoiner
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator

template = """
Based on the following documents, answer the question:
{% for doc in documents %}
- {{ doc.content }}
{% endfor %}
Question: {{ question }}
"""

query_pipe = Pipeline()
query_pipe.add_component("text_embedder", OpenAITextEmbedder(
    model="text-embedding-3-small"
))
query_pipe.add_component("embedding_retriever", InMemoryEmbeddingRetriever(
    document_store=store, top_k=10
))
query_pipe.add_component("bm25_retriever", InMemoryBM25Retriever(
    document_store=store, top_k=10
))
query_pipe.add_component("joiner", DocumentJoiner(
    join_mode="reciprocal_rank_fusion"  # RRF: merges both result lists by rank position, not raw score
))
query_pipe.add_component("prompt", PromptBuilder(template=template))
query_pipe.add_component("llm", OpenAIGenerator(model="gpt-4o-mini"))

# Connect
query_pipe.connect("text_embedder.embedding", "embedding_retriever.query_embedding")
query_pipe.connect("embedding_retriever", "joiner")
query_pipe.connect("bm25_retriever", "joiner")
query_pipe.connect("joiner", "prompt.documents")
query_pipe.connect("prompt", "llm")

# Run
result = query_pipe.run({
    "text_embedder": {"text": "How do I debug containers that won't start?"},
    "bm25_retriever": {"query": "How do I debug containers that won't start?"},
    "prompt": {"question": "How do I debug containers that won't start?"}
})
print(result["llm"]["replies"][0])

From an internal knowledge base project: hybrid retrieval improved recall by roughly 15–20% compared to embedding alone. The difference was most pronounced when queries contained product names, error codes, or specific commands — exactly what BM25 handles better than embedding.
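The idea behind reciprocal rank fusion can be sketched in a few lines of plain Python. This is a simplified illustration, not DocumentJoiner's code: the constant k=60 is a commonly used default and an assumption here, the document IDs are made up, and the library may scale or weight scores differently.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    # Each document earns 1 / (k + rank) from every list it appears in;
    # documents ranked well by several retrievers accumulate the most.
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; break ties by ID for stability.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from the two retrievers, best match first.
bm25_ranking = ["doc_err42", "doc_install", "doc_network"]
embedding_ranking = ["doc_debug", "doc_err42", "doc_install"]

fused = reciprocal_rank_fusion([bm25_ranking, embedding_ranking])
print(fused)
```

Because only rank positions are used, the BM25 and embedding scores never need to be on the same scale, which is what makes this join mode a convenient default for hybrid setups.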

Advanced — Persistent Store and Production Tips

Using ChromaDB Instead of InMemory

pip install chroma-haystack

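# Recent chroma-haystack releases expose the store under haystack_integrations
from haystack_integrations.document_stores.chroma import ChromaDocumentStore

store = ChromaDocumentStore(
    collection_name="my_docs",
    persist_path="./chroma_db"  # Data persists after restart
)
# Note: the InMemory* retrievers only work with InMemoryDocumentStore;
# with Chroma, use the retrievers that ship with the chroma-haystack package.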

Serializing Pipelines for Versioning and Reuse

# Save the pipeline definition (YAML is the default format)
with open("pipeline.yaml", "w") as f:
    query_pipe.dump(f)

# Load later, with no need to rebuild from scratch.
# Only the pipeline definition is stored, not the documents held in an in-memory store.
from haystack import Pipeline
with open("pipeline.yaml") as f:
    query_pipe = Pipeline.load(f)

Debugging Each Component Independently

# Test the retriever without running the full pipeline
retriever = InMemoryBM25Retriever(document_store=store)
docs = retriever.run(query="Docker", top_k=3)

print(f"Found {len(docs['documents'])} documents")
for doc in docs["documents"]:
    print(f"Score: {doc.score:.3f} | {doc.content[:100]}...")

This is something I find Haystack does better than LangChain: testing each component in isolation. When the system gives a wrong answer, I can immediately narrow down whether the issue is in indexing, retrieval, or the prompt — no need to dump the entire chain log.
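Once a retriever can be run in isolation, its output can be checked against a small hand-labelled set. Below is a minimal sketch of recall@k in plain Python; the function name and the document IDs are hypothetical, and in practice the IDs would be taken from the documents the retriever returns.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    # Fraction of the relevant documents that appear in the top k results.
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


# Hypothetical example: IDs in ranked order, plus the labelled relevant set.
retrieved = ["a", "b", "c", "d"]
relevant = {"b", "d", "z"}

print(recall_at_k(retrieved, relevant, k=3))
print(recall_at_k(retrieved, relevant, k=4))
```

Running the same labelled questions through BM25 alone, embedding alone, and the hybrid setup gives a like-for-like comparison before any prompt or LLM is involved.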

Practical Tips for Choosing Haystack

Use Haystack when you need a clear pipeline structure, multiple document sources, pipeline serialization/versioning, or a team maintaining it collaboratively.
Use LangChain when you need to prototype quickly with many ready-made integrations and are already familiar with the LangChain ecosystem.
Use custom code when the pipeline is only 1–2 steps and you don’t want a framework dependency.

One easy mistake to make: Haystack 2.0 completely changed its architecture from v1. If the API in a tutorial looks different from the code here, that tutorial was most likely written for v1. The current package is haystack-ai, not farm-haystack. V2 is more type-safe, has more explicit pipelines, and is significantly easier to debug.