The Problem: LLMs Are Smart But Know Nothing About Your Data
ChatGPT and Claude handle general knowledge questions remarkably well. But ask them about your company’s internal documents, your product database, or the technical PDFs for your current project, and you’ll get hallucinated or irrelevant answers. This isn’t a flaw in the models; they simply don’t have access to your private data.
Fine-tuning sounds appealing — but it’s far from straightforward in practice. A single training run can cost anywhere from a few hundred to several thousand dollars depending on model size, requires substantial GPU resources, and every time your documents are updated, you have to retrain from scratch. I went down that road and gave up after two weeks when I ran the numbers on ongoing operational costs.
RAG (Retrieval-Augmented Generation) is a far more pragmatic approach. Instead of cramming knowledge into the model, you let it look up relevant documents before answering — like handing it a stack of reference materials right when it needs them.
Core Concepts You Need to Understand Before Coding
RAG Works in 3 Steps
- Indexing: Documents are split into smaller chunks, converted into vector embeddings, and stored in a vector database.
- Retrieval: When a user asks a question, the system finds document chunks whose vectors are closest to the query.
- Generation: The retrieved chunks and the question are fed into the LLM, which generates an answer grounded in your actual documents instead of guessing, sharply reducing hallucination.
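The three steps above can be sketched end to end in a few lines. The toy below uses word-count vectors in place of a real embedding model and a stubbed prompt in place of an LLM call, so it only illustrates the flow, not production quality:

```python
import math

def embed(text):
    # Toy "embedding": bag-of-words counts (a real system uses an embedding model)
    vec = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors
    dot = sum(a[w] * b.get(w, 0) for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# 1. Indexing: chunk the documents and store each chunk with its vector
chunks = ["Deploy with the deploy.sh script", "Timeouts are retried three times"]
index = [(chunk, embed(chunk)) for chunk in chunks]

# 2. Retrieval: find the chunk whose vector is closest to the query vector
query = "how do I deploy the script"
qvec = embed(query)
best_chunk, _ = max(index, key=lambda item: cosine(qvec, item[1]))

# 3. Generation: the retrieved context plus the question become the LLM prompt
prompt = f"Context: {best_chunk}\n\nQuestion: {query}"
print(best_chunk)
```

The same shape, with a real embedding model and vector database, is exactly what the LangChain pipeline below automates.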
LangChain Building Blocks
LangChain breaks the pipeline into 6 building blocks. Understanding each one will save you a lot of debugging time when things go wrong:
- Document Loaders: Load documents from PDFs, TXT files, URLs, databases…
- Text Splitters: Break long text into smaller chunks
- Embeddings: Convert text into numerical vectors
- Vector Stores: Store and search vectors (Chroma, FAISS, Pinecone…)
- Retrievers: A standardized interface for querying the vector store
- Chains: Connect components into a complete pipeline
Hands-On: Building RAG from Scratch
Step 1: Install Dependencies
Create a virtualenv and install the required packages:
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install langchain langchain-community langchain-openai
pip install chromadb
pip install pypdf # if you need to read PDFs
pip install python-dotenv
Create a .env file to store your API key:
OPENAI_API_KEY=sk-...your-key-here...
Step 2: Prepare Documents and Build the Vector Store
The demo below uses plain text for clarity. Once it’s working, you can swap in a PDF loader or web scraper — the code structure stays the same.
from dotenv import load_dotenv
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
load_dotenv()
# Load documents
loader = TextLoader("docs/manual.txt", encoding="utf-8")
documents = loader.load()
# Split into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,     # Max 500 characters per chunk
    chunk_overlap=50,   # 50-character overlap to preserve context
    length_function=len,
)
chunks = splitter.split_documents(documents)
print(f"Total chunks: {len(chunks)}")
# Create embeddings and save to Chroma
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"  # Persist to disk for reuse
)
print("Vector store created successfully!")
Chunk size has a significant impact on quality: 500–800 characters generally works well for technical documentation. Too small (under 200) and you lose context; too large (over 1500) and retrieval accuracy drops because the vectors become “blurry”. Experiment with your actual data before settling on a number.
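To see how chunk_size and chunk_overlap interact, here is a deliberately simplified character-window splitter. The real RecursiveCharacterTextSplitter also prefers to break on paragraph and sentence boundaries, but the windowing arithmetic is the same idea:

```python
def split_text(text, chunk_size=500, chunk_overlap=50):
    # Simplified splitter: fixed-size windows that step forward by
    # (chunk_size - chunk_overlap), so consecutive chunks share text
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "The quick brown fox. " * 60          # 1260 characters of sample text
chunks = split_text(doc, chunk_size=500, chunk_overlap=50)
print(len(chunks), [len(c) for c in chunks])
```

The last 50 characters of each chunk reappear at the start of the next one, which is what keeps a sentence that straddles a boundary retrievable from either side.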
Step 3: Build the RAG Chain
This section connects the retriever to the LLM — and it’s where you control how the model responds:
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
# Reload vector store from disk (if already created)
embeddings = OpenAIEmbeddings()
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings
)
# Create retriever — fetch the 3 most relevant chunks
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 3}
)
# Custom prompt so the model answers based on context
prompt_template = """You are a technical assistant. Use the following information to answer the question.
If you cannot find relevant information, clearly state that it is not covered in the documentation.
Reference information:
{context}
Question: {question}
Answer:"""
PROMPT = PromptTemplate(
    template=prompt_template,
    input_variables=["context", "question"]
)
# Create chain
llm = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # "stuff" = pack all context into a single prompt
    retriever=retriever,
    chain_type_kwargs={"prompt": PROMPT},
    return_source_documents=True  # Return source documents for debugging
)
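Under the hood the template is plain string substitution. Printing a filled-in example, with made-up context, shows exactly what the model receives at generation time:

```python
template = """You are a technical assistant. Use the following information to answer the question.
If you cannot find relevant information, clearly state that it is not covered in the documentation.

Reference information:
{context}

Question: {question}

Answer:"""

# Pretend the retriever returned these two chunks (hypothetical content)
retrieved = ["Deploys run via deploy.sh.", "Rollbacks use the previous git tag."]
filled = template.format(context="\n\n".join(retrieved),
                         question="How do I deploy?")
print(filled)
```

Seeing the assembled prompt like this is also the fastest way to debug a chain that gives odd answers: most problems turn out to be bad context, not a bad model.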
Step 4: Run a Test and See the Results
def ask(question: str):
    result = qa_chain.invoke({"query": question})
    print(f"\nQuestion: {question}")
    print(f"\nAnswer: {result['result']}")
    print("\n--- Source Documents ---")
    for i, doc in enumerate(result['source_documents']):
        print(f"[{i+1}] {doc.page_content[:150]}...")
        print(f"    (from file: {doc.metadata.get('source', 'unknown')})")
        print()
# Test it out
ask("What is the process for deploying the application to production?")
ask("How do you handle timeout errors in the system?")
Step 5: Upgrade — Using LCEL (LangChain Expression Language)
LangChain version 0.2 and later encourages using LCEL over legacy chains. The code is shorter, easier to extend with additional processing steps, and streaming support is built in:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
# LCEL chain — readable and easy to extend
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | PROMPT
    | llm
    | StrOutputParser()
)
# Stream response (ideal for web apps)
for chunk in rag_chain.stream("How do I set up the development environment?"):
    print(chunk, end="", flush=True)
print()
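The | operator is less magic than it looks: LCEL components implement it so that each stage's output feeds the next. A toy version, not LangChain's actual implementation, captures the idea:

```python
class Pipe:
    """Toy runnable: wraps a function and supports chaining with |."""
    def __init__(self, fn):
        self.fn = fn

    def __or__(self, other):
        # (a | b) is a new stage that runs a, then b on a's result
        return Pipe(lambda x: other.fn(self.fn(x)))

    def invoke(self, x):
        return self.fn(x)

# Stand-ins for retriever, prompt, and LLM (all hypothetical)
retrieve = Pipe(lambda q: f"[context for: {q}]")
build_prompt = Pipe(lambda ctx: f"Answer using {ctx}")
fake_llm = Pipe(lambda prompt: prompt.upper())

chain = retrieve | build_prompt | fake_llm
print(chain.invoke("deploy steps"))
```

This is why adding a step to an LCEL chain is so cheap: any callable that transforms its input can be dropped into the pipeline.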
Handling PDF Documents
Most real-world projects I’ve worked on have data in PDF format — technical documentation, spec sheets, company policies. Just swap the loader and everything else stays the same:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("docs/technical-spec.pdf")
pages = loader.load() # Each page becomes a Document
# Then split and index as usual
chunks = splitter.split_documents(pages)
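One detail worth knowing: PyPDFLoader records the page number (0-based) in each Document's metadata, and split_documents carries metadata through to the chunks, so answers can cite pages. A sketch using stand-in dicts rather than real Document objects:

```python
# Stand-ins for chunks split from a PDF; the metadata shape mirrors
# what PyPDFLoader produces ("source" path plus 0-based "page")
chunks = [
    {"text": "Install steps...",   "metadata": {"source": "spec.pdf", "page": 2}},
    {"text": "Timeout handling...", "metadata": {"source": "spec.pdf", "page": 7}},
]

def cite(chunk):
    # Convert 0-based page index to the human-readable page number
    meta = chunk["metadata"]
    return f"{meta['source']}, page {meta['page'] + 1}"

for c in chunks:
    print(cite(c))
```

Wiring a helper like this into the ask() function above turns "from file: spec.pdf" into a citation users can actually check.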
Summary and Next Steps
I used this approach to build an internal technical support chatbot: roughly 200 documents, about 50MB of PDFs in total. After two months in production, question-answering accuracy was around 85%, versus essentially zero without RAG, since the base model had never seen these documents. The critical factor wasn’t the code but the quality of the input data: well-structured, clear documents produce great results; blurry scans or disorganized content will defeat even the most polished pipeline.
Want to push accuracy even higher? Here are some directions worth exploring:
- Hybrid Search: Combine vector search with BM25 (keyword search) to improve accuracy
- Reranking: Use a cross-encoder model to re-rank results before passing them to the LLM
- Conversation Memory: Add chat history so the system retains context across the conversation
- Alternative Vector Stores: For production-grade deployments, consider Pinecone, Weaviate, or pgvector (PostgreSQL)
- Alternative Embeddings: Don’t want to use OpenAI? Try HuggingFaceEmbeddings with the BAAI/bge-m3 model, which handles multiple languages well and is completely free
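To make the Hybrid Search bullet concrete, here is a minimal score-fusion sketch. In a real pipeline you would combine LangChain's BM25Retriever with the vector retriever (for example via EnsembleRetriever); the toy below just blends a keyword-overlap score with a pretend vector score:

```python
def keyword_score(query, chunk):
    # Fraction of query words present in the chunk (crude stand-in for BM25)
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q)

def hybrid_rank(query, chunks, vector_scores, alpha=0.5):
    # Blend: alpha * vector similarity + (1 - alpha) * keyword overlap
    scored = [
        (alpha * vec + (1 - alpha) * keyword_score(query, chunk), chunk)
        for chunk, vec in zip(chunks, vector_scores)
    ]
    return [chunk for score, chunk in sorted(scored, reverse=True)]

chunks = ["error code E42 timeout", "general network troubleshooting"]
# Pretend the embedding model scored the vaguer chunk higher
ranked = hybrid_rank("E42 timeout", chunks, vector_scores=[0.30, 0.60])
print(ranked[0])
```

The exact-match chunk wins despite its lower vector score, which is precisely the failure mode hybrid search exists to fix: embeddings can miss rare identifiers like error codes that keyword search catches trivially.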
Start with a small dataset — 20 to 50 documents is enough to see the system working. Measure accuracy by asking questions yourself and comparing the answers to what you expect, then scale up from there. Skipping the evaluation step is the most common mistake — discovering problems only after deploying to production is expensive to fix.
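That evaluation step can be as simple as a list of questions paired with a keyword the correct answer must contain, scored after every change to the pipeline. Here ask_fn is a stub standing in for your actual RAG chain:

```python
def evaluate(ask_fn, test_cases):
    # test_cases: list of (question, keyword the answer must mention)
    hits = sum(
        1 for question, keyword in test_cases
        if keyword.lower() in ask_fn(question).lower()
    )
    return hits / len(test_cases)

# Stub standing in for the real qa_chain (hypothetical fixed answer)
def fake_ask(question):
    return "Deploy to production with deploy.sh after tests pass."

cases = [
    ("How do we deploy?", "deploy.sh"),
    ("What runs before deploy?", "tests"),
    ("How are timeouts handled?", "retry"),
]
print(f"accuracy: {evaluate(fake_ask, cases):.0%}")
```

Keyword matching is a blunt instrument, but a 20-question suite like this run before and after every chunk-size or prompt tweak will catch regressions that eyeballing a few answers never will.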

