The Problem I Faced: LLMs Know Nothing About Internal Data
About 8 months ago, I was tasked with building a chatbot to help employees look up internal documents — over 400 PDF files covering company procedures, policies, and technical documentation. My initial thought was simple: use the ChatGPT API, pass the question in, get the answer back. Reality turned out to be far more complicated.
The first problem appeared in the very first week: the model gave completely wrong answers because it had no knowledge of the company’s internal processes. Ask “What is the leave request process?” — it would fabricate a process that sounded plausible but was entirely incorrect. Classic hallucination.
I tried stuffing all the document content into the prompt — and immediately hit the context window limit. 300 pages of PDFs can’t fit into 128k tokens, and even if they could, the per-token API costs would explode.
The Root Cause: LLMs Are “Closed Libraries”
LLMs are trained on public data up to a certain point (training cutoff). After that, the model is “frozen” — it doesn’t learn on its own, and there’s no way for it to know about your company’s internal documents, private databases, or any private information.
When you ask about something the model doesn’t know, it has two options: say “I don’t know” (rare) or press ahead and fabricate a plausible-sounding answer (hallucination — extremely common). Both make the chatbot unreliable. With process documents or company policies, employees acting on incorrect information is worse than having no chatbot at all.
Three Approaches — and Why the First Two Don’t Work
Approach 1: Fine-tuning the Model
Fine-tuning means retraining the model with your own data. It sounds great in theory, but in practice there are issues:
- Compute costs are very high, requiring powerful GPUs
- Time-consuming: data preparation → train → evaluate → deploy pipeline
- When documents are updated (policy changes, new versions), you have to fine-tune from scratch
- Doesn’t solve the hallucination problem — the model can still “misremember”
Fine-tuning is appropriate when you want to change how the model responds — tone, format, language. But if you want to inject knowledge from internal documents? Wrong tool.
Approach 2: Stuffing All Context into the Prompt
The most direct approach: stuff all documents into the system prompt. For small document collections (under 50 pages) it’s passable, but:
- Context windows have limits (even GPT-4’s 128k or Claude’s 200k isn’t enough for large document collections)
- Token costs increase linearly with every request
- Model performance degrades with very long contexts (the “lost in the middle” problem)
Approach 3: RAG — The Only One That Works Well
RAG (Retrieval-Augmented Generation) takes a completely different approach: instead of stuffing everything into the prompt, it finds and injects only the most relevant parts of the documents for the given question. It sounds simple — and it is. But this is exactly what solves all three problems mentioned above.
The basic flow:
- Convert all documents into vector embeddings and store them in a vector database
- When a user asks a question, convert the question into a vector
- Find document chunks with the closest vectors (semantic similarity)
- Inject those chunks into the prompt alongside the question, then send to the LLM
- The LLM answers based on the provided context
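The five steps above can be sketched end to end with a toy example. Everything here is a stand-in: word-set overlap plays the role of embedding similarity, and the final LLM call is only simulated by assembling the prompt — no real model is involved.

```python
import re

# Toy sketch of the RAG flow above. Word-set overlap stands in for a
# real embedding model; nothing here calls an actual LLM.
def embed(text):
    return set(re.findall(r"\w+", text.lower()))  # "vector" = set of words

def similarity(a, b):
    return len(a & b) / len(a | b)  # Jaccard overlap as a stand-in for cosine

# Step 1: "index" the document chunks
chunks = [
    "The leave request process starts in the HR portal.",
    "Servers are patched on the first Monday of each month.",
    "Expense reports require manager approval.",
]
index = [(embed(c), c) for c in chunks]

# Steps 2-3: embed the question and retrieve the closest chunk(s)
def retrieve(question, k=1):
    q = embed(question)
    ranked = sorted(index, key=lambda item: similarity(q, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]

# Steps 4-5: inject retrieved context into the prompt sent to the LLM
question = "What is the leave request process?"
context = "\n".join(retrieve(question))
prompt = f"Based on these documents:\n{context}\n\nQuestion: {question}"
print(context)
```

The structure is identical to the real pipeline that follows; only the embedding and generation pieces get swapped out for actual models.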
Building RAG with LangChain — The Actual Code I Use
Installing Dependencies
pip install langchain langchain-community langchain-openai
pip install chromadb # vector database
pip install pypdf # read PDF files
pip install python-dotenv
I use ChromaDB because it runs locally without a server, making it ideal for prototyping. For larger production-scale deployments, you can switch to Qdrant or Pinecone.
Step 1: Load and Chunk Documents
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Load all PDFs in the directory
loader = PyPDFDirectoryLoader("./docs/")
documents = loader.load()
# Split into chunks — this is the most important step
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # ~1000 characters per chunk
    chunk_overlap=200,    # overlap to avoid losing context between chunks
    separators=["\n\n", "\n", ".", " "]
)
chunks = splitter.split_documents(documents)
print(f"Total chunks: {len(chunks)}")
Chunk size is the parameter I spent the most time tuning. Too small (200-300 characters) and you lose context; too large (2000+) and you introduce too much noise into the prompt. 800-1200 characters is the sweet spot for technical documentation.
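To build intuition for how chunk_size and chunk_overlap interact, a simplified fixed-window splitter is enough. This is an illustration only — the real RecursiveCharacterTextSplitter also respects the separators list, so its counts will differ somewhat.

```python
# Simplified fixed-window splitter: each chunk starts chunk_size - overlap
# characters after the previous one. Illustration only -- the real
# RecursiveCharacterTextSplitter also splits on separators.
def window_chunks(text, size, overlap):
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

doc = "x" * 10_000  # stand-in for ~10k characters of extracted PDF text

for size, overlap in [(300, 60), (1000, 200), (2000, 400)]:
    n = len(window_chunks(doc, size, overlap))
    print(f"size={size}, overlap={overlap} -> {n} chunks")
```

Smaller chunks mean more vectors to store and retrieve; larger chunks mean fewer of them but more noise per retrieved chunk.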
Step 2: Create Embeddings and Save to Vector Store
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
import os
from dotenv import load_dotenv
load_dotenv()
# Create embeddings
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",  # cheaper than large, good enough for English/Vietnamese
    openai_api_key=os.getenv("OPENAI_API_KEY")
)
# Save to ChromaDB (persist_directory so we don't have to re-embed each time)
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)
print("Vector store created and saved.")
Step 3: Build the Retrieval Chain
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
# Load existing vector store (no need to re-embed)
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings
)
# Retriever — fetch the top 4 most relevant chunks
retriever = vectorstore.as_retriever(
    search_type="mmr",  # Maximum Marginal Relevance — reduces redundancy
    search_kwargs={"k": 4, "fetch_k": 10}
)
# Custom prompt — important to prevent the model from hallucinating
prompt_template = """Based on the documents provided below, please answer the question.
If the information is not in the documents, clearly state "I could not find this information in the documents."
Reference Documents:
{context}
Question: {question}
Answer:"""
PROMPT = PromptTemplate(
    template=prompt_template,
    input_variables=["context", "question"]
)
# Build chain
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": PROMPT},
    return_source_documents=True  # for debugging and citing sources
)
# Test
result = qa_chain.invoke({"query": "What is the leave request process?"})
print(result["result"])
print("\nSources:", [doc.metadata for doc in result["source_documents"]])
Step 4: Complete Script — Separating Index and Query
# index.py — run once whenever documents are added or updated
python index.py --docs-dir ./docs/

# query.py — run for day-to-day questions
python query.py --question "New employee onboarding process"
Separating index and query is something I learned after receiving complaints from the team: every time the app restarted, everyone had to wait 5 minutes for all documents to be re-embedded. Persist the vectorstore to disk and only re-index when documents change.
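The indexing entry point can be as thin as the sketch below. build_vectorstore is a hypothetical name for the load/chunk/embed/persist code from Steps 1 and 2, stubbed out here so only the CLI wiring is shown.

```python
# index.py -- hypothetical sketch of the one-off indexing entry point.
import argparse

def build_vectorstore(docs_dir, persist_dir):
    # Stub: in the real script this runs Steps 1-2 above
    # (load PDFs, chunk, embed, persist to disk).
    return f"indexed {docs_dir} -> {persist_dir}"

def main(argv=None):
    parser = argparse.ArgumentParser(description="(Re)build the vector store")
    parser.add_argument("--docs-dir", default="./docs/")
    parser.add_argument("--persist-dir", default="./chroma_db")
    args = parser.parse_args(argv)
    return build_vectorstore(args.docs_dir, args.persist_dir)

if __name__ == "__main__":
    print(main())
```

query.py then only ever loads the persisted store (as in Step 3), so startup is instant regardless of corpus size.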
Pitfalls I Stumbled Into — So You Don’t Have To
1. Ignoring Chunk Overlap
Many tutorials set chunk_overlap=0 for simplicity. In practice, important answers often sit right on the boundary between two chunks. An overlap of 15-20% of chunk_size is what I currently use.
2. Not Validating Retrieved Documents
The retriever doesn’t know when it’s returning irrelevant results — it will always return the top-k regardless of how low the score is. Set a threshold to filter first:
retriever = vectorstore.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.7, "k": 4}
)
3. Missing Metadata
When a user asks “which document mentions X”, you need to know which file and page that chunk came from. LangChain’s PDF loader automatically adds source and page to metadata — don’t remove them.
4. Vietnamese Text and Embeddings
For Vietnamese text: OpenAI’s text-embedding-3-small works quite well — at around $0.02/1M tokens, it’s nearly negligible at small-business scale. On a very tight budget, try paraphrase-multilingual-mpnet-base-v2 from HuggingFace — it runs locally, it’s free, and the quality is acceptable.
Results After 6 Months in Production
The RAG system I built handles around 200-300 questions per day from employees. Accuracy (correct answers with sources traced back to real documents) sits around 85-90% — compared to 0% when using a plain LLM with no context.
API costs also dropped significantly. Stuffing all 400 files into the prompt could cost 50k-100k tokens per query. Now the average is only around 1,500-2,000 tokens — meaning 30-50x cheaper.
Start small: ChromaDB locally, 50-100 sample pages, tested against 20-30 real questions from your team. One week is enough to know whether RAG fits your specific use case — much faster and cheaper than waiting for a fine-tuning pipeline to finish.

