The Real Problem: When Internal Documents Can’t Go to the Cloud
My company once had a mountain of technical documentation: dozens of PDF and Word files covering operating procedures, contracts, and equipment manuals. Every time someone needed to look something up, they’d open file after file, hit Ctrl+F, and flip back and forth. A simple question could eat up an entire hour.
My first instinct was to upload everything to ChatGPT and ask away. But the moment I dragged a file into the browser, our IT manager stepped in and shut it down — the documents contained customer information and confidential contracts that couldn’t be sent to OpenAI’s servers.
That’s when I started exploring LlamaIndex combined with Ollama — running AI entirely on our internal server, with not a single byte of data leaving the company.
Three Common Ways to “Talk” to Your Documents
Before getting into the technical details, I want to give you an honest comparison of three approaches so you can pick the right tool for your situation.
Option 1 — Upload Directly to ChatGPT/Claude
The simplest approach: drag a PDF into ChatGPT and start asking. No setup required, works immediately.
- Pros: No coding needed, instant setup, high answer quality
- Cons: Data is sent to OpenAI/Anthropic servers. File size limits apply. You have to re-upload every session. Not suitable for sensitive documents.
Option 2 — LangChain RAG with a Cloud LLM
Build a RAG pipeline with LangChain using the OpenAI API as the LLM. This is the approach many developers are already familiar with; a unified proxy like LiteLLM can simplify switching between OpenAI, Anthropic, and other providers in this setup.
- Pros: Flexible, broad ecosystem, plenty of documentation
- Cons: Your queries (and the relevant document chunks) are still sent to the cloud. API costs accrue per token. Requires an internet connection.
Option 3 — LlamaIndex + Ollama (Fully Offline)
LlamaIndex is purpose-built for document indexing and querying. Paired with Ollama running a local LLM, the entire pipeline runs 100% on your own machine.
- Pros: No data leaves your environment. No API costs. Works without internet after initial setup.
- Cons: Requires capable hardware. Local model quality generally trails GPT-4.
How Does LlamaIndex Differ from LangChain?
LangChain is a general-purpose framework — it handles agents, chains, tool calling, RAG, and more. But trying to do everything means its API changes constantly and can be more complex than necessary for a pure document Q&A use case.
LlamaIndex was designed from day one to solve one problem: connecting your data to an LLM. Its concepts are cleaner and more focused:
- Document Loaders: Dozens of built-in connectors for PDF, Word, CSV, Notion, Google Docs, and more
- Node Parser: Splits documents into chunks with fine-grained control
- Index: Multiple index types (VectorStoreIndex, SummaryIndex, KnowledgeGraphIndex…)
- Query Engine: Handles questions and generates answers grounded in your documents
For a pure document Q&A task, LlamaIndex typically requires significantly less code. The example in Step 3 below is around 15 lines; in my experience the LangChain equivalent usually takes about twice as many, not counting the effort of tracking which APIs were recently deprecated.
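To make the four concepts above concrete, here is a toy, dependency-free sketch of the load, split, index, query flow. It is not how LlamaIndex is implemented: word overlap stands in for embedding similarity, the final LLM call is left out, and every name and sample document in it (`split_into_chunks`, `build_index`, `retrieve`, the two short texts) is made up for illustration.

```python
import re
from collections import Counter


def split_into_chunks(text: str, max_words: int = 40) -> list[str]:
    """Node-parser stand-in: cut text into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def tokenize(text: str) -> Counter:
    """Lowercased bag of words; a real index would store an embedding vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(docs: dict[str, str], max_words: int = 40) -> list[dict]:
    """Index stand-in: one entry per chunk, keeping the source file name."""
    index = []
    for name, text in docs.items():
        for chunk in split_into_chunks(text, max_words):
            index.append({"file_name": name, "text": chunk, "bag": tokenize(chunk)})
    return index


def retrieve(index: list[dict], question: str, top_k: int = 3) -> list[dict]:
    """Query-engine stand-in: rank chunks by the number of words shared with the question."""
    q = tokenize(question)
    scored = [(sum((entry["bag"] & q).values()), entry) for entry in index]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored[:top_k]]


# Made-up sample documents, keyed by file name.
docs = {
    "pump_manual.txt": "Replace the pump filter every 500 operating hours.",
    "contract.txt": "The warranty period is 24 months from delivery.",
}
index = build_index(docs)
hits = retrieve(index, "How long is the warranty period?", top_k=1)
print(hits[0]["file_name"])  # contract.txt
```

In the real stack, the retrieved chunks are passed to the LLM as context, which is what lets the answer stay grounded in your documents.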
When Should You Choose LlamaIndex + Ollama?
This solution is the right fit when you:
- Have sensitive internal documents (contracts, NDAs, customer data, trade secrets)
- Work in an environment without internet access or on an intranet
- Want to eliminate API costs when handling a high volume of daily queries
- Need the system to run on an on-premise server
If your documents aren’t sensitive and you just want to test something quickly — ChatGPT will be much faster. Don’t over-engineer.
Practical Deployment Guide
Step 1 — Install Ollama and Pull a Model
# Install Ollama (Linux/Mac)
curl -fsSL https://ollama.com/install.sh | sh
# Pull a lightweight model, suitable for machines without a powerful GPU
ollama pull llama3.2
# Pull a dedicated embedding model (better retrieval quality than reusing the chat model for embeddings)
ollama pull nomic-embed-text
# Verify the models are ready
ollama list
The Ollama server listens at http://localhost:11434 by default, and LlamaIndex will connect to this endpoint.
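Before moving on, it can help to confirm from Python that the server is reachable and that both models were pulled. The sketch below uses only the standard library and Ollama's `/api/tags` endpoint, which lists locally available models. The helper names `installed_models` and `missing_models` are my own, and the default URL assumes you have not changed Ollama's port.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address; adjust if you changed it


def installed_models(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> list[str]:
    """Return the model names reported by Ollama's /api/tags endpoint."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        payload = json.load(resp)
    return [model["name"] for model in payload.get("models", [])]


def missing_models(required: list[str], installed: list[str]) -> list[str]:
    """Names from `required` that are not installed.

    A bare name such as 'llama3.2' matches any tag ('llama3.2:latest');
    a tagged name such as 'llama3.2:1b' must match exactly.
    """
    def is_present(name: str) -> bool:
        if ":" in name:
            return name in installed
        return any(item.split(":")[0] == name for item in installed)

    return [name for name in required if not is_present(name)]


if __name__ == "__main__":
    try:
        names = installed_models()
    except (OSError, ValueError) as exc:
        # URLError and timeouts are OSError subclasses; ValueError covers bad JSON
        print(f"Ollama is not reachable at {OLLAMA_URL}: {exc}")
    else:
        print("Missing models:", missing_models(["llama3.2", "nomic-embed-text"], names) or "none")
```

If the script reports a missing model, rerun the matching `ollama pull` command from Step 1.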
Step 2 — Install LlamaIndex and Dependencies
pip install llama-index llama-index-llms-ollama llama-index-embeddings-ollama
# Parsers for PDF and Word files (the Word reader relies on docx2txt)
pip install llama-index-readers-file docx2txt pypdf
Step 3 — Load PDFs and Word Files, Build the Index
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.core import Settings
# Configure to use Ollama instead of OpenAI
Settings.llm = Ollama(model="llama3.2", request_timeout=120.0)
Settings.embed_model = OllamaEmbedding(model_name="nomic-embed-text")
# Place all PDF and Word files in the ./documents/ folder
# SimpleDirectoryReader auto-detects file formats
documents = SimpleDirectoryReader("./documents").load_data()
# Create a vector index from the documents
index = VectorStoreIndex.from_documents(documents)
# Run a test query
query_engine = index.as_query_engine()
response = query_engine.query("What are the warranty terms in contract number 2024-001?")
print(response)
Step 4 — Persist the Index to Avoid Rebuilding Every Time
Building the index for the first time can take a few minutes: 50 PDF files typically take 3–5 minutes on CPU. Save it once and load it instantly on subsequent runs:
from llama_index.core import StorageContext, load_index_from_storage
import os
PERSIST_DIR = "./storage"
if not os.path.exists(PERSIST_DIR):
    # First run: build and save to disk
    documents = SimpleDirectoryReader("./documents").load_data()
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    # Subsequent runs: load from disk, no reprocessing needed
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)
query_engine = index.as_query_engine()
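One caveat with the check above: it only asks whether the storage folder exists, so files added to ./documents later are not picked up until you delete ./storage by hand. A possible workaround, sketched here with the standard library only, is to store a fingerprint of the documents folder and rebuild when it changes. The helper names and the marker file path are hypothetical, not part of LlamaIndex.

```python
import hashlib
import json
from pathlib import Path


def folder_fingerprint(doc_dir: str) -> str:
    """Hash the relative path, size and modification time of every file under doc_dir."""
    digest = hashlib.sha256()
    root = Path(doc_dir)
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        stat = path.stat()
        record = f"{path.relative_to(root).as_posix()}|{stat.st_size}|{stat.st_mtime_ns}\n"
        digest.update(record.encode("utf-8"))
    return digest.hexdigest()


def index_is_stale(doc_dir: str, marker_path: str) -> bool:
    """True when no fingerprint was saved yet or the folder has changed since."""
    marker = Path(marker_path)
    if not marker.exists():
        return True
    saved = json.loads(marker.read_text(encoding="utf-8"))
    return saved.get("fingerprint") != folder_fingerprint(doc_dir)


def save_fingerprint(doc_dir: str, marker_path: str) -> None:
    """Call this right after the index has been rebuilt and persisted."""
    marker = Path(marker_path)
    marker.parent.mkdir(parents=True, exist_ok=True)
    payload = {"fingerprint": folder_fingerprint(doc_dir)}
    marker.write_text(json.dumps(payload), encoding="utf-8")
```

In the script above you would rebuild when `index_is_stale("./documents", "./index_fingerprint.json")` returns True, then call `save_fingerprint` with the same arguments after `persist`. Hashing sizes and timestamps rather than file contents keeps the check fast, at the cost of missing edits that preserve both.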
Step 5 — Load from an Internal Website (Intranet)
If you have an internal wiki, documentation on Confluence, or a website accessible only within your company network, SimpleWebPageReader can load those pages alongside your files. For more complex crawling scenarios involving JavaScript rendering or authentication, Crawl4AI is a more powerful alternative. The basic SimpleWebPageReader flow looks like this:
# Requires: pip install llama-index-readers-web
# Assumes Settings.llm and Settings.embed_model are configured as in Step 3
from llama_index.readers.web import SimpleWebPageReader
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
# Load from internal URLs (run this from within the internal network)
loader = SimpleWebPageReader(html_to_text=True)
web_docs = loader.load_data(urls=[
    "http://wiki.internal/sop/quy-trinh-xuat-hang",  # shipping procedure SOP
    "http://wiki.internal/sop/kiem-tra-chat-luong",  # quality inspection SOP
])
# Combine with file documents
file_docs = SimpleDirectoryReader("./documents").load_data()
all_docs = web_docs + file_docs
index = VectorStoreIndex.from_documents(all_docs)
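Since the URL list tends to grow by copy and paste, a small guard can keep a stray external link from being fetched and mixed into the index. This is a minimal sketch using only the standard library. The `INTERNAL_HOSTS` allowlist and the `split_internal_urls` helper are hypothetical, and `example.com` is just a placeholder for an external address.

```python
from urllib.parse import urlparse

# Hypothetical allowlist: replace with the hostnames of your own intranet.
INTERNAL_HOSTS = {"wiki.internal"}


def split_internal_urls(urls, allowed_hosts=INTERNAL_HOSTS):
    """Return (internal, rejected) lists based on each URL's hostname."""
    internal, rejected = [], []
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        (internal if host in allowed_hosts else rejected).append(url)
    return internal, rejected


urls = [
    "http://wiki.internal/sop/quy-trinh-xuat-hang",
    "https://example.com/public-page",  # placeholder external URL
]
internal, rejected = split_internal_urls(urls)
print("Will load:", internal)
print("Skipped:", rejected)
```

Only the `internal` list would then be passed to `loader.load_data(urls=...)`.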
Step 6 — A Simple Interface the Whole Team Can Use
# simple_qa.py
import sys
from llama_index.core import StorageContext, load_index_from_storage
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.core import Settings
Settings.llm = Ollama(model="llama3.2", request_timeout=120.0)
Settings.embed_model = OllamaEmbedding(model_name="nomic-embed-text")
query = " ".join(sys.argv[1:])
if not query:
    print("Usage: python simple_qa.py <your question>")
    sys.exit(1)
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query(query)
print(f"\nAnswer:\n{response}\n")
print("Sources:", [n.metadata.get("file_name", "web") for n in response.source_nodes])
Run it from a terminal:
python simple_qa.py "When does the contract with client ABC expire?"
The source_nodes field tells you which document each part of the answer was drawn from. For PDFs, the node metadata typically also carries a page_label, so you can report the page as well. This is one of RAG’s core strengths: you can verify the source immediately rather than blindly trusting the AI. A standalone LLM cannot point you to the passage it relied on.
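The one-line source list in simple_qa.py repeats a file name when several chunks come from the same document. A slightly friendlier version might de-duplicate the entries and append the page where one is available. The sketch below works on plain metadata dictionaries so it runs without LlamaIndex. `format_sources` is a hypothetical helper, the sample values are invented, and `page_label` is the key I would expect the PDF reader to set, so check your own metadata before relying on it.

```python
def format_sources(metadata_list):
    """Turn node metadata dicts into unique 'file (p. N)' strings, keeping order."""
    seen, lines = set(), []
    for meta in metadata_list:
        name = meta.get("file_name", "web")
        page = meta.get("page_label")  # assumed key; absent for Word files and web pages
        label = f"{name} (p. {page})" if page else name
        if label not in seen:
            seen.add(label)
            lines.append(label)
    return lines


# Stand-in for [n.metadata for n in response.source_nodes]; values are made up.
example = [
    {"file_name": "contract_abc.pdf", "page_label": "7"},
    {"file_name": "contract_abc.pdf", "page_label": "7"},
    {"file_name": "sop.docx"},
    {},
]
print(format_sources(example))  # ['contract_abc.pdf (p. 7)', 'sop.docx', 'web']
```

In simple_qa.py, the last line would become `print("Sources:", format_sources([n.metadata for n in response.source_nodes]))`.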
Production Gotchas That Are Easy to Miss
These two points are the most commonly overlooked — yet they have the biggest impact on result quality:
Chunk size determines accuracy. LlamaIndex defaults to 1024 tokens per chunk. For dense technical documents — equipment manuals, spec sheets, operating procedures — a chunk size of 512 tokens with an overlap of 50 often delivers a noticeable improvement in accuracy:
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import Settings
Settings.node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=50)
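To see what these two parameters do, here is a toy sliding-window splitter. It is only an illustration: it counts words instead of tokens and ignores sentence boundaries, which SentenceSplitter tries to respect, and `sliding_chunks` is a made-up name.

```python
def sliding_chunks(words, chunk_size, chunk_overlap):
    """Split a list of words into windows of chunk_size that share chunk_overlap words."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


words = [f"w{i}" for i in range(10)]
for chunk in sliding_chunks(words, chunk_size=4, chunk_overlap=1):
    print(chunk)
# ['w0', 'w1', 'w2', 'w3']
# ['w3', 'w4', 'w5', 'w6']
# ['w6', 'w7', 'w8', 'w9']
```

Each chunk repeats the tail of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk. Smaller chunks make each retrieved passage more specific, while the overlap limits how much context is lost at the cut points.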
Scanned PDFs (image-based) will not be readable without OCR. LlamaIndex only reads the text layer. Image-based PDFs — common with invoices, older contracts, and documents that were printed and scanned — need to go through OCR first. Tesseract (open source, free) or IBM’s Docling can both handle this.
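A quick way to find out which files need OCR is to check how much text each page actually yields. The sketch below separates the decision logic, which has no dependencies, from the extraction step, which uses pypdf's `PdfReader` and `extract_text()`. The function names and the 20-character threshold are arbitrary assumptions to tune for your own documents.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return 1-based page numbers whose extracted text is shorter than min_chars."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len((text or "").strip()) < min_chars
    ]


def extract_page_texts(pdf_path):
    """Extract the text layer of every page with pypdf (installed in Step 2)."""
    from pypdf import PdfReader  # imported lazily so the helper above has no dependencies

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


# Made-up example: page 2 has no text layer, page 3 yields only a stray character.
sample = ["Section 1. Warranty terms and conditions apply.", "", " x "]
print(pages_needing_ocr(sample))  # [2, 3]
```

For a real file you would call `pages_needing_ocr(extract_page_texts("./documents/some_file.pdf"))`, where the path is a placeholder, and send any file with flagged pages through OCR before indexing.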
Quick Comparison of All Three Options
| Criteria | ChatGPT Upload | LangChain + Cloud LLM | LlamaIndex + Ollama |
|---|---|---|---|
| Data Security | Sent to cloud | Queries and retrieved chunks sent to cloud | 100% local |
| Operating Cost | Subscription | Pay per token | No API fees (hardware cost only) |
| Setup Difficulty | None needed | Moderate | Moderate |
| Answer Quality | Highest | High | Model-dependent |
| Works Offline | No | No | Yes |
The system I’m currently running handles around 200 files: PDF manuals, contracts, and internal SOPs. Response time with llama3.2 running on CPU only is roughly 8–15 seconds per query. That is slower than ChatGPT, but nobody on the team has to worry about company data sitting on someone else’s server. Moving to a GPU machine cuts that latency significantly and makes a higher-quality local model such as DeepSeek-R1 practical.
The internal document problem shows up at almost every company — from factories that need to look up equipment manuals on the shop floor, to offices that need to sift through clauses across hundreds of contracts. If data security is a non-negotiable constraint, LlamaIndex + Ollama is the first stack worth trying — or AnythingLLM if you prefer a ready-made web UI over writing code.

