LlamaIndex + Ollama: Building a Fully Offline Internal Document Q&A System

Artificial Intelligence tutorial - IT technology blog

The Real Problem: When Internal Documents Can’t Go to the Cloud

My company once had a mountain of technical documentation: dozens of PDF and Word files covering operating procedures, contracts, and equipment manuals. Every time someone needed to look something up, they’d open file after file, hit Ctrl+F, and flip back and forth. A simple question could eat up an entire hour.

My first instinct was to upload everything to ChatGPT and ask away. But the moment I dragged a file into the browser, our IT manager stepped in and shut it down — the documents contained customer information and confidential contracts that couldn’t be sent to OpenAI’s servers.

That’s when I started exploring LlamaIndex combined with Ollama — running AI entirely on our internal server, with not a single byte of data leaving the company.

Three Common Ways to “Talk” to Your Documents

Before getting into the technical details, I want to give you an honest comparison of three approaches so you can pick the right tool for your situation.

Option 1 — Upload Directly to ChatGPT/Claude

The simplest approach: drag a PDF into ChatGPT and start asking. No setup required, works immediately.

  • Pros: No coding needed, instant setup, high answer quality
  • Cons: Data is sent to OpenAI/Anthropic servers. File size limits apply. You have to re-upload every session. Not suitable for sensitive documents.

Option 2 — LangChain RAG with a Cloud LLM

Build a RAG pipeline with LangChain using the OpenAI API as the LLM. This is the approach many developers are already familiar with; a unified proxy like LiteLLM can simplify switching between OpenAI, Anthropic, and other providers in this setup.

  • Pros: Flexible, broad ecosystem, plenty of documentation
  • Cons: Your queries (and the relevant document chunks) are still sent to the cloud. API costs accrue per token. Requires an internet connection.

Option 3 — LlamaIndex + Ollama (Fully Offline)

LlamaIndex is purpose-built for document indexing and querying. Paired with Ollama running a local LLM, the entire pipeline runs 100% on your own machine.

  • Pros: No data leaves your environment. No API costs. Works without internet after initial setup.
  • Cons: Requires capable hardware. Local model quality generally trails GPT-4.

How Does LlamaIndex Differ from LangChain?

LangChain is a general-purpose framework — it handles agents, chains, tool calling, RAG, and more. But trying to do everything means its API changes constantly and can be more complex than necessary for a pure document Q&A use case.

LlamaIndex was designed from day one to solve one problem: connecting your data to an LLM. Its concepts are cleaner and more focused:

  • Document Loaders: Dozens of built-in connectors for PDF, Word, CSV, Notion, Google Docs, and more
  • Node Parser: Splits documents into chunks with fine-grained control
  • Index: Multiple index types (VectorStoreIndex, SummaryIndex, KnowledgeGraphIndex…)
  • Query Engine: Handles questions and generates answers grounded in your documents

For a pure document Q&A task, LlamaIndex typically requires significantly less code. The example in Step 3 below is around 15 lines; the equivalent in LangChain usually takes about twice as many, not counting the effort of tracking which APIs were recently deprecated.

When Should You Choose LlamaIndex + Ollama?

This solution is the right fit when you:

  • Have sensitive internal documents (contracts, NDAs, customer data, trade secrets)
  • Work in an environment without internet access or on an intranet
  • Want to eliminate API costs when handling a high volume of daily queries
  • Need the system to run on an on-premise server

If your documents aren’t sensitive and you just want to test something quickly — ChatGPT will be much faster. Don’t over-engineer.

Practical Deployment Guide

Step 1 — Install Ollama and Pull a Model

# Install Ollama (Linux/Mac)
curl -fsSL https://ollama.com/install.sh | sh

# Pull a lightweight model, suitable for machines without a powerful GPU
ollama pull llama3.2

# Pull a dedicated embedding model (better retrieval quality than reusing the chat model for embeddings)
ollama pull nomic-embed-text

# Verify the models are ready
ollama list

The Ollama server listens on http://localhost:11434 by default, and LlamaIndex will connect to this endpoint.
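
Before moving on, it can help to confirm that the server is reachable and that both models are present. The following is a minimal sketch using only the Python standard library. It assumes Ollama's /api/tags endpoint (which lists pulled models) on the default address; the helper names are my own and are not part of Ollama or LlamaIndex.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default address, as used in this guide


def model_names(tags_payload: dict) -> list[str]:
    """Extract model names from the JSON returned by Ollama's /api/tags endpoint."""
    return [m["name"] for m in tags_payload.get("models", [])]


def missing_models(installed: list[str], required: list[str]) -> list[str]:
    """Return required models that are not installed (tags such as ':latest' are ignored)."""
    base_names = {name.split(":")[0] for name in installed}
    return [r for r in required if r.split(":")[0] not in base_names]


def fetch_installed_models(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> list[str]:
    """Ask the local Ollama server which models it has pulled."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return model_names(json.load(resp))


# Example (requires the Ollama server to be running):
# print(missing_models(fetch_installed_models(), ["llama3.2", "nomic-embed-text"]))
```

With the server running and both pulls from Step 1 finished, the commented example should print an empty list.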

Step 2 — Install LlamaIndex and Dependencies

pip install llama-index llama-index-llms-ollama llama-index-embeddings-ollama

# Parsers for PDF and Word files (the Word reader relies on docx2txt)
pip install llama-index-readers-file pypdf docx2txt

Step 3 — Load PDFs and Word Files, Build the Index

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.core import Settings

# Configure to use Ollama instead of OpenAI
Settings.llm = Ollama(model="llama3.2", request_timeout=120.0)
Settings.embed_model = OllamaEmbedding(model_name="nomic-embed-text")

# Place all PDF and Word files in the ./documents/ folder
# SimpleDirectoryReader auto-detects file formats
documents = SimpleDirectoryReader("./documents").load_data()

# Create a vector index from the documents
index = VectorStoreIndex.from_documents(documents)

# Run a test query
query_engine = index.as_query_engine()
response = query_engine.query("What are the warranty terms in contract number 2024-001?")
print(response)
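
If you're wondering what the vector index does when you call query(), here is a toy sketch of the retrieval idea: every chunk is stored as an embedding vector, the question is embedded the same way, and the chunks closest to it by cosine similarity are handed to the LLM as context. The three-dimensional vectors and chunk names below are made up for illustration; LlamaIndex's actual implementation is more involved.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 2) -> list[tuple[str, float]]:
    """Score every chunk against the query and keep the k most similar ones."""
    scored = [(chunk_id, cosine_similarity(query_vec, vec)) for chunk_id, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up 3-dimensional "embeddings"; real embedding vectors have hundreds of dimensions.
chunks = {
    "warranty_clause": [0.9, 0.1, 0.0],
    "shipping_sop": [0.1, 0.9, 0.1],
    "qa_checklist": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # pretend this is the embedded question about warranty terms

for chunk_id, score in top_k(query, chunks, k=2):
    print(f"{chunk_id}: {score:.3f}")
```

The k here plays the same role as the similarity_top_k argument that as_query_engine accepts: it controls how many chunks reach the LLM.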

Step 4 — Persist the Index to Avoid Rebuilding Every Time

Building the index for the first time can take a few minutes: 50 PDF files typically take 3–5 minutes on CPU. Save it once and load it instantly on subsequent runs:

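import os

from llama_index.core import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)

# Assumes Settings.llm and Settings.embed_model are configured as in Step 3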

PERSIST_DIR = "./storage"

if not os.path.exists(PERSIST_DIR):
    # First run: build and save to disk
    documents = SimpleDirectoryReader("./documents").load_data()
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    # Subsequent runs: load from disk, no reprocessing needed
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)

query_engine = index.as_query_engine()
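
One caveat: the check above only tests whether the storage folder exists, so files added to ./documents later are not picked up until you delete the folder and rebuild. A minimal sketch of one way to detect that, assuming a small fingerprint file kept next to the index (the file name and helper names are hypothetical, and this is not a LlamaIndex feature):

```python
import hashlib
import json
from pathlib import Path


def folder_fingerprint(doc_dir: str) -> str:
    """Hash the relative path, size, and modification time of every file under doc_dir."""
    digest = hashlib.sha256()
    root = Path(doc_dir)
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        stat = path.stat()
        entry = f"{path.relative_to(root).as_posix()}|{stat.st_size}|{stat.st_mtime_ns}\n"
        digest.update(entry.encode("utf-8"))
    return digest.hexdigest()


def index_is_stale(doc_dir: str, fingerprint_file: str) -> bool:
    """True when no fingerprint was saved yet or the documents folder has changed since."""
    saved = Path(fingerprint_file)
    if not saved.exists():
        return True
    return json.loads(saved.read_text())["fingerprint"] != folder_fingerprint(doc_dir)


def save_fingerprint(doc_dir: str, fingerprint_file: str) -> None:
    """Record the current state of the documents folder; call this right after persisting the index."""
    target = Path(fingerprint_file)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps({"fingerprint": folder_fingerprint(doc_dir)}))
```

In the script above, you could rebuild when index_is_stale("./documents", "./index_fingerprint.json") returns True and call save_fingerprint with the same arguments after persist(). This rebuilds everything on any change; incremental updates are a separate topic.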

Step 5 — Load from an Internal Website (Intranet)

If you have an internal wiki, documentation on Confluence, or a website accessible only within your company network, SimpleWebPageReader can pull those pages into the same index (for more complex crawling scenarios involving JavaScript rendering or authentication, Crawl4AI is a more powerful alternative):

# Requires: pip install llama-index-readers-web
from llama_index.readers.web import SimpleWebPageReader
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load from internal URLs — run from within the internal network
loader = SimpleWebPageReader(html_to_text=True)
web_docs = loader.load_data(urls=[
    "http://wiki.internal/sop/quy-trinh-xuat-hang",
    "http://wiki.internal/sop/kiem-tra-chat-luong",
])

# Combine with file documents
file_docs = SimpleDirectoryReader("./documents").load_data()
all_docs = web_docs + file_docs

index = VectorStoreIndex.from_documents(all_docs)

Step 6 — A Simple Interface the Whole Team Can Use

# simple_qa.py
import sys
from llama_index.core import StorageContext, load_index_from_storage
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.core import Settings

Settings.llm = Ollama(model="llama3.2", request_timeout=120.0)
Settings.embed_model = OllamaEmbedding(model_name="nomic-embed-text")

query = " ".join(sys.argv[1:])
if not query:
    print("Usage: python simple_qa.py <your question>")
    sys.exit(1)

storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
query_engine = index.as_query_engine(similarity_top_k=3)

response = query_engine.query(query)
print(f"\nAnswer:\n{response}\n")
print("Sources:", [n.metadata.get("file_name", "web") for n in response.source_nodes])

Run it from the command line:

python simple_qa.py "When does the contract with client ABC expire?"

The source_nodes field tells you exactly which document and page the AI pulled its information from. This is one of RAG’s core strengths: you can verify the source immediately rather than blindly trusting the AI. A standalone LLM simply cannot do this.
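
To make those sources easier to read than a bare list of file names, a small formatter can include the page and the similarity score. This is a sketch under a few assumptions: each node exposes metadata and score the way the script above uses them, and the PDF reader stores the page under a page_label key. The stand-in objects and the file name are invented so the example runs without an index.

```python
from types import SimpleNamespace


def format_sources(source_nodes) -> list[str]:
    """Turn retrieved nodes into readable 'file, page N (similarity S)' strings."""
    lines = []
    for node in source_nodes:
        meta = node.metadata
        name = meta.get("file_name", "web")
        page = meta.get("page_label")  # expected for PDFs; absent for other sources
        location = f"{name}, page {page}" if page else name
        score = f"{node.score:.2f}" if node.score is not None else "n/a"
        lines.append(f"{location} (similarity {score})")
    return lines


# Stand-in objects; in simple_qa.py you would pass response.source_nodes instead.
fake_nodes = [
    SimpleNamespace(metadata={"file_name": "contract_2024-001.pdf", "page_label": "7"}, score=0.8123),
    SimpleNamespace(metadata={}, score=None),
]
for line in format_sources(fake_nodes):
    print(line)
```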

Production Gotchas That Are Easy to Miss

These two points are the most commonly overlooked — yet they have the biggest impact on result quality:

Chunk size determines accuracy. LlamaIndex defaults to 1024 tokens per chunk. For dense technical documents — equipment manuals, spec sheets, operating procedures — a chunk size of 512 tokens with an overlap of 50 often delivers a noticeable improvement in accuracy:

from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import Settings

# Set this before building the index; an index persisted earlier must be rebuilt to pick it up
Settings.node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=50)
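
To see what chunk size and overlap mean in practice, here is a simplified sliding-window sketch. It is not how SentenceSplitter works internally (the real splitter counts tokens with a tokenizer and tries to respect sentence boundaries); it splits a plain list of words, and the function name is hypothetical.

```python
def chunk_with_overlap(tokens: list[str], chunk_size: int, chunk_overlap: int) -> list[list[str]]:
    """Split tokens into windows of chunk_size, each sharing chunk_overlap tokens with the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the current window already reached the end of the text
    return chunks


words = [f"w{i}" for i in range(10)]
for chunk in chunk_with_overlap(words, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Each window starts with the last word of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.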

Scanned PDFs (image-based) will not be readable without OCR. LlamaIndex only reads the text layer. Image-based PDFs, which are common with invoices, older contracts, and documents that were printed and scanned, need to go through OCR first. Either Tesseract (open source, free) or IBM’s Docling can handle this.
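
A quick way to find out which files need OCR is to check how much text each page actually yields. The sketch below uses pypdf from Step 2 for extraction; the 20-character threshold and the helper names are assumptions of mine, so tune them for your documents.

```python
def pages_needing_ocr(page_texts: list[str], min_chars: int = 20) -> list[int]:
    """Return 1-based page numbers whose extracted text is empty or nearly empty."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len((text or "").strip()) < min_chars
    ]


def extract_page_texts(pdf_path: str) -> list[str]:
    """Read the text layer of each page with pypdf."""
    from pypdf import PdfReader  # imported lazily so the helper above works without pypdf

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


# Example: flag pages of one file before indexing it
# print(pages_needing_ocr(extract_page_texts("./documents/old_contract.pdf")))
```

If most pages of a file are flagged, it is probably a scan and should go through OCR before it is placed in the documents folder.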

Quick Comparison of All Three Options

Criteria         | ChatGPT Upload | LangChain + Cloud LLM | LlamaIndex + Ollama
Data Security    | Sent to cloud  | Queries sent to cloud | 100% local
Operating Cost   | Subscription   | Pay per token         | Free (hardware cost)
Setup Difficulty | None needed    | Moderate              | Moderate
Answer Quality   | Highest        | High                  | Model-dependent
Works Offline    | No             | No                    | Yes

The system I’m currently running handles around 200 files: PDF manuals, contracts, and internal SOPs. Response time with llama3.2 running on CPU only is roughly 8–15 seconds per query. That is slower than ChatGPT, but nobody on the team has to worry about company data sitting on someone else’s server. Moving to a GPU machine cuts that latency significantly and makes room for a higher-quality local model like DeepSeek-R1.
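
To get comparable numbers on your own hardware, a small timing helper is enough. This is a sketch: ask is any callable that takes a question string (for example query_engine.query from Step 4), and the function name is hypothetical.

```python
import statistics
import time


def time_queries(ask, questions: list[str]) -> dict[str, float]:
    """Run each question through ask() and report min, median, and max latency in seconds."""
    if not questions:
        raise ValueError("questions must not be empty")
    durations = []
    for question in questions:
        start = time.perf_counter()
        ask(question)
        durations.append(time.perf_counter() - start)
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }
```

Use a handful of realistic questions rather than one; the first query after starting Ollama is often slower because the model has to be loaded into memory.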

The internal document problem shows up at almost every company — from factories that need to look up equipment manuals on the shop floor, to offices that need to sift through clauses across hundreds of contracts. If data security is a non-negotiable constraint, LlamaIndex + Ollama is the first stack worth trying — or AnythingLLM if you prefer a ready-made web UI over writing code.
