Running LLMs Locally with Ollama: Comparing Approaches and a Practical Deployment Guide

Artificial Intelligence tutorial - IT technology blog

2 AM. The terminal showed 429 Too Many Requests. My OpenAI API key had just hit its rate limit, and the demo was at 8 AM. I sat staring at the screen and, for the first time, seriously considered running an LLM directly on my own machine.

After a few months of real-world use on both a laptop and a VPS, here’s what I’ve distilled — especially which approach suits which situation, and how to set up Ollama from scratch until you have a working API.

Comparing the 3 Most Popular Ways to Run LLMs Locally

Before deciding to use Ollama, I tried three different approaches. Each has its own trade-offs — none is universally the best.

llama.cpp — Pure C++ Engine, Maximum Performance

llama.cpp runs GGUF format models directly with no abstraction layer. Extremely lightweight, easy to compile with CUDA or Metal to fully leverage your GPU. The trade-off: you have to do everything yourself — compile the binary, download models from HuggingFace, write your own wrapper if you want a REST API.

# Run llama.cpp server — quite verbose, easy to mix up flags
./llama-server -m models/mistral-7b-q4_K_M.gguf \
  --ctx-size 4096 \
  --n-gpu-layers 35 \
  --host 0.0.0.0 \
  --port 8080

Great for ML research or when you need to squeeze every token/second out of your hardware. Not the right choice when you need to deploy quickly.
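That said, recent llama.cpp builds of llama-server also expose an OpenAI-compatible endpoint at /v1/chat/completions, so a minimal client needs no extra packages. A sketch using only the Python standard library, assuming the server is running with the flags above on localhost:8080:

```python
import json
import urllib.request

def build_chat_request(base_url: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for llama-server's OpenAI-compatible endpoint."""
    body = json.dumps({
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }).encode()
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("http://localhost:8080", "What is a GGUF file?")
# Uncomment once llama-server is actually running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```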

LM Studio — Beautiful GUI, but Stuck on the Desktop

LM Studio has a polished interface, a built-in HuggingFace browser, and even a local API server compatible with OpenAI. I use it to explore new models — a few clicks and it’s running, very convenient. But it’s a desktop app. You can’t SSH into it, can’t run it on a headless VPS, can’t containerize it. Need to integrate it into a CI pipeline or a shared server for the whole team? LM Studio is a dead end.

Ollama — CLI + API, Runs Anywhere

Ollama wraps llama.cpp underneath but exposes a Docker-like interface: pull a model, run a model, built-in REST API. Runs on macOS, Linux (including headless servers), and Windows. The key advantage is a consistent workflow from your dev laptop to a production server — nothing needs to change when switching environments.

Pros and Cons: Which One Should You Choose?

  • llama.cpp: Choose this when you need maximum performance, are doing ML research, or building a custom inference pipeline. Not suitable for beginners or when you need to ship fast.
  • LM Studio: Choose this when you just want to try out models on your personal machine and don’t need code integration. Stop there — don’t build production apps on it.
  • Ollama: Choose this for everything else — local dev, prototyping, shared team servers, offline environments. A solid balance between ease of use and flexibility.

I’ve used all three in real projects. Pragmatic takeaway: Ollama is sufficient for 90% of everyday developers. llama.cpp is only worth the extra effort when you genuinely need to squeeze every millisecond of inference time. LM Studio stays at the personal experimentation level.

Installing Ollama

Linux (VPS or server)

curl -fsSL https://ollama.com/install.sh | sh

The script auto-detects your GPU, installs CUDA drivers if needed, and sets up a systemd service automatically — the whole process takes about 2–3 minutes. Verify after installation:

# Check if the service is running
systemctl status ollama

# Ollama listens on port 11434
curl http://localhost:11434
# Output: Ollama is running
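The same health check is handy from code, for example as a startup guard in a script. A minimal sketch with the Python standard library, assuming the default port 11434:

```python
import urllib.request

def ollama_is_running(base_url: str = "http://localhost:11434") -> bool:
    """Return True if the Ollama server answers on its root endpoint.

    The root endpoint replies with the plain text "Ollama is running".
    """
    try:
        with urllib.request.urlopen(base_url, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        # Connection refused, timeout, DNS failure, etc.
        return False
```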

macOS

# Using Homebrew
brew install ollama

# Start the server (if not using the app tray)
ollama serve

Running Your First Model

Familiar syntax if you’ve used Docker before:

# Llama 3.2 3B — lightweight, good for machines without a powerful GPU (~2GB)
ollama run llama3.2

# Mistral 7B — good balance between quality and speed (~4GB)
ollama run mistral

# Qwen 2.5 7B — better support for Vietnamese and Japanese
ollama run qwen2.5:7b

# List downloaded models
ollama list

# Remove unused models (free up disk space)
ollama rm mistral

On the first run, Ollama downloads the model locally. Mistral 7B is about 4GB, Llama 3.2 3B is about 2GB. Subsequent runs load from cache — no extra download time.
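`ollama list` also has a native API equivalent: GET /api/tags returns the locally pulled models as JSON. A small sketch, split so the parsing is separate from the network call:

```python
import json
import urllib.request

def parse_model_names(payload: dict) -> list:
    """Extract model names from a /api/tags response body."""
    return [m["name"] for m in payload.get("models", [])]

def list_local_models(base_url: str = "http://localhost:11434") -> list:
    """Return the names of models already pulled, like `ollama list`."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return parse_model_names(json.load(resp))
```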

No GPU? No problem. 3B–7B models still run on CPU at around 5–15 tokens/second depending on your chip. Perfectly comfortable for development and testing — just not enough if you’re running production workloads with many concurrent users.

Choosing a Model Based on RAM

  • 4GB RAM: llama3.2:3b, phi3:mini
  • 8GB RAM: mistral:7b, llama3.1:8b, qwen2.5:7b
  • 16GB RAM: qwen2.5:14b, codestral:22b
  • 32GB+ RAM: llama3.3:70b (slow but doable)
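If you want to encode this sizing guide in a setup script, it reduces to picking the largest tier that fits. A sketch with a hypothetical helper (`suggest_model` and the tier table are mine, not part of Ollama):

```python
from typing import Optional

# Rough mapping from available RAM (GB) to the tiers listed above
MODEL_TIERS = [
    (32, "llama3.3:70b"),
    (16, "codestral:22b"),
    (8, "mistral:7b"),
    (4, "llama3.2:3b"),
]

def suggest_model(ram_gb: float) -> Optional[str]:
    """Pick the largest model tier that fits in the given RAM."""
    for min_gb, model in MODEL_TIERS:
        if ram_gb >= min_gb:
            return model
    return None

# On Linux, total RAM can be read via sysconf:
# ram_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1e9
```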

Integrating Ollama into Your Code via REST API

This is what I use most. Ollama exposes an OpenAI-compatible API — existing code barely needs any changes, just swap the base_url:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Any string works, Ollama doesn't validate it
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "user", "content": "Explain Docker networking in 3 sentences"}
    ]
)
print(response.choices[0].message.content)

Don’t want to install extra packages? Call the native API directly with curl:

curl http://localhost:11434/api/chat -d '{
  "model": "mistral",
  "messages": [
    {"role": "user", "content": "Write a bash script to check disk usage"}
  ],
  "stream": false
}'
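With "stream": true (the default), the native API returns newline-delimited JSON instead: one object per line, each carrying a fragment in message.content until a final object with "done": true. A sketch of reassembling the reply from those chunks:

```python
import json

def collect_stream(ndjson_lines) -> str:
    """Reassemble a full reply from Ollama's streaming NDJSON chunks."""
    parts = []
    for line in ndjson_lines:
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# In practice, ndjson_lines would be the response body read line by line,
# e.g. iterating over the file-like object urllib.request.urlopen returns.
```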

Practical Tips for Using Ollama

Expose to the Internal Network

By default, Ollama only binds to 127.0.0.1 — accessible only from that machine. To let your whole team share one server, add an environment variable via a systemd override:

# Create an override for the systemd service
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf << 'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
EOF

sudo systemctl daemon-reload && sudo systemctl restart ollama

Creating Custom Models with a Modelfile

Need an internal technical support chatbot with a fixed system prompt? Use a Modelfile — instead of pasting the same prompt into every request, bake it directly into the model:

# Create a Modelfile
cat > Modelfile << 'EOF'
FROM mistral

SYSTEM """
You are a Linux expert with 10 years of experience.
Only answer questions about Linux, shell scripting, and system administration.
Be concise and always include practical examples.
"""
EOF

# Build the custom model
ollama create linux-expert -f Modelfile

# Test
ollama run linux-expert "Which process is using port 8080?"

A single Modelfile can save dozens of lines of boilerplate code — especially when multiple team members need to work from the same base prompt.
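The custom model is addressable by name through the same API, so client code only has to change the model field; the baked-in system prompt is applied server-side. A sketch of the request payload, assuming the linux-expert model built above:

```python
import json
import urllib.request

def chat_payload(model: str, prompt: str) -> dict:
    """Build a non-streaming /api/chat payload for a given model."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

payload = chat_payload("linux-expert", "Which process is using port 8080?")
# req = urllib.request.Request(
#     "http://localhost:11434/api/chat",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["message"]["content"])
```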

When to Use Ollama vs. When You Still Need a Cloud API

Ollama is the right choice when:

  • Local development — no API costs, no rate limit surprises at 2 AM
  • Processing sensitive data — data never leaves your machine
  • Small teams needing a shared AI server — one VPS running Ollama, the whole team uses it
  • Offline environments and air-gapped networks

You should still use a cloud API when you need the most powerful models (GPT-4o level), complex multimodal tasks, or high production traffic that your hardware simply can’t handle.

Going back to that night — I got Ollama running with Mistral 7B by 3 AM, and the 8 AM demo ran flawlessly. Since then I’ve kept Ollama as a permanent fallback on my dev machine, and I no longer worry about cloud API rate limits derailing my work.
