Building Semantic Search with Python: When Computers Truly ‘Understand’ User Intent

Artificial Intelligence tutorial - IT technology blog

The Problem: Why Keyword Search Is No Longer Enough

If you’re building search features with LIKE '%keyword%' in traditional SQL or with Elasticsearch, you’ve likely encountered a common headache: users don’t always enter the exact keywords stored in your database. In practice, a large share of searches fail simply because users phrase queries differently or use synonyms.

For example, if your database contains “Guide to cooking beef pho” and a user types “how to make traditional Vietnamese food,” traditional systems often return zero results. This is where Semantic Search comes in. This technique helps computers understand the meaning behind words rather than just matching characters.

Quick Start: Build a Smart Search System in 5 Minutes

To get started, we’ll use the sentence-transformers library. This powerful toolkit, built on top of Hugging Face Transformers, makes Natural Language Processing (NLP) simpler than ever.

1. Install the Library

Simply open your terminal and run the command to install the necessary packages:

pip install sentence-transformers torch

2. Practical Search Script

Here is an optimized code snippet you can run immediately to verify the results:

from sentence_transformers import SentenceTransformer, util

# 1. Load a multilingual model (downloaded automatically on first run)
# This model handles Vietnamese well and is lightweight
model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

# 2. Sample dataset
documents = [
    "Traditional beef pho cooking guide at home",
    "Basic Python programming for beginners",
    "The weather in Hanoi today is very beautiful",
    "Guide to using AI libraries in Python",
    "What is the most delicious dish in the world?"
]

# 3. Convert database to Vector form (Embedding)
document_embeddings = model.encode(documents)

# 4. User query
query = "I want to learn AI coding with Python"
query_embedding = model.encode(query)

# 5. Search for the most relevant results using Cosine Similarity
hits = util.semantic_search(query_embedding, document_embeddings, top_k=2)

print(f"Query: {query}")
for hit in hits[0]:
    print(f"- Result: {documents[hit['corpus_id']]} (Score: {hit['score']:.4f})")

The results might surprise you. Even though the phrase “AI coding” doesn’t appear in the database, the system understands it is most relevant to “Guide to using AI libraries in Python.” That is the power of Embeddings.

Technical Explanation: The Math Behind the Meaning

The Concept of Vector Embedding

Imagine each sentence as a coordinate in a multi-dimensional space. Sentences with similar meanings will be close to each other, while unrelated ones will be pushed further apart. The Sentence-Transformers model acts as a translator, turning character strings into an array of floating-point numbers (Vectors) that represent their meaning.
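To make this concrete, here is a toy illustration with hand-picked 2D vectors (not real model output, which has hundreds of dimensions): sentences about the same topic sit close together, and unrelated ones sit far apart.

```python
import math

# Hand-crafted 2D "embeddings" for illustration only
embeddings = {
    "beef pho recipe":         (0.9, 0.1),
    "how to cook noodle soup": (0.8, 0.2),
    "python tutorial":         (0.1, 0.9),
}

# Euclidean distance between two points in the vector space
d_similar = math.dist(embeddings["beef pho recipe"],
                      embeddings["how to cook noodle soup"])
d_unrelated = math.dist(embeddings["beef pho recipe"],
                        embeddings["python tutorial"])

print(d_similar < d_unrelated)  # True: the two cooking sentences are closer
```

A real model does exactly this, just in 384 or more dimensions instead of 2.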

The Cosine Similarity Mechanism

Once you have the coordinates, searching is essentially calculating the angle between two vectors. The smaller the angle (and the closer the Cosine Similarity is to 1), the more similar the sentences are in content. All this complex math is handled elegantly by the library in a single line of code.
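The formula itself is short enough to write by hand. A minimal NumPy sketch of what util.semantic_search computes under the hood:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0])
b = np.array([1.0, 0.0])
c = np.array([0.0, 1.0])

print(cosine_similarity(a, b))  # 1.0 -- identical direction, same meaning
print(cosine_similarity(a, c))  # 0.0 -- orthogonal, unrelated
```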

Why Choose a Multilingual Model?

Practical experience shows that standard models like all-MiniLM-L6-v2 are typically only proficient in English. In contrast, the paraphrase-multilingual-MiniLM-L12-v2 model used above was trained on dozens of languages. As a result, it understands Vietnamese vocabulary relationships very naturally.

Advanced: Handling Million-Row Datasets

The code above runs very fast with a few dozen rows of data. However, if your database reaches millions of records, comparing the query vector against every stored vector on each search will quickly overload your server.

For optimization, I often combine this with FAISS (Facebook AI Similarity Search). This library can search millions of vectors in a few milliseconds, especially once you switch from exact search to an approximate index such as IVF or HNSW.

# Installation: pip install faiss-cpu
import faiss

d_size = document_embeddings.shape[1]  # embedding dimension
index = faiss.IndexFlatL2(d_size)      # exact (brute-force) L2 index
index.add(document_embeddings)         # FAISS expects float32 arrays

# Retrieve the 2 nearest results: D = squared distances, I = row indices
D, I = index.search(query_embedding.reshape(1, -1), 2)
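One subtlety worth knowing: IndexFlatL2 ranks by Euclidean distance, not cosine similarity. If you L2-normalize your vectors first, the two rankings coincide, because for unit vectors ||a - b||² = 2 − 2·cos(a, b). A small NumPy sketch of that equivalence (using random vectors, not the embeddings above):

```python
import numpy as np

rng = np.random.default_rng(0)
vecs = rng.normal(size=(5, 4)).astype("float32")
query = rng.normal(size=4).astype("float32")

def normalize(x):
    # Scale each vector to unit length
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

vecs_n, query_n = normalize(vecs), normalize(query)

# Ranking by cosine similarity (descending)...
by_cosine = np.argsort(-(vecs_n @ query_n))
# ...matches ranking by squared L2 distance (ascending) on unit vectors
by_l2 = np.argsort(((vecs_n - query_n) ** 2).sum(axis=1))

print(np.array_equal(by_cosine, by_l2))  # True
```

In FAISS this means: normalize your embeddings, then either IndexFlatL2 or the inner-product index IndexFlatIP will give you cosine-ordered results.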

Practical Implementation Tips

  • Use Hybrid Search: Don’t rush to discard keyword search. The most optimal systems usually combine both: using BM25 for exact matches and Semantic Search to provide contextually relevant results.
  • Speed up with Pre-encoding: Calculating vectors is CPU-intensive. You should pre-calculate Embeddings when inserting data into your database and store them. Don’t wait until a user searches to start encoding.
  • Clean your data (Preprocessing): Remove HTML tags, convert to lowercase, and handle whitespace before feeding data into the model. The cleaner the data, the more accurate the vector representation.
  • Balance Performance: MiniLM models are typically 5-10 times faster than BERT-base models while retaining most of their retrieval quality. This makes them a strong default choice for real-world projects.
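The hybrid-search idea from the first tip can be sketched in a few lines. This is a toy illustration only: the keyword score is simple word overlap standing in for BM25, the semantic scores are hard-coded stand-ins for model output, and the blending weight alpha is an assumption you would tune.

```python
# Toy hybrid-search sketch: blend a keyword score with a semantic score.
# In a real system the keyword score would come from BM25 (e.g. Elasticsearch)
# and the semantic score from embeddings; both are mocked here.

def keyword_score(query: str, doc: str) -> float:
    # Fraction of query words that appear in the document (BM25 stand-in)
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def hybrid_rank(query, docs, semantic_scores, alpha=0.5):
    # Weighted sum of semantic and keyword scores; alpha is a tunable assumption
    scored = [
        (alpha * semantic_scores[i] + (1 - alpha) * keyword_score(query, doc), doc)
        for i, doc in enumerate(docs)
    ]
    return sorted(scored, reverse=True)

docs = ["guide to AI libraries in Python", "beef pho recipe"]
semantic = [0.82, 0.10]  # pretend these came from the embedding model

for score, doc in hybrid_rank("python AI guide", docs, semantic):
    print(f"{score:.2f}  {doc}")
```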

I hope these insights help you confidently integrate a “brain” into your project’s search features. Semantic Search is no longer the exclusive domain of tech giants; it is now a tool within reach for every Python developer.
