LLM App Development #7: Embeddings and Vector Search

Thursday, June 4, 2026

In Part 6 we connected Claude to external functions. Another request comes up all the time: “answer based on our company documents.” Claude has never seen those documents, so we first have to find the ones related to the question and hand them over. The core technique for finding related documents is embeddings and vector search. In this post we build that foundation, and in the next one we will complete it as RAG.

What an embedding is

An embedding turns text into a list of numbers, that is, a vector. It is not a plain conversion but one designed so that text with similar meaning becomes a similar vector. For example, “puppy” and “dog” become nearby vectors, while “puppy” and “stock market” become far-apart vectors.

Thanks to this property, measuring the distance between vectors measures the similarity of meaning. Even when keywords do not exactly overlap, searching for “how to get a refund” can find a document about “the payment cancellation process.” It searches by meaning, not by words.
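
To see why exact keyword matching misses this pair, here is a minimal sketch in plain Python. The `keyword_overlap` helper is hypothetical (it is not part of any library) and uses a deliberately naive whitespace tokenizer.

```python
def keyword_overlap(query: str, document: str) -> set[str]:
    """Return the words that appear in both texts (naive keyword matching)."""
    query_words = set(query.lower().split())
    document_words = set(document.lower().split())
    return query_words & document_words


query = "how to get a refund"
document = "the payment cancellation process"

shared = keyword_overlap(query, document)
print(shared)  # set(): no word in common, so a keyword search finds nothing
```

The two texts share no words at all, so any match between them has to come from meaning, which is what embeddings provide.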

Note

Claude is a text-generation model and does not produce embeddings. Embeddings come from a dedicated embedding model. The example below uses sentence-transformers, which runs locally and needs no API key. Real services sometimes use a high-quality hosted embedding API (such as Voyage AI). Either way, the usage is the same: put in text, get out a vector.

Turning text into vectors

Let me turn a few sentences into vectors with sentence-transformers.

embed.py

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

texts = [
    "How to walk a dog",
    "The importance of exercise for dogs",
    "Today's stock market trends",
]

vectors = model.encode(texts)
print(vectors.shape)  # (3, 384) — 3 sentences, each a 384-dimensional vector

Each sentence became a vector of 384 numbers. The number of dimensions (384 here) differs by model, but every vector made by the same model has the same length, so you can measure distances between them.

Measuring similarity with vectors

How similar two vectors are is usually measured by cosine similarity, which ranges from -1 to 1. The closer to 1, the more similar the meaning; a value near 0 means the texts are unrelated.

similarity.py

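import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
# Same three sentences as embed.py, so this file runs on its own
vectors = model.encode([
    "How to walk a dog",
    "The importance of exercise for dogs",
    "Today's stock market trends",
])

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

dog_walk, dog_exercise, stocks = vectors

print(cosine_similarity(dog_walk, dog_exercise))  # high (both about dogs)
print(cosine_similarity(dog_walk, stocks))        # low (unrelated)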

“Walking a dog” and “exercise for dogs” share almost no words, yet the similarity comes out high, because the meanings are close. Meanwhile “walking a dog” and “stock trends” come out low. This is what differs from keyword search.
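
Cosine similarity depends only on the direction of two vectors, not on their length. The sketch below shows this with made-up 2-D vectors (they are not model output), written in plain Python so each step of the formula is visible.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


east = [1.0, 0.0]
also_east = [3.0, 0.0]  # same direction, different length
north = [0.0, 2.0]      # perpendicular to east
west = [-1.0, 0.0]      # opposite direction

print(cosine_similarity(east, also_east))  # 1.0
print(cosine_similarity(east, north))      # 0.0
print(cosine_similarity(east, west))       # -1.0
```

Real embeddings have hundreds of dimensions, but the reading is the same: same direction scores 1, unrelated directions score near 0.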

Building vector search

If you turn several documents into vectors in advance, then when a question arrives you can turn the question into a vector too and find the most similar document. This is vector search.

vector_search.py

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Refunds can be requested on My Page within 7 days of purchase.",
    "Delivery usually takes 2 to 3 days after ordering.",
    "Membership tier rises automatically based on cumulative spending.",
]
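# normalize_embeddings=True makes the unit-length assumption explicit
doc_vectors = model.encode(documents, normalize_embeddings=True)

def search(query: str, top_k: int = 1):
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ q  # for unit-length vectors, the dot product is the cosine similarity
    ranked = np.argsort(scores)[::-1][:top_k]  # indices of the top_k highest scores
    return [(documents[i], float(scores[i])) for i in ranked]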

print(search("How do I get my money back?"))
# the refund document comes out as the most similar

The question “How do I get my money back?” has no word “refund,” yet it finds the refund document whose meaning is closest. This is the basic principle of finding related documents in an LLM app.
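
The `search` function scores documents with a plain dot product, which only equals cosine similarity when the vectors have length 1. Here is a small sketch of why that matters, using made-up 2-D vectors and hypothetical helpers (`dot`, `normalize`, `cosine_similarity`) in plain Python.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to length 1 without changing its direction."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]


def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query = [1.0, 0.0]
close_but_short = [0.9, 0.1]  # nearly the same direction as the query
far_but_long = [5.0, 5.0]     # 45 degrees away, but a much longer vector

# The raw dot product rewards length: the less similar vector scores higher.
print(dot(query, close_but_short) < dot(query, far_but_long))  # True

# After normalization the dot product agrees with cosine similarity.
for doc in (close_but_short, far_but_long):
    unit_score = dot(normalize(query), normalize(doc))
    print(math.isclose(unit_score, cosine_similarity(query, doc)))  # True
```

With unnormalized vectors the ranking can flip, so it is worth confirming that your embedding model returns unit-length vectors before relying on the dot product.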

Choosing an embedding model

There are many embedding models, and a few criteria guide the choice.

Number of dimensions: the length of the vector. More dimensions can capture finer differences in meaning but increase storage and compute cost. The all-MiniLM-L6-v2 model used above is 384-dimensional, on the light and fast side.
Language: if you handle Korean documents, you need a model that handles Korean well, or a multilingual one. A model trained mostly on English may judge Korean similarity inaccurately.
Local vs. hosted: a local model is free and needs no API key, but its quality and speed are limited by the model size and hardware you can run. A hosted API costs money but typically offers higher quality and handles long documents better.

The key is to use one chosen model consistently. As the pitfalls below show, you must embed documents and the question with the same model for them to be comparable. So if you change the model, you have to rebuild all the stored document vectors too. Start light, and if quality is lacking move to a better model — keeping in mind that the move entails a full re-embedding.
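
One way to make this rule hard to break is to store the model name next to the vectors and check it before every search. The sketch below is only an illustration: `VectorIndex`, `needs_rebuild`, and the name "some-other-model" are hypothetical, and the vector values are placeholders rather than real model output.

```python
from dataclasses import dataclass


@dataclass
class VectorIndex:
    """Stored document vectors plus the name of the model that produced them."""
    model_name: str
    documents: list[str]
    vectors: list[list[float]]


def needs_rebuild(index: VectorIndex, query_model_name: str) -> bool:
    """True when the query would be embedded by a different model than the index."""
    return index.model_name != query_model_name


index = VectorIndex(
    model_name="all-MiniLM-L6-v2",
    documents=["Refunds can be requested on My Page within 7 days of purchase."],
    vectors=[[0.1, 0.2, 0.3]],  # placeholder numbers, not real model output
)

print(needs_rebuild(index, "all-MiniLM-L6-v2"))  # False: same model, safe to search
print(needs_rebuild(index, "some-other-model"))  # True: re-embed every document first
```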

Where people commonly trip up

The search model differs from the storage model: if the model that embedded the documents differs from the one embedding the question, the vectors cannot be compared. Both must come from the same model.
Confusion over the normalization assumption: the example above treats the dot product as similarity, which is only valid for unit-length (normalized) vectors. If your vectors are not normalized, normalize them first or use the cosine similarity formula directly.
Re-embedding every time: build document embeddings once and store them for reuse. Re-embedding the whole document set on every search is slow and expensive.
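
The last pitfall can be avoided with a small cache on disk. This is a minimal sketch under stated assumptions: `fake_embed` is a hypothetical stand-in for a real embedding call such as `model.encode`, `load_or_build` is an illustrative helper, and the vectors are stored as JSON only to keep the example dependency-free.

```python
import json
import tempfile
from pathlib import Path

embed_calls = 0  # counts how often the (expensive) embedding step runs


def fake_embed(texts: list[str]) -> list[list[float]]:
    """Hypothetical stand-in for a real embedding call such as model.encode."""
    global embed_calls
    embed_calls += 1
    return [[float(len(t)), float(t.count(" "))] for t in texts]


def load_or_build(cache_path: Path, documents: list[str]) -> list[list[float]]:
    """Embed the documents once, then reuse the saved vectors on later calls."""
    if cache_path.exists():
        return json.loads(cache_path.read_text())
    vectors = fake_embed(documents)
    cache_path.write_text(json.dumps(vectors))
    return vectors


documents = ["Refunds within 7 days.", "Delivery takes 2 to 3 days."]

with tempfile.TemporaryDirectory() as tmp:
    cache_path = Path(tmp) / "doc_vectors.json"
    first = load_or_build(cache_path, documents)   # embeds and writes the cache
    second = load_or_build(cache_path, documents)  # reads the cache, no embedding

print(embed_calls)      # 1
print(first == second)  # True
```

This sketch does not notice when the documents or the model change, so a real cache would also need a rule for when to rebuild.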

Wrapping up

In this post we covered embeddings and vector search, which find documents by meaning.

An embedding turns text into a vector such that similar meaning means a nearby vector.
Cosine similarity between vectors measures similarity of meaning.
Embedding documents in advance and finding the one closest to the question is vector search; as scale grows, you use a vector database.

Now you can find documents related to a question. In the next post, “LLM App Development #8: Building a RAG Pipeline,” we will hand these found documents to Claude and complete RAG, so it answers based on our documents.