RAG lets an AI answer questions from real documents rather than being limited to its internal knowledge.
An LLM has a knowledge cut-off date and knows nothing about your internal documents. As a result, it hallucinates or answers generically when asked about them. RAG (Retrieval-Augmented Generation) addresses this by retrieving relevant document excerpts at query time and injecting them into the model's context.
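To make "injecting excerpts into the model's context" concrete, here is a minimal sketch of how a prompt might be assembled once the excerpts have been retrieved. The Excerpt shape, the buildPrompt helper, the instruction wording and the sample HR text are all assumptions made up for illustration; real systems vary in how they format and cite context.

```typescript
// Hypothetical shape for a retrieved excerpt; not tied to any library.
interface Excerpt {
  source: string;
  content: string;
}

// Place the retrieved excerpts ahead of the user's question so the model
// can ground its answer in them.
function buildPrompt(question: string, excerpts: Excerpt[]): string {
  const context = excerpts
    .map((e, i) => `[${i + 1}] (${e.source}) ${e.content}`)
    .join("\n");
  return [
    "Answer using only the excerpts below. If they do not contain the answer, say so.",
    "",
    "Excerpts:",
    context,
    "",
    `Question: ${question}`,
  ].join("\n");
}

// Invented sample data, for illustration only.
const examplePrompt = buildPrompt("How many vacation days do new hires get?", [
  { source: "hr-handbook.md", content: "New hires receive 25 vacation days per year." },
]);
console.log(examplePrompt);
```

The resulting string is what would be sent to the model in place of the bare question. Numbering the excerpts is one possible choice that makes it easier to ask the model to cite its sources.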
An embedding is a vector representation of text in a high-dimensional space. Semantically close texts have close vectors. We encode each document and each user query as vectors, then search for the documents closest to the query.
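One common way to measure how "close" two vectors are is cosine similarity. The sketch below ranks two documents against a query using it. The 3-dimensional vectors and document ids are toy values invented for illustration; real embeddings usually have hundreds or thousands of dimensions and come from an embedding model.

```typescript
// Cosine similarity: near 1 when two vectors point the same way,
// 0 when they are orthogonal.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction; treat it as unrelated to everything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy vectors standing in for real embeddings (assumed values).
const queryVector = [0.9, 0.1, 0.0];
const candidates = [
  { id: "refund-policy", vector: [0.8, 0.2, 0.1] },
  { id: "office-map", vector: [0.0, 0.1, 0.9] },
];

// Score every candidate against the query and sort best-first.
const ranked = candidates
  .map((d) => ({ id: d.id, score: cosineSimilarity(queryVector, d.vector) }))
  .sort((x, y) => y.score - x.score);

console.log(ranked[0].id); // prints "refund-policy"
```

Cosine similarity ignores vector length and compares direction only, which is why it is a frequent default for text embeddings. Dot product and Euclidean distance are other options, depending on the embedding model.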
Vector databases such as Pinecone and Weaviate, or the pgvector extension for PostgreSQL, store millions of vectors and can run similarity searches in milliseconds, typically by using approximate nearest-neighbor indexes. This is the engine behind a RAG system's document memory.
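To show what such a store does conceptually, here is a brute-force, in-memory stand-in. It is not how Pinecone, Weaviate or pgvector are implemented: the InMemoryVectorStore class, its method names and the sample data are hypothetical, and a linear scan like this only works for small collections.

```typescript
interface StoredDoc {
  id: string;
  content: string;
  vector: number[];
}

// Scale a vector to unit length so that a dot product equals cosine similarity.
function normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return norm === 0 ? v : v.map((x) => x / norm);
}

// Hypothetical toy store: keeps everything in an array and scans it on each search.
class InMemoryVectorStore {
  private docs: StoredDoc[] = [];

  add(id: string, content: string, vector: number[]): void {
    // Normalize once at insert time so searches only need a dot product.
    this.docs.push({ id, content, vector: normalize(vector) });
  }

  search(vector: number[], topK: number): { id: string; content: string; score: number }[] {
    const q = normalize(vector);
    return this.docs
      .map((d) => ({
        id: d.id,
        content: d.content,
        score: d.vector.reduce((sum, x, i) => sum + x * q[i], 0),
      }))
      .sort((a, b) => b.score - a.score)
      .slice(0, topK);
  }
}

// Invented sample documents with toy 3-dimensional vectors.
const store = new InMemoryVectorStore();
store.add("a", "Refunds are processed within 14 days.", [0.8, 0.2, 0.1]);
store.add("b", "The cafeteria opens at 8 am.", [0.1, 0.9, 0.2]);
store.add("c", "Returns require a receipt.", [0.7, 0.3, 0.0]);

const hits = store.search([0.9, 0.1, 0.0], 2);
console.log(hits.map((h) => h.id)); // prints [ 'a', 'c' ]
```

Each search here touches every stored vector, so the cost grows linearly with the collection. A dedicated vector database avoids that scan with an index, trading a small amount of recall for speed.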
A simplified semantic search in a vector database:
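// createEmbedding and vectorDatabase are placeholders for your embedding
// client and vector store; they are not defined in this snippet.
async function searchDocuments(query: string) {
  // Embed the query with the same model used to embed the documents.
  const embedding = await createEmbedding(query);
  // Ask the store for the 5 vectors nearest to the query vector.
  const results = await vectorDatabase.search({
    vector: embedding,
    topK: 5,
  });
  // Return only the text of each matching document.
  return results.map((doc) => doc.content);
}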