How the Codebase Uses Vector Information to Answer Users in Chat
This document describes how the current codebase uses vector (embedding) information in the RAG (Retrieval-Augmented Generation) flow to answer customer chat messages, and provides the full call stack from the storefront widget to the final reply.
Overview
When a user sends a message in the chat widget:
- The user message is turned into a 1536-dimensional embedding vector via OpenRouter’s embeddings API.
- That query vector is used in a pgvector similarity search (cosine distance) over the shop’s stored knowledge chunks.
- The top-K most similar chunks (text + optional source URL) are passed as context into a system prompt for the LLM.
- The LLM (OpenRouter chat) generates a reply using only that context (and optional order info).
- The reply is returned to the widget, and the retrieved chunks are stored for audit (e.g. in `retrieved_chunks`).
So: vectors are used only for retrieval. They select which pre-indexed text chunks (products, collections, articles, website knowledge) are sent to the LLM; the LLM never sees raw vectors.
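Conceptually, the retrieval step boils down to ranking stored chunk vectors by cosine similarity to the query vector and keeping the top K. A minimal in-memory sketch of that idea (illustrative plain JavaScript, not the codebase's pgvector implementation; the real ranking happens inside Postgres):

```javascript
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Rank chunks by similarity to the query vector and keep the top K —
// the in-memory equivalent of the pgvector ORDER BY ... LIMIT query.
function topKChunks(queryVector, chunks, k) {
  return chunks
    .map((c) => ({ ...c, similarity: cosineSimilarity(queryVector, c.embedding) }))
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, k);
}
```

In production the same ranking is done by Postgres over 1536-dimensional vectors, so the full chunk set never has to be loaded into the app process.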
Data Model (Where Vectors Live)
- `embeddings`: one row per knowledge chunk. Columns include `chunk_id` (→ `knowledge_chunks.id`), `embedding` (`vector(1536)`, pgvector; not managed by Prisma, only via raw SQL), `model`, and `created_at`.
- `knowledge_chunks`: text chunks from products, collections, articles, or website knowledge. Each chunk can have one embedding.
- `knowledge_documents`: parent of the chunks; holds `source_url` (the storefront URL) so the AI can include links in replies.
Vectors are written during ingestion (product/collection/article/website sync) via app/lib/ingestion.server.js and read during chat via app/lib/vector-search.server.js (raw SQL with <=> cosine distance).
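Because the `embedding` column is not managed by Prisma, the query vector has to be serialized into a pgvector literal before it can be bound as `$1::vector` in the raw SQL. A plausible helper for that serialization (the codebase's actual helper is not shown in this document; `toVectorLiteral` is an illustrative name):

```javascript
// Serialize a JS number array into a pgvector literal like '[0.1,0.2,0.3]',
// suitable for binding as a parameter and casting with `$1::vector`.
function toVectorLiteral(vec) {
  if (!Array.isArray(vec) || vec.some((x) => typeof x !== 'number' || !Number.isFinite(x))) {
    throw new TypeError('expected an array of finite numbers');
  }
  return `[${vec.join(',')}]`;
}
```

The resulting string is what a call like `prisma.$queryRawUnsafe(sql, vectorStr, shopId, topK)` would pass as its first parameter.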
Full Call Stack (Vector-Related Path)
Below is the full call stack from the storefront to the reply, with file paths and function names. Steps that use or depend on vector information are marked.
```
1. Storefront (customer browser)
   └── extensions/appifire-chat/assets/chat-widget.js
       └── sendMessage()
           └── fetch(POST API_URL + '/api/chat', { shopDomain, visitorId, sessionId, message })

2. HTTP layer
   └── app/routes/api.chat.jsx
       └── action({ request })
           ├── Parse body (shopDomain, visitorId, sessionId, message, orderNameOrId)
           ├── Validate (shopDomain, message length ≤ 2000)
           ├── prisma.shop.findFirst({ where: { shopDomain, status: 'active' } })
           └── import("../lib/rag.server.js").then(({ generateChatReply }) =>
                 generateChatReply({ shop, visitorId, sessionId, message, orderNameOrId }))

3. RAG orchestrator [VECTOR USAGE STARTS]
   └── app/lib/rag.server.js
       └── generateChatReply({ shop, visitorId, sessionId, message, orderNameOrId })
           ├── checkReplyLimit(prisma, shop.id, shop.plan)
           ├── checkCreditBalance(prisma, shop)
           ├── Get or create ChatSession (prisma.chatSession.findFirst / create)
           ├── prisma.chatMessage.create({ sessionId, role: 'user', messageText: message })
           ├── Load history (prisma.chatMessage.findMany, last 10 messages)
           │
           ├── [VECTOR] getEmbeddingsBatched([message])            ← embed user message
           │   └── app/lib/embeddings.server.js
           │       └── getEmbeddingsBatched([message])
           │           └── getEmbeddingsWithMeta(batch)
           │               └── fetch(OPENROUTER_EMBEDDINGS_URL, { model, input: texts })
           │                   → returns { vectors: [queryVector], totalTokens, totalCost }
           │                   → queryVector = 1536-dim array used below
           │
           ├── prisma.openRouterCall.create({ endpoint: 'embeddings', ... })   // log embedding cost
           │
           ├── [VECTOR] searchSimilarChunks({ shopId, queryVector, topK: 5 })  ← vector search
           │   └── app/lib/vector-search.server.js
           │       └── searchSimilarChunks({ shopId, queryVector, topK })
           │           └── prisma.$queryRawUnsafe(`
           │                 SELECT kc.id, kc.chunk_text, kc.product_id, kd.source_url,
           │                        (1 - (e.embedding <=> $1::vector))::double precision AS similarity
           │                 FROM embeddings e
           │                 JOIN knowledge_chunks kc ON kc.id = e.chunk_id
           │                 JOIN knowledge_documents kd ON kd.id = kc.document_id
           │                 WHERE kc.shop_id = $2::uuid
           │                 ORDER BY e.embedding <=> $1::vector
           │                 LIMIT $3`, vectorStr, shopId, topK)
           │               → rows: { id, chunk_text, product_id, source_url, similarity }
           │
           ├── If chunks.length === 0 → return canned “not enough information” reply (no LLM call)
           ├── Optional: fetchOrderByShop(...) → orderContext
           │
           ├── buildPrompt(chunks, message, historyForPrompt, { orderContext })  ← use chunks as context
           │   └── app/lib/prompt-builder.server.js
           │       └── buildPrompt(chunks, message, history, opts)
           │           ├── context = chunks.map(c => `[Store Info i]\nLink: ${url}\n${c.chunk_text}`).join("\n\n")
           │           └── messages = [ system(systemPrompt + context + orderContext), ...history, user(message) ]
           │               → messages passed to LLM (no vectors, only text)
           │
           ├── callOpenRouterChat(promptMessages, shop.plan)
           │   └── app/lib/chat.server.js
           │       └── callOpenRouterChat(messages, plan)
           │           └── fetch(OpenRouter chat/completions, { model, messages, max_tokens, temperature })
           │               → { reply, model, usage, cost }
           │
           ├── prisma.chatMessage.create({ sessionId, role: 'assistant', messageText: reply, ... })
           ├── For each chunk: prisma.retrievedChunk.create({ chatMessageId, chunkId, similarityScore, rank })
           ├── prisma.openRouterCall.create({ endpoint: 'chat', ... })
           ├── deductCreditsForReply(prisma, shop.id, totalChargedCents)
           └── return { reply, sessionId }

4. Back to API route
   └── app/routes/api.chat.jsx
       └── return cors(Response.json({ reply: result.reply, sessionId: result.sessionId }))

5. Storefront
   └── chat-widget.js: sessionId = data.sessionId; addMessage(data.reply, 'bot')
```
Summary Table (Vector-Related Steps)
| Step | File | Function / Action | Role of vectors |
|---|---|---|---|
| 1 | extensions/appifire-chat/assets/chat-widget.js | sendMessage() | Sends message in POST body (no vectors). |
| 2 | app/routes/api.chat.jsx | action() | Validates input, loads shop, calls generateChatReply. |
| 3a | app/lib/rag.server.js | generateChatReply() | Orchestrates: embed → search → prompt → LLM → save. |
| 3b | app/lib/embeddings.server.js | getEmbeddingsBatched([message]) | Produces query vector (1536-d) via OpenRouter. |
| 3c | app/lib/vector-search.server.js | searchSimilarChunks({ shopId, queryVector, topK: 5 }) | Uses query vector in pgvector <=> search; returns top-5 chunks + similarity. |
| 3d | app/lib/prompt-builder.server.js | buildPrompt(chunks, message, history, opts) | Uses chunk text (and source_url) in system context; no vectors. |
| 3e | app/lib/chat.server.js | callOpenRouterChat(messages, plan) | Sends text-only messages to LLM; returns reply. |
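Step 3d's context assembly can be sketched as follows. This is a simplification: the exact system prompt and formatting in `prompt-builder.server.js` may differ, and the prompt text below is illustrative.

```javascript
// Build the LLM message list: retrieved chunk text (plus source links)
// goes into the system prompt; vectors never appear here.
function buildPrompt(chunks, message, history = [], { orderContext = '' } = {}) {
  const context = chunks
    .map((c, i) => `[Store Info ${i + 1}]\nLink: ${c.source_url || 'n/a'}\n${c.chunk_text}`)
    .join('\n\n');
  const systemPrompt =
    'You are a store assistant. Answer ONLY from the context below.\n\n' +
    context +
    (orderContext ? `\n\n${orderContext}` : '');
  return [
    { role: 'system', content: systemPrompt },
    ...history,
    { role: 'user', content: message },
  ];
}
```

Because `source_url` travels with each chunk, the model can cite storefront links in its reply without any extra lookup.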
Vector Search Detail
- Operator: `<=>` is cosine distance (pgvector). The code computes `(1 - (e.embedding <=> $1::vector))`, i.e. the cosine similarity, which lies in [-1, 1] (in practice usually close to [0, 1] for text embeddings).
- Scope: only chunks with `kc.shop_id = $2::uuid` are considered.
- Output: up to `topK` (default 5) rows: `id`, `chunk_text`, `product_id`, `source_url`, `similarity`. This is the only place the stored vectors in `embeddings.embedding` are read during chat.
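The relationship between the operator and the reported score can be spelled out numerically. A sketch of what `<=>` and `(1 - …)` compute, in plain JavaScript (pgvector does this inside Postgres; the functions here are only for illustration):

```javascript
// pgvector's `<=>` is cosine distance: 1 - cosine similarity.
function cosineDistance(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// What the SQL's (1 - (e.embedding <=> $1::vector)) reports to callers.
const similarity = (a, b) => 1 - cosineDistance(a, b);
```

Identical vectors give distance 0 (similarity 1), orthogonal vectors give distance 1 (similarity 0), and opposite vectors give distance 2 (similarity -1), which is why `ORDER BY e.embedding <=> $1::vector` ascending returns the closest chunks first.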
Where Vectors Are Created (Ingestion, Not Chat)
Vectors are not created in the chat path; they are created when knowledge is ingested:
- app/lib/ingestion.server.js: `ingestProduct`, `ingestCollection`, `ingestArticle`, `ingestWebsiteKnowledge`.
- Each builds text chunks, calls `getEmbeddingsBatched(chunkTexts)`, then inserts into `knowledge_chunks` and `embeddings` (raw SQL `INSERT ...` with `embedding = $vector::vector`).
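The ingestion side is therefore chunk → batch-embed → store. The chunking strategy below is purely illustrative (the actual splitting logic in `ingestion.server.js` is not described in this document); it only shows the general shape of producing the `chunkTexts` that get embedded:

```javascript
// Split text into chunks of at most maxLen characters, breaking on
// whitespace so words stay intact (illustrative; the real chunker may
// split on product fields, headings, or token counts instead).
function chunkText(text, maxLen = 800) {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  let current = '';
  for (const w of words) {
    if (current && current.length + 1 + w.length > maxLen) {
      chunks.push(current);
      current = w;
    } else {
      current = current ? `${current} ${w}` : w;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Each resulting chunk becomes one `knowledge_chunks` row, and each chunk's embedding becomes one `embeddings` row.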
So the full picture is: ingestion writes vectors for each chunk; at chat time we embed only the user message, run one vector search, and use the retrieved chunk texts (and URLs) as RAG context for the LLM.