How the Codebase Uses Vector Information to Answer Users in Chat
This document describes how the current codebase uses vector (embedding) information in the RAG (Retrieval-Augmented Generation) flow to answer customer chat messages, and provides the full call stack from the storefront widget to the final reply.
Overview
When a user sends a message in the chat widget:
- The user message is turned into a 1536-dimensional embedding vector via OpenRouter’s embeddings API.
- That query vector is used in a pgvector similarity search (cosine distance) over the shop’s stored knowledge chunks.
- The top-K most similar chunks (text + optional source URL) are passed as context into a system prompt for the LLM.
- The LLM (OpenRouter chat) generates a reply using only that context (and optional order info).
- The reply is returned to the widget, and the retrieved chunks are stored for audit (e.g. in `retrieved_chunks`).
So: vectors are used only for retrieval. They select which pre-indexed text chunks (products, collections, articles, website knowledge) are sent to the LLM; the LLM never sees raw vectors.
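Conceptually, the retrieval step boils down to ranking stored chunk vectors by cosine similarity to the query vector and keeping the top K. A minimal in-memory sketch of that idea (illustrative plain JavaScript, not the codebase's pgvector implementation; the real ranking happens inside Postgres):

```javascript
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Rank chunks by similarity to the query vector and keep the top K —
// the in-memory equivalent of the pgvector ORDER BY ... LIMIT query.
function topKChunks(queryVector, chunks, k) {
  return chunks
    .map((c) => ({ ...c, similarity: cosineSimilarity(queryVector, c.embedding) }))
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, k);
}
```

In production the same ranking is done by Postgres over 1536-dimensional vectors, so the full chunk set never has to be loaded into the app process.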
Data Model (Where Vectors Live)
- `embeddings`: one row per knowledge chunk. Columns include `chunk_id` (→ `knowledge_chunks.id`), `embedding` (`vector(1536)`, pgvector; not managed by Prisma, only via raw SQL), `model`, and `created_at`.
- `knowledge_chunks`: text chunks from products, collections, articles, or website knowledge. Each chunk can have one embedding.
- `knowledge_documents`: parent of the chunks; holds `source_url` (the storefront URL) so the AI can include links in replies.
Vectors are written during ingestion (product/collection/article/website sync) via app/lib/ingestion.server.js and read during chat via app/lib/vector-search.server.js (raw SQL with <=> cosine distance).
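Because the `embedding` column is not managed by Prisma, the query vector has to be serialized into a pgvector literal before it can be bound as `$1::vector` in the raw SQL. A plausible helper for that serialization (the codebase's actual helper is not shown in this document; `toVectorLiteral` is an illustrative name):

```javascript
// Serialize a JS number array into a pgvector literal like '[0.1,0.2,0.3]',
// suitable for binding as a parameter and casting with `$1::vector`.
function toVectorLiteral(vec) {
  if (!Array.isArray(vec) || vec.some((x) => typeof x !== 'number' || !Number.isFinite(x))) {
    throw new TypeError('expected an array of finite numbers');
  }
  return `[${vec.join(',')}]`;
}
```

The resulting string is what a call like `prisma.$queryRawUnsafe(sql, vectorStr, shopId, topK)` would pass as its first parameter.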
Full Call Stack (Vector-Related Path)
Below is the full call stack from the storefront to the reply, with file paths and function names. Steps that use or depend on vector information are marked.
```
1. Storefront (customer browser)
   └── extensions/appifire-chat/assets/chat-widget.js
       └── sendMessage()
           └── fetch(POST API_URL + '/api/chat', { shopDomain, visitorId, sessionId, message })

2. HTTP layer
   └── app/routes/api.chat.jsx
       └── action({ request })
           ├── Parse body (shopDomain, visitorId, sessionId, message, orderNameOrId)
           ├── Validate (shopDomain, message length ≤ 2000)
           ├── prisma.shop.findFirst({ where: { shopDomain, status: 'active' } })
           └── import("../lib/rag.server.js").then(({ generateChatReply }) =>
                 generateChatReply({ shop, visitorId, sessionId, message, orderNameOrId }))

3. RAG orchestrator [VECTOR USAGE STARTS]
   └── app/lib/rag.server.js
       └── generateChatReply({ shop, visitorId, sessionId, message, orderNameOrId })
           ├── checkReplyLimit(prisma, shop.id, shop.plan)
           ├── checkCreditBalance(prisma, shop)
           ├── Get or create ChatSession (prisma.chatSession.findFirst / create)
           ├── prisma.chatMessage.create({ sessionId, role: 'user', messageText: message })
           ├── Load history (prisma.chatMessage.findMany, last 10 messages)
           │
           ├── [VECTOR] getEmbeddingsBatched([message])            ← embed user message
           │   └── app/lib/embeddings.server.js
           │       └── getEmbeddingsBatched([message])
           │           └── getEmbeddingsWithMeta(batch)
           │               └── fetch(OPENROUTER_EMBEDDINGS_URL, { model, input: texts })
           │                   → returns { vectors: [queryVector], totalTokens, totalCost }
           │                   → queryVector = 1536-dim array used below
           │
           ├── prisma.openRouterCall.create({ endpoint: 'embeddings', ... })   // log embedding cost
           │
           ├── [VECTOR] searchSimilarChunks({ shopId, queryVector, topK: 5 })  ← vector search
           │   └── app/lib/vector-search.server.js
           │       └── searchSimilarChunks({ shopId, queryVector, topK })
           │           └── prisma.$queryRawUnsafe(`
           │                 SELECT kc.id, kc.chunk_text, kc.product_id, kd.source_url,
           │                        (1 - (e.embedding <=> $1::vector))::double precision AS similarity
           │                 FROM embeddings e
           │                 JOIN knowledge_chunks kc ON kc.id = e.chunk_id
           │                 JOIN knowledge_documents kd ON kd.id = kc.document_id
           │                 WHERE kc.shop_id = $2::uuid
           │                 ORDER BY e.embedding <=> $1::vector
           │                 LIMIT $3`, vectorStr, shopId, topK)
           │               → rows: { id, chunk_text, product_id, source_url, similarity }
           │
           ├── If chunks.length === 0 → return canned “not enough information” reply (no LLM call)
           ├── Optional: fetchOrderByShop(...) → orderContext
           │
           ├── buildPrompt(chunks, message, historyForPrompt, { orderContext })  ← use chunks as context
           │   └── app/lib/prompt-builder.server.js
           │       └── buildPrompt(chunks, message, history, opts)
           │           ├── context = chunks.map(c => `[Store Info i]\nLink: ${url}\n${c.chunk_text}`).join("\n\n")
           │           └── messages = [ system(systemPrompt + context + orderContext), ...history, user(message) ]
           │               → messages passed to LLM (no vectors, only text)
           │
           ├── callOpenRouterChat(promptMessages, shop.plan)
           │   └── app/lib/chat.server.js
           │       └── callOpenRouterChat(messages, plan)
           │           └── fetch(OpenRouter chat/completions, { model, messages, max_tokens, temperature })
           │               → { reply, model, usage, cost }
           │
           ├── prisma.chatMessage.create({ sessionId, role: 'assistant', messageText: reply, ... })
           ├── For each chunk: prisma.retrievedChunk.create({ chatMessageId, chunkId, similarityScore, rank })
           ├── prisma.openRouterCall.create({ endpoint: 'chat', ... })
           ├── deductCreditsForReply(prisma, shop.id, totalChargedCents)
           └── return { reply, sessionId }

4. Back to API route
   └── app/routes/api.chat.jsx
       └── return cors(Response.json({ reply: result.reply, sessionId: result.sessionId }))

5. Storefront
   └── chat-widget.js: sessionId = data.sessionId; addMessage(data.reply, 'bot')
```
Summary Table (Vector-Related Steps)
| Step | File | Function / Action | Role of vectors |
|---|---|---|---|
| 1 | extensions/appifire-chat/assets/chat-widget.js | sendMessage() | Sends message in POST body (no vectors). |
| 2 | app/routes/api.chat.jsx | action() | Validates input, loads shop, calls generateChatReply. |
| 3a | app/lib/rag.server.js | generateChatReply() | Orchestrates: embed → search → prompt → LLM → save. |
| 3b | app/lib/embeddings.server.js | getEmbeddingsBatched([message]) | Produces query vector (1536-d) via OpenRouter. |
| 3c | app/lib/vector-search.server.js | searchSimilarChunks({ shopId, queryVector, topK: 5 }) | Uses query vector in pgvector <=> search; returns top-5 chunks + similarity. |
| 3d | app/lib/prompt-builder.server.js | buildPrompt(chunks, message, history, opts) | Uses chunk text (and source_url) in system context; no vectors. |
| 3e | app/lib/chat.server.js | callOpenRouterChat(messages, plan) | Sends text-only messages to LLM; returns reply. |
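Step 3d's context assembly can be sketched as follows. This is a simplification: the exact system prompt and formatting in `prompt-builder.server.js` may differ, and the prompt text below is illustrative.

```javascript
// Build the LLM message list: retrieved chunk text (plus source links)
// goes into the system prompt; vectors never appear here.
function buildPrompt(chunks, message, history = [], { orderContext = '' } = {}) {
  const context = chunks
    .map((c, i) => `[Store Info ${i + 1}]\nLink: ${c.source_url || 'n/a'}\n${c.chunk_text}`)
    .join('\n\n');
  const systemPrompt =
    'You are a store assistant. Answer ONLY from the context below.\n\n' +
    context +
    (orderContext ? `\n\n${orderContext}` : '');
  return [
    { role: 'system', content: systemPrompt },
    ...history,
    { role: 'user', content: message },
  ];
}
```

Because `source_url` travels with each chunk, the model can cite storefront links in its reply without any extra lookup.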
Vector Search Detail
- Operator: `<=>` is cosine distance (pgvector). The code computes `(1 - (e.embedding <=> $1::vector))`, i.e. the cosine similarity, which lies in [-1, 1] (in practice usually close to [0, 1] for text embeddings).
- Scope: only chunks with `kc.shop_id = $2::uuid` are considered.
- Output: up to `topK` (default 5) rows: `id`, `chunk_text`, `product_id`, `source_url`, `similarity`. This is the only place the stored vectors in `embeddings.embedding` are read during chat.
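The relationship between the operator and the reported score can be spelled out numerically. A sketch of what `<=>` and `(1 - …)` compute, in plain JavaScript (pgvector does this inside Postgres; the functions here are only for illustration):

```javascript
// pgvector's `<=>` is cosine distance: 1 - cosine similarity.
function cosineDistance(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// What the SQL's (1 - (e.embedding <=> $1::vector)) reports to callers.
const similarity = (a, b) => 1 - cosineDistance(a, b);
```

Identical vectors give distance 0 (similarity 1), orthogonal vectors give distance 1 (similarity 0), and opposite vectors give distance 2 (similarity -1), which is why `ORDER BY e.embedding <=> $1::vector` ascending returns the closest chunks first.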
Where Vectors Are Created (Ingestion, Not Chat)
Vectors are not created in the chat path; they are created when knowledge is ingested:
- app/lib/ingestion.server.js: `ingestProduct`, `ingestCollection`, `ingestArticle`, `ingestWebsiteKnowledge`.
- Each builds text chunks, calls `getEmbeddingsBatched(chunkTexts)`, then inserts into `knowledge_chunks` and `embeddings` (raw SQL `INSERT ...` with `embedding = $vector::vector`).
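The ingestion side is therefore chunk → batch-embed → store. The chunking strategy below is purely illustrative (the actual splitting logic in `ingestion.server.js` is not described in this document); it only shows the general shape of producing the `chunkTexts` that get embedded:

```javascript
// Split text into chunks of at most maxLen characters, breaking on
// whitespace so words stay intact (illustrative; the real chunker may
// split on product fields, headings, or token counts instead).
function chunkText(text, maxLen = 800) {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  let current = '';
  for (const w of words) {
    if (current && current.length + 1 + w.length > maxLen) {
      chunks.push(current);
      current = w;
    } else {
      current = current ? `${current} ${w}` : w;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Each resulting chunk becomes one `knowledge_chunks` row, and each chunk's embedding becomes one `embeddings` row.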
So the full picture is: ingestion writes vectors for each chunk; at chat time we embed only the user message, run one vector search, and use the retrieved chunk texts (and URLs) as RAG context for the LLM.