# Option A — Phase 3: RAG & Chat API
This document is the detailed plan for Phase 3 of AppiFire AI Chat (Option A). Prerequisites: Phase 2 complete (products synced and embeddings stored in pgvector).
## Phase 3 objective

Build the `POST /api/chat` endpoint that: validates input, looks up the shop, enforces reply and spend limits, embeds the visitor's message, retrieves the most relevant product chunks from pgvector, calls OpenRouter for the AI reply, logs all AI costs to `openrouter_calls`, and returns the answer to the chat widget.
## RAG flow

```mermaid
flowchart TD
  A["POST /api/chat"] --> B["Validate input\n(length, required fields)"]
  B --> C["Look up shop\nby shopDomain"]
  C --> D["Check reply limit\n+ credit balance (paid)"]
  D -->|"over limit"| X["Return 429"]
  D -->|"ok"| E["Embed user message\n+ log embed cost to openrouter_calls"]
  E --> F["pgvector search\ntop-5 chunks by shop_id"]
  F -->|"0 results"| Y["Return canned fallback\n(no LLM call)"]
  F -->|"results"| G["Build prompt\nsystem + context + history"]
  G --> H["OpenRouter chat\nmodel by plan"]
  H --> I["Save chat_messages\nretrieved_chunks\nopenrouter_calls"]
  I --> J["Low-balance check\n(stub for Phase 5)"]
  J --> K["Return reply + sessionId"]
```

## 1. Create the chat route

### 1.1 Route file

Create `app/routes/api.chat.jsx` — a public endpoint accessible by the storefront widget.
Input validation (max message length of 2000 characters) is done here before any AI call so a very long message cannot inflate embedding costs.
The shop is loaded with an explicit `select` so all billing fields needed by `generateChatReply` are present. Fields added in Phase 4 (`dailySpendLimitCents`, `addOnCredits`, etc.) will automatically become available once the Phase 4 migration runs.
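The validation rules can be sketched as a small pure function. `validateChatInput` is an illustrative name; the route file below inlines these same checks:

```javascript
// Illustrative sketch of the input checks: required fields, then a length cap
// applied before any AI call so oversized messages never reach the embedder.
function validateChatInput({ shopDomain, message }) {
  if (!shopDomain || !message) {
    return { ok: false, status: 400, error: "shopDomain and message are required" };
  }
  if (message.length > 2000) {
    return { ok: false, status: 400, error: "Message too long (max 2000 chars)" };
  }
  return { ok: true };
}

console.log(validateChatInput({ shopDomain: "s.myshopify.com", message: "hi" }));
// → { ok: true }
console.log(validateChatInput({ shopDomain: "s.myshopify.com", message: "a".repeat(2001) }));
// → { ok: false, status: 400, error: "Message too long (max 2000 chars)" }
```

Note that exactly 2000 characters still passes; only 2001+ is rejected.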
```jsx
import prisma from "../db.server.js";
import { generateChatReply } from "../lib/rag.server.js";
import { cors } from "../lib/cors.server.js";

export const action = async ({ request }) => {
  // Allow storefront CORS (any *.myshopify.com or custom domain)
  if (request.method === "OPTIONS") {
    return cors(new Response(null, { status: 204 }), request);
  }

  if (request.method !== "POST") {
    return cors(new Response("Method not allowed", { status: 405 }), request);
  }

  const body = await request.json();
  const { shopDomain, visitorId, sessionId, message } = body;

  if (!shopDomain || !message) {
    return cors(
      new Response(JSON.stringify({ error: "shopDomain and message are required" }), { status: 400 }),
      request
    );
  }

  // Guard against very long messages — prevents runaway embedding costs
  if (message.length > 2000) {
    return cors(
      new Response(JSON.stringify({ error: "Message too long (max 2000 chars)" }), { status: 400 }),
      request
    );
  }

  // Load shop with all fields needed by generateChatReply.
  // dailySpendLimitCents, addOnCredits etc. are added by the Phase 4 migration.
  const shop = await prisma.shop.findFirst({
    where: { shopDomain, status: "active" },
    select: {
      id: true,
      shopDomain: true,
      plan: true,
      status: true,
      // Phase 4/5 billing fields (null if migration not yet run)
      addOnCredits: true,
      dailySpendLimitCents: true,
      emailAlertOnMinBalance: true,
      minBalanceForAlertReplies: true,
      lastLowBalanceAlertAt: true,
    },
  });

  if (!shop) {
    return cors(
      new Response(JSON.stringify({ error: "Shop not found" }), { status: 404 }),
      request
    );
  }

  try {
    const result = await generateChatReply({ shop, visitorId, sessionId, message });
    return cors(
      new Response(JSON.stringify({ reply: result.reply, sessionId: result.sessionId }), {
        status: 200,
        headers: { "Content-Type": "application/json" },
      }),
      request
    );
  } catch (err) {
    if (err.code === "REPLY_LIMIT_EXCEEDED") {
      return cors(
        new Response(
          JSON.stringify({ error: "Monthly reply limit reached. Please upgrade your plan." }),
          { status: 429 }
        ),
        request
      );
    }
    if (err.code === "DAILY_SPEND_LIMIT_EXCEEDED") {
      return cors(
        new Response(
          JSON.stringify({
            error:
              "Daily spend limit reached. The store owner can increase it in Settings or try again tomorrow.",
          }),
          { status: 429 }
        ),
        request
      );
    }
    console.error("[api/chat] error:", err);
    return cors(new Response(JSON.stringify({ error: "Something went wrong" }), { status: 500 }), request);
  }
};

// Allow GET for health check
export const loader = async ({ request }) => {
  if (request.method === "OPTIONS") return cors(new Response(null, { status: 204 }), request);
  return cors(new Response(JSON.stringify({ status: "ok" }), { status: 200 }), request);
};
```

### 1.2 CORS helper
Create `app/lib/cors.server.js`:

```js
export function cors(response, request) {
  const origin = request.headers.get("Origin") || "*";
  const headers = new Headers(response.headers);
  headers.set("Access-Control-Allow-Origin", origin);
  headers.set("Access-Control-Allow-Methods", "GET, POST, OPTIONS");
  headers.set("Access-Control-Allow-Headers", "Content-Type, Authorization");
  return new Response(response.body, { status: response.status, headers });
}
```
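A quick sanity check of the helper (same logic as above), using the WHATWG `Request`/`Response` globals available in Node 18+; the URLs and origin are illustrative:

```javascript
// Same helper as in cors.server.js: echoes the caller's Origin back in the
// Access-Control-Allow-Origin header, falling back to "*" when absent.
function cors(response, request) {
  const origin = request.headers.get("Origin") || "*";
  const headers = new Headers(response.headers);
  headers.set("Access-Control-Allow-Origin", origin);
  headers.set("Access-Control-Allow-Methods", "GET, POST, OPTIONS");
  headers.set("Access-Control-Allow-Headers", "Content-Type, Authorization");
  return new Response(response.body, { status: response.status, headers });
}

// Simulate a storefront preflight from a myshopify.com origin
const req = new Request("https://app.example.com/api/chat", {
  headers: { Origin: "https://my-store.myshopify.com" },
});
const res = cors(new Response(null, { status: 204 }), req);
console.log(res.status, res.headers.get("Access-Control-Allow-Origin"));
// → 204 https://my-store.myshopify.com
```

Echoing the request's `Origin` (rather than a fixed `*`) keeps the door open for later tightening to an allow-list of known shop domains.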
## 2. Reply limit and spend limit enforcement

Create `app/lib/limits.server.js` with two guards called at the start of `generateChatReply`.

```js
// Free: 50 replies/mo. Paid: 500 from subscription + add-on credits.
const PLAN_LIMITS = {
  free: 50,
  paid: 500, // base included; add addOnCredits for total allowance
};

export async function checkReplyLimit(prisma, shopId, plan, addOnCredits = 0) {
  const baseLimit = PLAN_LIMITS[plan] ?? 50;
  const limit = plan === "paid" ? baseLimit + addOnCredits : baseLimit;
  if (limit === -1) return; // -1 = unlimited

  const startOfMonth = new Date();
  startOfMonth.setDate(1);
  startOfMonth.setHours(0, 0, 0, 0);

  const count = await prisma.chatMessage.count({
    where: {
      role: "assistant",
      createdAt: { gte: startOfMonth },
      session: { shopId },
    },
  });

  if (count >= limit) {
    const err = new Error("Monthly reply limit exceeded");
    err.code = "REPLY_LIMIT_EXCEEDED";
    throw err;
  }
}

// Daily spend limit (from Settings → Credits & spending). Call before generating reply.
// estimatedCostCents: worst-case estimate for this reply (default ~$0.02).
export async function checkDailySpendLimit(prisma, shopId, dailySpendLimitCents, estimatedCostCents = 2) {
  if (dailySpendLimitCents == null || dailySpendLimitCents <= 0) return;

  const startOfToday = new Date();
  startOfToday.setHours(0, 0, 0, 0);

  const todaySum = await prisma.openRouterCall.aggregate({
    where: { shopId, createdAt: { gte: startOfToday } },
    _sum: { chargedCost: true },
  });

  const todayCents = Math.round((todaySum._sum.chargedCost ?? 0) * 100);
  if (todayCents + estimatedCostCents > dailySpendLimitCents) {
    const err = new Error("Daily spend limit reached. Try again tomorrow or increase the limit in Settings.");
    err.code = "DAILY_SPEND_LIMIT_EXCEEDED";
    throw err;
  }
}
```

Both functions are imported into `rag.server.js` and called at the very start of `generateChatReply`, before any AI calls.
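To see the cent-rounding arithmetic in isolation, here is the daily guard run against a stubbed Prisma client. The stub's `aggregate()` shape mirrors the call above; the spend values are illustrative:

```javascript
// Same guard as in limits.server.js: sums today's charged cost, converts
// dollars to rounded cents, and throws a coded error when the next reply
// would push the shop past its daily cap.
async function checkDailySpendLimit(prisma, shopId, dailySpendLimitCents, estimatedCostCents = 2) {
  if (dailySpendLimitCents == null || dailySpendLimitCents <= 0) return; // no limit configured

  const startOfToday = new Date();
  startOfToday.setHours(0, 0, 0, 0);

  const todaySum = await prisma.openRouterCall.aggregate({
    where: { shopId, createdAt: { gte: startOfToday } },
    _sum: { chargedCost: true },
  });

  const todayCents = Math.round((todaySum._sum.chargedCost ?? 0) * 100);
  if (todayCents + estimatedCostCents > dailySpendLimitCents) {
    const err = new Error("Daily spend limit reached.");
    err.code = "DAILY_SPEND_LIMIT_EXCEEDED";
    throw err;
  }
}

// Stub: the shop has already been charged $0.49 today.
const prismaStub = {
  openRouterCall: { aggregate: async () => ({ _sum: { chargedCost: 0.49 } }) },
};

checkDailySpendLimit(prismaStub, "shop-1", 100) // 49¢ + 2¢ ≤ 100¢ → passes
  .then(() => checkDailySpendLimit(prismaStub, "shop-1", 50)) // 49¢ + 2¢ > 50¢ → throws
  .catch((err) => console.log(err.code)); // → DAILY_SPEND_LIMIT_EXCEEDED
```

Because the check adds a worst-case `estimatedCostCents` before comparing, a shop sitting just under its cap is stopped before the reply is generated, not after.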
## 3. Embed the user query

Use `getEmbeddingsBatched` (from `app/lib/embeddings.server.js`) instead of `getEmbeddings`. `getEmbeddingsBatched` returns `{ vectors, totalTokens, totalCost }` so the embedding cost can be logged to `openrouter_calls` with `purpose: "Customer Chat"`, consistent with how product ingestion logs embeddings as `purpose: "Product Knowledge"`.

```js
// In rag.server.js — step 6 (see full code in section 7)
const { vectors: [queryVector], totalTokens: embedTokens, totalCost: embedCost } =
  await getEmbeddingsBatched([message]);
```

Immediately after, create an `openRouterCall` row for the query embedding (section 7 shows this in context).
## 4. pgvector similarity search

Create `app/lib/vector-search.server.js`:

```js
import prisma from "../db.server.js";

export async function searchSimilarChunks({ shopId, queryVector, topK = 5 }) {
  // Use raw SQL because Prisma does not support pgvector operators
  const rows = await prisma.$queryRaw`
    SELECT
      kc.id,
      kc.chunk_text,
      kc.product_id,
      1 - (e.embedding <=> ${JSON.stringify(queryVector)}::vector) AS similarity
    FROM embeddings e
    JOIN knowledge_chunks kc ON kc.id = e.chunk_id
    WHERE kc.shop_id = ${shopId}::uuid
    ORDER BY e.embedding <=> ${JSON.stringify(queryVector)}::vector
    LIMIT ${topK}
  `;
  return rows;
}
```

The `<=>` operator is cosine distance (requires pgvector). Rows are returned most-similar first.
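For intuition, this is what the `<=>` operator computes, in plain JS (illustrative only; in production pgvector evaluates this in the database over the stored embeddings):

```javascript
// Cosine distance = 1 − cosine similarity: 0 for vectors pointing the same
// way, 1 for orthogonal vectors, 2 for opposite vectors. The SQL above turns
// the distance back into a similarity score with `1 - (a <=> b)`.
function cosineDistance(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineDistance([1, 0], [1, 0])); // → 0 (same direction)
console.log(cosineDistance([1, 0], [0, 1])); // → 1 (orthogonal)
```

Because cosine distance ignores magnitude, a long product description and a short query can still score as near-identical if they point the same way in embedding space.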
**Empty context fallback:** If `searchSimilarChunks` returns 0 rows (e.g. no products ingested yet, or the query is completely unrelated), `generateChatReply` returns a canned message instead of making an LLM call. This prevents wasted spend and confusing generic replies (see section 7 for the code).
## 5. Build the prompt

Create `app/lib/prompt-builder.server.js`:

```js
const SYSTEM_PROMPT = `You are a helpful AI shopping assistant for this Shopify store.
Answer customer questions using only the product information provided below.
If the answer is not in the context, say you're not sure and suggest they contact support.
Be concise, friendly, and helpful.`;

export function buildPrompt(chunks, message, history = []) {
  const context = chunks
    .map((c, i) => `[Product Info ${i + 1}]\n${c.chunk_text}`)
    .join("\n\n");

  const messages = [
    { role: "system", content: `${SYSTEM_PROMPT}\n\n--- Product Context ---\n${context}` },
    ...history.slice(-6), // last 3 exchanges (6 messages)
    { role: "user", content: message },
  ];

  return messages;
}
```

**History:** Load the last N messages from `chat_messages` for the session so the LLM has conversation context.
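A quick standalone check of the resulting message shape. The `buildPrompt` body is repeated from above (with a shortened system prompt); the sample chunk and history are illustrative:

```javascript
// Same builder as prompt-builder.server.js, with an abridged system prompt.
const SYSTEM_PROMPT = "You are a helpful AI shopping assistant for this Shopify store.";

function buildPrompt(chunks, message, history = []) {
  const context = chunks
    .map((c, i) => `[Product Info ${i + 1}]\n${c.chunk_text}`)
    .join("\n\n");
  return [
    { role: "system", content: `${SYSTEM_PROMPT}\n\n--- Product Context ---\n${context}` },
    ...history.slice(-6), // window: keep at most the last 6 history messages
    { role: "user", content: message },
  ];
}

// 10 prior messages; slice(-6) keeps only the newest 6 of them.
const msgs = buildPrompt(
  [{ chunk_text: "Blue running shoes, sizes 7-12, $89" }],
  "Do you ship to Canada?",
  Array.from({ length: 10 }, (_, i) => ({
    role: i % 2 ? "assistant" : "user",
    content: `turn ${i}`,
  }))
);

console.log(msgs.length);     // → 8 (1 system + 6 history + 1 new user message)
console.log(msgs[1].content); // → "turn 4" (oldest surviving history turn)
```

So regardless of session length, the prompt carries a bounded window: one system message, at most six history messages, and the new question.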
## 6. OpenRouter chat API call

Create `app/lib/chat.server.js`.
Model names are read from `.env` so they can be changed without code deploys (set `LLM_MODEL_FREE` and `LLM_MODEL_PAID`).
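Before wiring the full client, the plan-to-model fallback can be sanity-checked on its own (`modelForPlan` is an illustrative wrapper around the same `getPlanModels` logic used in `chat.server.js`):

```javascript
// Env-driven model selection: any unknown plan falls back to the free model,
// and each env var falls back to a hard-coded default when unset.
function getPlanModels() {
  return {
    free: process.env.LLM_MODEL_FREE || "mistralai/mistral-small-3.2-24b:free",
    paid: process.env.LLM_MODEL_PAID || "openai/gpt-4o-mini",
  };
}

function modelForPlan(plan) {
  const PLAN_MODELS = getPlanModels();
  return PLAN_MODELS[plan] ?? PLAN_MODELS.free;
}

console.log(modelForPlan("paid"));       // openai/gpt-4o-mini (unless LLM_MODEL_PAID is set)
console.log(modelForPlan("enterprise")); // unknown plan → falls back to the free model
```

Reading the env vars inside the function (rather than at module load) means a restarted process picks up new model names with no code change.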
```js
const CHAT_API_URL = "https://openrouter.ai/api/v1/chat/completions";

// Read model names from .env with sensible defaults
function getPlanModels() {
  return {
    free: process.env.LLM_MODEL_FREE || "mistralai/mistral-small-3.2-24b:free",
    paid: process.env.LLM_MODEL_PAID || "openai/gpt-4o-mini",
  };
}

export async function callOpenRouterChat(messages, plan = "free") {
  const PLAN_MODELS = getPlanModels();
  const model = PLAN_MODELS[plan] ?? PLAN_MODELS.free;

  const res = await fetch(CHAT_API_URL, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
      "HTTP-Referer": process.env.SHOPIFY_APP_URL || "https://ai-chat.appifire.com",
      "X-Title": "AppiFire AI Chat",
    },
    body: JSON.stringify({
      model,
      messages,
      max_tokens: 512,
      temperature: 0.3,
      usage: { include: true }, // opt in to usage accounting so the response includes cost
    }),
  });

  if (!res.ok) {
    const err = await res.text();
    throw new Error(`OpenRouter chat failed: ${res.status} ${err}`);
  }

  const data = await res.json();
  const reply = data.choices[0].message.content;
  const usage = data.usage; // { prompt_tokens, completion_tokens, total_tokens, cost }
  const cost = data.usage?.cost ?? 0; // OpenRouter returns cost in USD

  return { reply, model, usage, cost };
}
```

## 7. Write to DB and return reply
Create `app/lib/rag.server.js` — the main orchestrator.

Key additions vs. the original plan:

- Both `checkReplyLimit` and `checkDailySpendLimit` are called before any AI work.
- Query embedding uses `getEmbeddingsBatched` to capture cost/tokens.
- A dedicated `openRouterCall` row is created for the query embedding (`purpose: "Customer Chat"`), in addition to the one for the chat completion.
- If `searchSimilarChunks` returns 0 results, a canned fallback is returned without calling the LLM.
- A `// TODO Phase 5` stub marks where the low-balance email check will be wired in.
```js
import prisma from "../db.server.js";
import { checkReplyLimit, checkDailySpendLimit } from "./limits.server.js";
import { getEmbeddingsBatched } from "./embeddings.server.js";
import { searchSimilarChunks } from "./vector-search.server.js";
import { buildPrompt } from "./prompt-builder.server.js";
import { callOpenRouterChat } from "./chat.server.js";

const EMBEDDING_MODEL = "openai/text-embedding-3-small";

// Read markup factors from .env (same pattern as ingestion.server.js)
function getChatMarkupFactor() {
  const v = process.env.CHAT_MARKUP_FACTOR;
  const n = v != null && v !== "" ? parseFloat(v) : 2;
  return Number.isFinite(n) && n > 0 ? n : 2;
}

function getEmbeddingMarkupFactor() {
  const v = process.env.EMBEDDING_MARKUP_FACTOR;
  const n = v != null && v !== "" ? parseFloat(v) : 2;
  return Number.isFinite(n) && n > 0 ? n : 2;
}

export async function generateChatReply({ shop, visitorId, sessionId, message }) {
  // 1. Enforce monthly reply limit
  await checkReplyLimit(prisma, shop.id, shop.plan, shop.addOnCredits ?? 0);

  // 2. Enforce daily spend limit (null = no limit; provided by Phase 4 settings)
  await checkDailySpendLimit(prisma, shop.id, shop.dailySpendLimitCents ?? null);

  // 3. Get or create chat session
  let session;
  if (sessionId) {
    session = await prisma.chatSession.findFirst({ where: { id: sessionId, shopId: shop.id } });
  }
  if (!session) {
    session = await prisma.chatSession.create({
      data: { shopId: shop.id, visitorId: visitorId || "anonymous", startedAt: new Date() },
    });
  }

  // 4. Load conversation history BEFORE saving the new message, so the prompt
  //    does not contain the current question twice. Fetch newest-first so
  //    `take` grabs the most recent messages, then restore chronological order.
  const history = await prisma.chatMessage.findMany({
    where: { sessionId: session.id },
    orderBy: { createdAt: "desc" },
    take: 10, // most recent 10; buildPrompt keeps the last 6 of these
    select: { role: true, messageText: true },
  });
  const historyForPrompt = history
    .reverse()
    .map((m) => ({ role: m.role, content: m.messageText }));

  // 5. Save user message
  await prisma.chatMessage.create({
    data: { sessionId: session.id, role: "user", messageText: message },
  });

  // 6. Embed query — use getEmbeddingsBatched to capture cost/tokens for billing
  const { vectors: [queryVector], totalTokens: embedTokens, totalCost: embedCost } =
    await getEmbeddingsBatched([message]);

  // Log query embedding cost to openrouter_calls (counts towards daily spend limit)
  const embedMarkup = getEmbeddingMarkupFactor();
  await prisma.openRouterCall.create({
    data: {
      shopId: shop.id,
      endpoint: "embeddings",
      model: EMBEDDING_MODEL,
      openrouterCost: embedCost,
      chargedCost: embedCost * embedMarkup,
      markupFactor: embedMarkup,
      tokens: embedTokens || null,
      purpose: "Customer Chat",
    },
  });

  // 7. Vector search
  const chunks = await searchSimilarChunks({ shopId: shop.id, queryVector, topK: 5 });

  // 8. Empty-context fallback — avoid an LLM call when no knowledge exists for this query
  if (chunks.length === 0) {
    return {
      reply: "I don't have enough information to answer that. Please contact the store's support team.",
      sessionId: session.id,
    };
  }

  // 9. Build prompt and call LLM
  const promptMessages = buildPrompt(chunks, message, historyForPrompt);
  const { reply, model, usage, cost: openrouterCost } = await callOpenRouterChat(promptMessages, shop.plan);

  const chatMarkup = getChatMarkupFactor();
  const chargedCost = openrouterCost * chatMarkup;

  // 10. Save assistant message (with cost fields)
  const assistantMessage = await prisma.chatMessage.create({
    data: {
      sessionId: session.id,
      role: "assistant",
      messageText: reply,
      modelUsed: model,
      openrouterCost,
      chargedCost,
      markupFactor: chatMarkup,
    },
  });

  // 11. Save retrieved chunks (audit trail — shows which context drove this reply)
  for (let i = 0; i < chunks.length; i++) {
    await prisma.retrievedChunk.create({
      data: {
        chatMessageId: assistantMessage.id,
        chunkId: chunks[i].id,
        similarityScore: chunks[i].similarity,
        rank: i + 1,
      },
    });
  }

  // 12. Log chat LLM call to openrouter_calls
  await prisma.openRouterCall.create({
    data: {
      shopId: shop.id,
      chatMessageId: assistantMessage.id,
      endpoint: "chat",
      model,
      openrouterCost,
      chargedCost,
      markupFactor: chatMarkup,
      tokens: usage?.total_tokens ?? null,
      purpose: "Customer Chat",
    },
  });

  // TODO Phase 5: check shop.emailAlertOnMinBalance and send low-balance alert if needed

  return { reply, sessionId: session.id };
}
```

## 8. Prisma schema (already implemented)
All cost fields on `ChatMessage` were added in Phase 1 and already exist in `prisma/schema.prisma`. No new migration is needed for Phase 3.

```prisma
model ChatMessage {
  // ... existing fields ...
  modelUsed      String?  @map("model_used")
  openrouterCost Decimal? @map("openrouter_cost") @db.Decimal(12, 6)
  chargedCost    Decimal? @map("charged_cost") @db.Decimal(12, 6)
  markupFactor   Decimal? @map("markup_factor") @db.Decimal(5, 2)
}
```

Verify with:

```bash
npx prisma studio
```

Open `ChatMessage` and confirm the `model_used`, `openrouter_cost`, `charged_cost`, and `markup_factor` columns exist.
## 9. Environment variables

Phase 3 adds two new `.env` keys. Add them to `.env.sample` (and your real `.env`):

```bash
# LLM model per plan (change without code deploy)
LLM_MODEL_FREE=mistralai/mistral-small-3.2-24b:free
LLM_MODEL_PAID=openai/gpt-4o-mini
```

`CHAT_MARKUP_FACTOR` and `EMBEDDING_MARKUP_FACTOR` were already added in Phase 2.
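The markup factors are parsed defensively in `rag.server.js`; here is the same pattern in isolation (`parseMarkupFactor` is an illustrative name for the shared logic):

```javascript
// Mirror of getChatMarkupFactor / getEmbeddingMarkupFactor: unset, blank,
// non-numeric, or non-positive values all fall back to the default of 2.
function parseMarkupFactor(raw) {
  const n = raw != null && raw !== "" ? parseFloat(raw) : 2;
  return Number.isFinite(n) && n > 0 ? n : 2;
}

console.log(parseMarkupFactor("2.5")); // → 2.5 (valid override)
console.log(parseMarkupFactor(""));    // → 2   (blank env var)
console.log(parseMarkupFactor("abc")); // → 2   (non-numeric)
console.log(parseMarkupFactor("-1"));  // → 2   (non-positive)
```

This guards the billing math: a typo in `.env` silently reverts to the default 2× markup instead of producing `NaN` (or negative) charged costs.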
## 10. New helper files summary

| File | Purpose |
|---|---|
| `app/routes/api.chat.jsx` | Public `POST /api/chat` endpoint; input validation, shop loading, error handling |
| `app/lib/cors.server.js` | CORS headers for storefront requests |
| `app/lib/limits.server.js` | `checkReplyLimit` (monthly cap) + `checkDailySpendLimit` (daily spend cap) |
| `app/lib/vector-search.server.js` | `searchSimilarChunks` — pgvector cosine search via raw SQL |
| `app/lib/prompt-builder.server.js` | `buildPrompt` — system + context + history |
| `app/lib/chat.server.js` | `callOpenRouterChat` — LLM API call; model read from `.env` |
| `app/lib/rag.server.js` | `generateChatReply` — full RAG orchestrator |
## 11. Testing the endpoint

Start the server with `npm run dev` (or use your deployed URL).
### Basic reply

```bash
curl -X POST https://ai-chat.appifire.com/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "shopDomain": "your-store.myshopify.com",
    "visitorId": "test-visitor-1",
    "message": "Do you have any blue running shoes?"
  }'
```

Expected:

```json
{ "reply": "Yes! We carry ...", "sessionId": "uuid-of-session" }
```

### Session continuation
Pass the `sessionId` from the first response to keep conversation history:

```bash
curl -X POST https://ai-chat.appifire.com/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "shopDomain": "your-store.myshopify.com",
    "visitorId": "test-visitor-1",
    "sessionId": "uuid-of-session",
    "message": "What sizes do they come in?"
  }'
```

Expected: the reply references the previous shoe question.
### Reply limit exceeded (429)

Set the shop to `plan: "free"` in the DB and manually insert 50 `role: "assistant"` rows for this month, then send another message. Expected: HTTP 429 with "Monthly reply limit reached.".
### Daily spend limit exceeded (429)

Set `daily_spend_limit_cents = 1` on the shop row (= $0.01), then send a message. Expected: HTTP 429 with "Daily spend limit reached.".
### Input too long (400)

```bash
curl -X POST https://ai-chat.appifire.com/api/chat \
  -H "Content-Type: application/json" \
  -d "{\"shopDomain\":\"your-store.myshopify.com\",\"message\":\"$(python3 -c 'print("a"*2001)')\"}"
```

Expected: HTTP 400 with "Message too long (max 2000 chars)".
### No context (empty knowledge base)

Delete all `embeddings` rows for the shop (or test before any sync), then send a message. Expected: canned fallback reply with no LLM call made (verify in `openrouter_calls`: only one row for the query embedding, no `endpoint: "chat"` row).
### DB verification after a successful reply

```sql
-- One chat session
SELECT * FROM chat_sessions WHERE shop_id = '<uuid>';

-- User + assistant messages (assistant has cost fields set)
SELECT role, LEFT(message_text, 60), model_used, openrouter_cost, charged_cost
FROM chat_messages
WHERE session_id = '<session-uuid>'
ORDER BY created_at;

-- Retrieved chunks that drove the reply
SELECT rc.rank, rc.similarity_score, LEFT(kc.chunk_text, 80)
FROM retrieved_chunks rc
JOIN knowledge_chunks kc ON kc.id = rc.chunk_id
WHERE rc.chat_message_id = '<assistant-message-uuid>'
ORDER BY rc.rank;

-- Two openrouter_calls per reply: one "embeddings" + one "chat", both purpose = "Customer Chat"
SELECT endpoint, model, openrouter_cost, charged_cost, tokens, purpose
FROM openrouter_calls
WHERE shop_id = '<uuid>'
ORDER BY created_at DESC
LIMIT 5;
```

## Checklist
- `app/routes/api.chat.jsx` created and accessible at `/api/chat`; GET returns `{"status":"ok"}`.
- Input validation: missing fields return 400; message > 2000 chars returns 400.
- CORS headers present on all responses (check with `curl -I` or browser console).
- `app/lib/limits.server.js` created with `checkReplyLimit` and `checkDailySpendLimit`.
- Both limits called at start of `generateChatReply`; over-limit returns 429.
- `app/lib/vector-search.server.js` created; `searchSimilarChunks` returns rows filtered by `shop_id`.
- Empty context (0 chunks) returns canned fallback without creating an `openrouter_calls` chat row.
- `app/lib/prompt-builder.server.js` created; prompt includes system message, context chunks, and history.
- `app/lib/chat.server.js` created; model names read from `LLM_MODEL_FREE` / `LLM_MODEL_PAID` in `.env`.
- `app/lib/rag.server.js` created; full flow runs end-to-end.
- Query embedding logged to `openrouter_calls` with `purpose: "Customer Chat"`, `endpoint: "embeddings"`.
- Chat LLM call logged to `openrouter_calls` with `purpose: "Customer Chat"`, `endpoint: "chat"`, linked to `chatMessageId`.
- `chat_sessions`, `chat_messages` (user + assistant), `retrieved_chunks`, `openrouter_calls` (×2) all written to DB per reply.
- `chatMessage.openrouterCost`, `chargedCost`, `markupFactor` populated on assistant messages.
- Session continuation works: passing `sessionId` from the first reply keeps conversation history in the prompt.
- `curl` basic test returns a relevant product reply.
- `curl` with an over-limit shop returns 429 `REPLY_LIMIT_EXCEEDED`.
- `curl` with a low `daily_spend_limit_cents` returns 429 `DAILY_SPEND_LIMIT_EXCEEDED`.
- `LLM_MODEL_FREE`, `LLM_MODEL_PAID`, `CHAT_MARKUP_FACTOR`, `EMBEDDING_MARKUP_FACTOR` set in `.env`.