
Option A — Phase 3: RAG & Chat API

This document is the detailed plan for Phase 3 of AppiFire AI Chat (Option A). Prerequisites: Phase 2 complete (products synced and embeddings stored in pgvector).


Build the POST /api/chat endpoint that: validates input, looks up the shop, enforces reply and spend limits, embeds the visitor’s message, retrieves the most relevant product chunks from pgvector, calls OpenRouter for the AI reply, logs all AI costs to openrouter_calls, and returns the answer to the chat widget.


flowchart TD
A["POST /api/chat"] --> B["Validate input\n(length, required fields)"]
B --> C["Look up shop\nby shopDomain"]
C --> D["Check reply limit\n+ credit balance (paid)"]
D -->|"over limit"| X["Return 429"]
D -->|"ok"| E["Embed user message\n+ log embed cost to openrouter_calls"]
E --> F["pgvector search\ntop-5 chunks by shop_id"]
F -->|"0 results"| Y["Return canned fallback\n(no LLM call)"]
F -->|"results"| G["Build prompt\nsystem + context + history"]
G --> H["OpenRouter chat\nmodel by plan"]
H --> I["Save chat_messages\nretrieved_chunks\nopenrouter_calls"]
I --> J["Low-balance check\n(stub for Phase 5)"]
J --> K["Return reply + sessionId"]

Create app/routes/api.chat.jsx — a public endpoint accessible by the storefront widget.

Input validation (max message length of 2000 characters) is done here before any AI call so a very long message cannot inflate embedding costs.

The shop is loaded with an explicit select so all billing fields needed by generateChatReply are present. Fields added in Phase 4 (dailySpendLimitCents, addOnCredits, etc.) will be automatically available once the Phase 4 migration runs.

app/routes/api.chat.jsx
import prisma from "../db.server.js";
import { generateChatReply } from "../lib/rag.server.js";
import { cors } from "../lib/cors.server.js";

export const action = async ({ request }) => {
  // Allow storefront CORS (any *.myshopify.com or custom domain)
  if (request.method === "OPTIONS") {
    return cors(new Response(null, { status: 204 }), request);
  }
  if (request.method !== "POST") {
    return cors(new Response("Method not allowed", { status: 405 }), request);
  }

  // Guard against malformed JSON so a bad body returns 400, not an unhandled 500
  let body;
  try {
    body = await request.json();
  } catch {
    return cors(
      new Response(JSON.stringify({ error: "Request body must be valid JSON" }), { status: 400 }),
      request
    );
  }
  const { shopDomain, visitorId, sessionId, message } = body;

  if (!shopDomain || !message) {
    return cors(
      new Response(JSON.stringify({ error: "shopDomain and message are required" }), { status: 400 }),
      request
    );
  }

  // Guard against very long messages — prevents runaway embedding costs
  if (message.length > 2000) {
    return cors(
      new Response(JSON.stringify({ error: "Message too long (max 2000 chars)" }), { status: 400 }),
      request
    );
  }

  // Load shop with all fields needed by generateChatReply.
  // dailySpendLimitCents, addOnCredits etc. are added by the Phase 4 migration.
  const shop = await prisma.shop.findFirst({
    where: { shopDomain, status: "active" },
    select: {
      id: true,
      shopDomain: true,
      plan: true,
      status: true,
      // Phase 4/5 billing fields (null if migration not yet run)
      addOnCredits: true,
      dailySpendLimitCents: true,
      emailAlertOnMinBalance: true,
      minBalanceForAlertReplies: true,
      lastLowBalanceAlertAt: true,
    },
  });
  if (!shop) {
    return cors(
      new Response(JSON.stringify({ error: "Shop not found" }), { status: 404 }),
      request
    );
  }

  try {
    const result = await generateChatReply({ shop, visitorId, sessionId, message });
    return cors(
      new Response(JSON.stringify({ reply: result.reply, sessionId: result.sessionId }), {
        status: 200,
        headers: { "Content-Type": "application/json" },
      }),
      request
    );
  } catch (err) {
    if (err.code === "REPLY_LIMIT_EXCEEDED") {
      return cors(
        new Response(
          JSON.stringify({ error: "Monthly reply limit reached. Please upgrade your plan." }),
          { status: 429 }
        ),
        request
      );
    }
    if (err.code === "DAILY_SPEND_LIMIT_EXCEEDED") {
      return cors(
        new Response(
          JSON.stringify({ error: "Daily spend limit reached. The store owner can increase it in Settings or try again tomorrow." }),
          { status: 429 }
        ),
        request
      );
    }
    console.error("[api/chat] error:", err);
    return cors(new Response(JSON.stringify({ error: "Something went wrong" }), { status: 500 }), request);
  }
};

// Allow GET for health check
export const loader = async ({ request }) => {
  if (request.method === "OPTIONS") return cors(new Response(null, { status: 204 }), request);
  return cors(new Response(JSON.stringify({ status: "ok" }), { status: 200 }), request);
};

Create app/lib/cors.server.js:

export function cors(response, request) {
  const origin = request.headers.get("Origin") || "*";
  const headers = new Headers(response.headers);
  headers.set("Access-Control-Allow-Origin", origin);
  headers.set("Access-Control-Allow-Methods", "GET, POST, OPTIONS");
  headers.set("Access-Control-Allow-Headers", "Content-Type, Authorization");
  return new Response(response.body, { status: response.status, headers });
}

2. Reply limit and spend limit enforcement


Create app/lib/limits.server.js with two guards called at the start of generateChatReply.

// Free: 50 replies/mo. Paid: 500 from subscription + add-on credits.
const PLAN_LIMITS = {
  free: 50,
  paid: 500, // base included; add addOnCredits for total allowance
};

export async function checkReplyLimit(prisma, shopId, plan, addOnCredits = 0) {
  const baseLimit = PLAN_LIMITS[plan] ?? 50;
  const limit = plan === "paid" ? baseLimit + addOnCredits : baseLimit;
  if (limit === -1) return; // -1 = unlimited (reserved for future plans)

  const startOfMonth = new Date();
  startOfMonth.setDate(1);
  startOfMonth.setHours(0, 0, 0, 0);

  const count = await prisma.chatMessage.count({
    where: {
      role: "assistant",
      createdAt: { gte: startOfMonth },
      session: { shopId },
    },
  });
  if (count >= limit) {
    const err = new Error("Monthly reply limit exceeded");
    err.code = "REPLY_LIMIT_EXCEEDED";
    throw err;
  }
}

// Daily spend limit (from Settings → Credits & spending). Call before generating reply.
// estimatedCostCents: worst-case estimate for this reply (default ~$0.02).
export async function checkDailySpendLimit(prisma, shopId, dailySpendLimitCents, estimatedCostCents = 2) {
  if (dailySpendLimitCents == null || dailySpendLimitCents <= 0) return;

  const startOfToday = new Date();
  startOfToday.setHours(0, 0, 0, 0);

  const todaySum = await prisma.openRouterCall.aggregate({
    where: { shopId, createdAt: { gte: startOfToday } },
    _sum: { chargedCost: true },
  });
  // Number() coerces Prisma's Decimal result before the cents conversion
  const todayCents = Math.round(Number(todaySum._sum.chargedCost ?? 0) * 100);
  if (todayCents + estimatedCostCents > dailySpendLimitCents) {
    const err = new Error("Daily spend limit reached. Try again tomorrow or increase the limit in Settings.");
    err.code = "DAILY_SPEND_LIMIT_EXCEEDED";
    throw err;
  }
}
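The comparison at the heart of checkDailySpendLimit can be isolated as a pure function for unit testing. The helper name below is hypothetical (it is not one of the plan's files); it just mirrors the dollars-to-cents rounding the guard performs:

```javascript
// Pure version of the spend check: prior spend in dollars, estimate and limit in cents.
function wouldExceedDailyLimit(todaySpendDollars, estimatedCostCents, dailySpendLimitCents) {
  if (dailySpendLimitCents == null || dailySpendLimitCents <= 0) return false; // no limit set
  const todayCents = Math.round(todaySpendDollars * 100);
  return todayCents + estimatedCostCents > dailySpendLimitCents;
}

console.log(wouldExceedDailyLimit(0.0, 2, 1));   // true  — even the first reply exceeds a 1-cent cap
console.log(wouldExceedDailyLimit(0.10, 2, 50)); // false — 10c spent + 2c estimate stays under 50c
console.log(wouldExceedDailyLimit(0.49, 2, 50)); // true  — 49c + 2c crosses the 50c cap
console.log(wouldExceedDailyLimit(0.10, 2, 0));  // false — 0 means "no limit configured"
```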

Both functions are imported into rag.server.js and called at the very start of generateChatReply before any AI calls.


Use getEmbeddingsBatched (from app/lib/embeddings.server.js) instead of getEmbeddings. getEmbeddingsBatched returns { vectors, totalTokens, totalCost } so the embedding cost can be logged to openrouter_calls with purpose: "Customer Chat", consistent with how product ingestion logs embeddings as purpose: "Product Knowledge".

// In rag.server.js — step 6 (see full code in section 7)
const { vectors: [queryVector], totalTokens: embedTokens, totalCost: embedCost } =
  await getEmbeddingsBatched([message]);

Immediately after, create an openRouterCall row for the query embedding (section 7 shows this in context).


Create app/lib/vector-search.server.js:

import prisma from "../db.server.js";

export async function searchSimilarChunks({ shopId, queryVector, topK = 5 }) {
  // Use raw SQL because Prisma does not support pgvector operators
  const rows = await prisma.$queryRaw`
    SELECT
      kc.id,
      kc.chunk_text,
      kc.product_id,
      1 - (e.embedding <=> ${JSON.stringify(queryVector)}::vector) AS similarity
    FROM embeddings e
    JOIN knowledge_chunks kc ON kc.id = e.chunk_id
    WHERE kc.shop_id = ${shopId}::uuid
    ORDER BY e.embedding <=> ${JSON.stringify(queryVector)}::vector
    LIMIT ${topK}
  `;
  return rows;
}

The <=> operator is cosine distance (requires pgvector). Returns rows sorted by most-similar first.
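For intuition, the similarity column is plain cosine similarity: 1 minus the cosine distance pgvector computes. A small reference implementation for sanity-checking query results:

```javascript
// Cosine similarity — what `1 - (a <=> b)` evaluates to in pgvector.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([1, 0], [1, 0])); // 1 — identical direction
console.log(cosineSimilarity([1, 0], [0, 1])); // 0 — orthogonal
```

If retrieval slows down as catalogs grow, pgvector also supports approximate indexes (e.g. `CREATE INDEX ... USING hnsw (embedding vector_cosine_ops)`); whether one is worth adding is a later judgment call, not a Phase 3 requirement.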

Empty context fallback: If searchSimilarChunks returns 0 rows (e.g. no products ingested yet, or the query is completely unrelated), generateChatReply returns a canned message instead of making an LLM call. This prevents wasted spend and confusing generic replies (see section 7 for code).


Create app/lib/prompt-builder.server.js:

const SYSTEM_PROMPT = `You are a helpful AI shopping assistant for this Shopify store.
Answer customer questions using only the product information provided below.
If the answer is not in the context, say you're not sure and suggest they contact support.
Be concise, friendly, and helpful.`;

export function buildPrompt(chunks, message, history = []) {
  const context = chunks
    .map((c, i) => `[Product Info ${i + 1}]\n${c.chunk_text}`)
    .join("\n\n");
  const messages = [
    { role: "system", content: `${SYSTEM_PROMPT}\n\n--- Product Context ---\n${context}` },
    ...history.slice(-6), // last 3 exchanges (6 messages)
    { role: "user", content: message },
  ];
  return messages;
}
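As a quick sanity check of the array this produces (buildPrompt is repeated inline, with a shortened system prompt, so the snippet runs standalone):

```javascript
// Inline copy of buildPrompt with an abbreviated system prompt for the demo.
const SYSTEM_PROMPT = "You are a helpful AI shopping assistant for this Shopify store.";
function buildPrompt(chunks, message, history = []) {
  const context = chunks
    .map((c, i) => `[Product Info ${i + 1}]\n${c.chunk_text}`)
    .join("\n\n");
  return [
    { role: "system", content: `${SYSTEM_PROMPT}\n\n--- Product Context ---\n${context}` },
    ...history.slice(-6),
    { role: "user", content: message },
  ];
}

const messages = buildPrompt(
  [{ chunk_text: "Blue running shoes, sizes 7-12, $89." }],
  "What sizes do the blue shoes come in?",
  [
    { role: "user", content: "Do you have blue running shoes?" },
    { role: "assistant", content: "Yes! We carry blue running shoes." },
  ]
);
console.log(messages.length);  // 4: system + 2 history turns + current user turn
console.log(messages[0].role); // "system" — context chunks live inside this message
```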

History: Load the last N messages from chat_messages for the session so the LLM has conversation context.
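One subtlety when loading history: a findMany with `orderBy: { createdAt: "asc" }` plus `take` returns the *oldest* rows, not the latest. Fetch descending and then reverse to get the most recent N in chronological order. The query itself needs a live database, so only the post-processing is shown:

```javascript
// Rows as Prisma would return them with orderBy: { createdAt: "desc" }, take: 4
const newestFirst = [
  { role: "assistant", messageText: "They come in sizes 7-12." },
  { role: "user", messageText: "What sizes?" },
  { role: "assistant", messageText: "Yes, we carry them." },
  { role: "user", messageText: "Do you have blue shoes?" },
];

// Restore chronological order and map to the LLM message shape.
const historyForPrompt = newestFirst
  .slice()
  .reverse()
  .map((m) => ({ role: m.role, content: m.messageText }));

console.log(historyForPrompt[0].content); // "Do you have blue shoes?" — oldest first
```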


Create app/lib/chat.server.js.

Model names are read from .env so they can be changed without code deploys (set LLM_MODEL_FREE and LLM_MODEL_PAID).

const CHAT_API_URL = "https://openrouter.ai/api/v1/chat/completions";

// Read model names from .env with sensible defaults
function getPlanModels() {
  return {
    free: process.env.LLM_MODEL_FREE || "mistralai/mistral-small-3.2-24b:free",
    paid: process.env.LLM_MODEL_PAID || "openai/gpt-4o-mini",
  };
}

export async function callOpenRouterChat(messages, plan = "free") {
  const PLAN_MODELS = getPlanModels();
  const model = PLAN_MODELS[plan] ?? PLAN_MODELS.free;
  const res = await fetch(CHAT_API_URL, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
      "HTTP-Referer": process.env.SHOPIFY_APP_URL || "https://ai-chat.appifire.com",
      "X-Title": "AppiFire AI Chat",
    },
    body: JSON.stringify({
      model,
      messages,
      max_tokens: 512,
      temperature: 0.3,
      // Ask OpenRouter for usage accounting; without this, data.usage.cost is not returned.
      usage: { include: true },
    }),
  });
  if (!res.ok) {
    const err = await res.text();
    throw new Error(`OpenRouter chat failed: ${res.status} ${err}`);
  }
  const data = await res.json();
  const reply = data.choices?.[0]?.message?.content;
  if (!reply) {
    throw new Error("OpenRouter chat returned no message content");
  }
  const usage = data.usage; // { prompt_tokens, completion_tokens, total_tokens, cost }
  const cost = data.usage?.cost ?? 0; // cost in USD
  return { reply, model, usage, cost };
}
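The parsing at the end of callOpenRouterChat assumes the usual chat-completion payload shape. A worked example against a mock payload (the numbers are illustrative, not real OpenRouter output):

```javascript
// Mock of the fields callOpenRouterChat reads from the OpenRouter response.
const data = {
  choices: [{ message: { role: "assistant", content: "Yes! We carry blue running shoes." } }],
  usage: { prompt_tokens: 412, completion_tokens: 38, total_tokens: 450, cost: 0.00031 },
};

const reply = data.choices?.[0]?.message?.content;
const usage = data.usage;
const cost = data.usage?.cost ?? 0; // defaults to 0 if cost accounting was not requested

console.log(reply);              // "Yes! We carry blue running shoes."
console.log(usage.total_tokens); // 450
console.log(cost);               // 0.00031
```

If cost consistently comes back 0 in practice, check that the request enables OpenRouter's usage accounting; per their docs, cost is only included when it is requested.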

Create app/lib/rag.server.js — the main orchestrator.

Key additions vs. the original plan:

  • Both checkReplyLimit and checkDailySpendLimit are called before any AI work.
  • Query embedding uses getEmbeddingsBatched to capture cost/tokens.
  • A dedicated openRouterCall row is created for the query embedding (purpose: "Customer Chat"), in addition to the one for the chat completion.
  • If searchSimilarChunks returns 0 results, a canned fallback is returned without calling the LLM.
  • A // TODO Phase 5 stub marks where the low-balance email check will be wired in.
import prisma from "../db.server.js";
import { checkReplyLimit, checkDailySpendLimit } from "./limits.server.js";
import { getEmbeddingsBatched } from "./embeddings.server.js";
import { searchSimilarChunks } from "./vector-search.server.js";
import { buildPrompt } from "./prompt-builder.server.js";
import { callOpenRouterChat } from "./chat.server.js";

const EMBEDDING_MODEL = "openai/text-embedding-3-small";

// Read markup factors from .env (same pattern as ingestion.server.js)
function getChatMarkupFactor() {
  const v = process.env.CHAT_MARKUP_FACTOR;
  const n = v != null && v !== "" ? parseFloat(v) : 2;
  return Number.isFinite(n) && n > 0 ? n : 2;
}

function getEmbeddingMarkupFactor() {
  const v = process.env.EMBEDDING_MARKUP_FACTOR;
  const n = v != null && v !== "" ? parseFloat(v) : 2;
  return Number.isFinite(n) && n > 0 ? n : 2;
}

export async function generateChatReply({ shop, visitorId, sessionId, message }) {
  // 1. Enforce monthly reply limit
  await checkReplyLimit(prisma, shop.id, shop.plan, shop.addOnCredits ?? 0);

  // 2. Enforce daily spend limit (null = no limit; provided by Phase 4 settings)
  await checkDailySpendLimit(prisma, shop.id, shop.dailySpendLimitCents ?? null);

  // 3. Get or create chat session
  let session;
  if (sessionId) {
    session = await prisma.chatSession.findFirst({ where: { id: sessionId, shopId: shop.id } });
  }
  if (!session) {
    session = await prisma.chatSession.create({
      data: { shopId: shop.id, visitorId: visitorId || "anonymous", startedAt: new Date() },
    });
  }

  // 4. Load conversation history BEFORE saving the current message, so the
  //    prompt does not contain the user's question twice. Fetch the newest
  //    6 messages (3 exchanges) and restore chronological order.
  const history = await prisma.chatMessage.findMany({
    where: { sessionId: session.id },
    orderBy: { createdAt: "desc" },
    take: 6,
    select: { role: true, messageText: true },
  });
  const historyForPrompt = history
    .reverse()
    .map((m) => ({ role: m.role, content: m.messageText }));

  // 5. Save user message
  await prisma.chatMessage.create({
    data: { sessionId: session.id, role: "user", messageText: message },
  });

  // 6. Embed query — use getEmbeddingsBatched to capture cost/tokens for billing
  const { vectors: [queryVector], totalTokens: embedTokens, totalCost: embedCost } =
    await getEmbeddingsBatched([message]);

  // Log query embedding cost to openrouter_calls (counts towards daily spend limit)
  const embedMarkup = getEmbeddingMarkupFactor();
  await prisma.openRouterCall.create({
    data: {
      shopId: shop.id,
      endpoint: "embeddings",
      model: EMBEDDING_MODEL,
      openrouterCost: embedCost,
      chargedCost: embedCost * embedMarkup,
      markupFactor: embedMarkup,
      tokens: embedTokens || null,
      purpose: "Customer Chat",
    },
  });

  // 7. Vector search
  const chunks = await searchSimilarChunks({ shopId: shop.id, queryVector, topK: 5 });

  // 8. Empty-context fallback — avoid an LLM call when no knowledge exists for this query
  if (chunks.length === 0) {
    return {
      reply: "I don't have enough information to answer that. Please contact the store's support team.",
      sessionId: session.id,
    };
  }

  // 9. Build prompt and call LLM
  const promptMessages = buildPrompt(chunks, message, historyForPrompt);
  const { reply, model, usage, cost: openrouterCost } = await callOpenRouterChat(promptMessages, shop.plan);
  const chatMarkup = getChatMarkupFactor();
  const chargedCost = openrouterCost * chatMarkup;

  // 10. Save assistant message (with cost fields)
  const assistantMessage = await prisma.chatMessage.create({
    data: {
      sessionId: session.id,
      role: "assistant",
      messageText: reply,
      modelUsed: model,
      openrouterCost,
      chargedCost,
      markupFactor: chatMarkup,
    },
  });

  // 11. Save retrieved chunks (audit trail — shows which context drove this reply)
  await prisma.retrievedChunk.createMany({
    data: chunks.map((c, i) => ({
      chatMessageId: assistantMessage.id,
      chunkId: c.id,
      similarityScore: c.similarity,
      rank: i + 1,
    })),
  });

  // 12. Log chat LLM call to openrouter_calls
  await prisma.openRouterCall.create({
    data: {
      shopId: shop.id,
      chatMessageId: assistantMessage.id,
      endpoint: "chat",
      model,
      openrouterCost,
      chargedCost,
      markupFactor: chatMarkup,
      tokens: usage?.total_tokens ?? null,
      purpose: "Customer Chat",
    },
  });

  // TODO Phase 5: check shop.emailAlertOnMinBalance and send low-balance alert if needed
  return { reply, sessionId: session.id };
}
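As a sanity check on the markup arithmetic: each reply is billed at raw provider cost times the env-driven markup factor, summed across the embedding call and the chat call. The helper name below is hypothetical, shown only to make the arithmetic concrete:

```javascript
// Per-reply charge under default markups: one embedding call plus one chat call.
function replyCharge({ embedCost, chatCost }, embedMarkup = 2, chatMarkup = 2) {
  return embedCost * embedMarkup + chatCost * chatMarkup;
}

// e.g. $0.000002 embedding + $0.00031 chat completion, both marked up 2x
const totalCharged = replyCharge({ embedCost: 0.000002, chatCost: 0.00031 });
console.log(totalCharged.toFixed(6)); // "0.000624"
```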

All cost fields on ChatMessage were added in Phase 1 and already exist in prisma/schema.prisma. No new migration is needed for Phase 3.

model ChatMessage {
  // ... existing fields ...
  modelUsed      String?  @map("model_used")
  openrouterCost Decimal? @map("openrouter_cost") @db.Decimal(12, 6)
  chargedCost    Decimal? @map("charged_cost") @db.Decimal(12, 6)
  markupFactor   Decimal? @map("markup_factor") @db.Decimal(5, 2)
}

Verify with:

npx prisma studio

Open ChatMessage and confirm model_used, openrouter_cost, charged_cost, markup_factor columns exist.


Phase 3 adds two new .env keys. Add them to .env.sample (and your real .env):

# LLM model per plan (change without code deploy)
LLM_MODEL_FREE=mistralai/mistral-small-3.2-24b:free
LLM_MODEL_PAID=openai/gpt-4o-mini

CHAT_MARKUP_FACTOR and EMBEDDING_MARKUP_FACTOR were already added in Phase 2.


Files created in this phase:

  • app/routes/api.chat.jsx: public POST /api/chat endpoint (input validation, shop lookup, error handling)
  • app/lib/cors.server.js: CORS headers for storefront requests
  • app/lib/limits.server.js: checkReplyLimit (monthly reply cap) and checkDailySpendLimit (daily spend cap)
  • app/lib/vector-search.server.js: searchSimilarChunks, pgvector cosine search via raw SQL
  • app/lib/prompt-builder.server.js: buildPrompt, system + context + history
  • app/lib/chat.server.js: callOpenRouterChat, LLM API call with model names read from .env
  • app/lib/rag.server.js: generateChatReply, the full RAG orchestrator

Start the server with npm run dev (or use your deployed URL).

curl -X POST https://ai-chat.appifire.com/api/chat \
-H "Content-Type: application/json" \
-d '{
"shopDomain": "your-store.myshopify.com",
"visitorId": "test-visitor-1",
"message": "Do you have any blue running shoes?"
}'

Expected:

{ "reply": "Yes! We carry ...", "sessionId": "uuid-of-session" }

Pass the sessionId from the first response to keep conversation history:

curl -X POST https://ai-chat.appifire.com/api/chat \
-H "Content-Type: application/json" \
-d '{
"shopDomain": "your-store.myshopify.com",
"visitorId": "test-visitor-1",
"sessionId": "uuid-of-session",
"message": "What sizes do they come in?"
}'

Expected: the reply references the previous shoe question.

Set the shop to plan: "free" in DB and manually insert 50 role: "assistant" rows for this month, then send another message. Expected HTTP 429 with "Monthly reply limit reached.".

Set daily_spend_limit_cents = 1 on the shop row (= $0.01), then send a message. Expected HTTP 429 with "Daily spend limit reached.".

curl -X POST https://ai-chat.appifire.com/api/chat \
-H "Content-Type: application/json" \
-d "{\"shopDomain\":\"your-store.myshopify.com\",\"message\":\"$(python3 -c 'print(\"a\"*2001)')\"}"

Expected HTTP 400 with "Message too long (max 2000 chars)".

Delete all embeddings rows for the shop (or test before any sync), then send a message. Expected: canned fallback reply — no LLM call made (verify in openrouter_calls: only one row for the query embedding, no endpoint: "chat" row).

-- One chat session
SELECT * FROM chat_sessions WHERE shop_id = '<uuid>';
-- User + assistant messages (assistant has cost fields set)
SELECT role, LEFT(message_text, 60), model_used, openrouter_cost, charged_cost
FROM chat_messages
WHERE session_id = '<session-uuid>'
ORDER BY created_at;
-- Retrieved chunks that drove the reply
SELECT rc.rank, rc.similarity_score, LEFT(kc.chunk_text, 80)
FROM retrieved_chunks rc
JOIN knowledge_chunks kc ON kc.id = rc.chunk_id
WHERE rc.chat_message_id = '<assistant-message-uuid>'
ORDER BY rc.rank;
-- Two openrouter_calls per reply: one "embeddings" + one "chat", both purpose = "Customer Chat"
SELECT endpoint, model, openrouter_cost, charged_cost, tokens, purpose
FROM openrouter_calls
WHERE shop_id = '<uuid>'
ORDER BY created_at DESC
LIMIT 5;

  • app/routes/api.chat.jsx created and accessible at /api/chat; GET returns {"status":"ok"}.
  • Input validation: missing fields return 400; message > 2000 chars returns 400.
  • CORS headers present on all responses (check with curl -I or browser console).
  • app/lib/limits.server.js created with checkReplyLimit and checkDailySpendLimit.
  • Both limits called at start of generateChatReply; over-limit returns 429.
  • app/lib/vector-search.server.js created; searchSimilarChunks returns rows filtered by shop_id.
  • Empty context (0 chunks) returns canned fallback without creating an openrouter_calls chat row.
  • app/lib/prompt-builder.server.js created; prompt includes system message, context chunks, and history.
  • app/lib/chat.server.js created; model names read from LLM_MODEL_FREE / LLM_MODEL_PAID in .env.
  • app/lib/rag.server.js created; full flow runs end-to-end.
  • Query embedding logged to openrouter_calls with purpose: "Customer Chat", endpoint: "embeddings".
  • Chat LLM call logged to openrouter_calls with purpose: "Customer Chat", endpoint: "chat", linked to chatMessageId.
  • chat_sessions, chat_messages (user + assistant), retrieved_chunks, openrouter_calls (×2) all written to DB per reply.
  • chatMessage.openrouterCost, chargedCost, markupFactor populated on assistant messages.
  • Session continuation works: passing sessionId from first reply keeps conversation history in prompt.
  • curl basic test returns a relevant product reply.
  • curl with over-limit shop returns 429 REPLY_LIMIT_EXCEEDED.
  • curl with daily_spend_limit_cents = 1 returns 429 DAILY_SPEND_LIMIT_EXCEEDED.
  • LLM_MODEL_FREE, LLM_MODEL_PAID, CHAT_MARKUP_FACTOR, EMBEDDING_MARKUP_FACTOR set in .env.

Next: Option-A-Phase-4-Admin-Settings-and-Chat-Widget.md