
Option A — Phase 3: RAG & Chat API

This document is the detailed plan for Phase 3 of AppiFire AI Chat (Option A). Prerequisites: Phase 2 complete (products synced and embeddings stored in pgvector).


Build the POST /api/chat endpoint that: validates input, looks up the shop, enforces reply and spend limits, embeds the visitor’s message, retrieves the most relevant product chunks from pgvector, calls OpenRouter for the AI reply, logs all AI costs to openrouter_calls, and returns the answer to the chat widget.


flowchart TD
A["POST /api/chat"] --> B["Validate input\n(length, required fields)"]
B --> C["Look up shop\nby shopDomain"]
C --> D["Check reply limit\n+ credit balance (paid)"]
D -->|"over limit"| X["Return 429"]
D -->|"ok"| E["Embed user message\n+ log embed cost to openrouter_calls"]
E --> F["pgvector search\ntop-5 chunks by shop_id"]
F -->|"0 results"| Y["Return canned fallback\n(no LLM call)"]
F -->|"results"| G["Build prompt\nsystem + context + history"]
G --> H["OpenRouter chat\nmodel by plan"]
H --> I["Save chat_messages\nretrieved_chunks\nopenrouter_calls"]
I --> J["Low-balance check\n(stub for Phase 5)"]
J --> K["Return reply + sessionId"]

Create app/routes/api.chat.jsx — a public endpoint accessible by the storefront widget.

Input validation (max message length of 2000 characters) is done here before any AI call so a very long message cannot inflate embedding costs.

The shop is loaded with an explicit select so all billing fields needed by generateChatReply are present. Fields added in Phase 4 (dailySpendLimitCents, addOnCredits, etc.) will be automatically available once the Phase 4 migration runs.

app/routes/api.chat.jsx
import prisma from "../db.server.js";
import { generateChatReply } from "../lib/rag.server.js";
import { cors } from "../lib/cors.server.js";

export const action = async ({ request }) => {
  // Allow storefront CORS (any *.myshopify.com or custom domain)
  if (request.method === "OPTIONS") {
    return cors(new Response(null, { status: 204 }), request);
  }
  if (request.method !== "POST") {
    return cors(new Response("Method not allowed", { status: 405 }), request);
  }

  // Guard against malformed JSON so a bad body returns 400, not an unhandled 500
  let body;
  try {
    body = await request.json();
  } catch {
    return cors(
      new Response(JSON.stringify({ error: "Request body must be valid JSON" }), { status: 400 }),
      request
    );
  }
  const { shopDomain, visitorId, sessionId, message } = body;

  if (!shopDomain || !message) {
    return cors(
      new Response(JSON.stringify({ error: "shopDomain and message are required" }), { status: 400 }),
      request
    );
  }

  // Guard against very long messages — prevents runaway embedding costs
  if (message.length > 2000) {
    return cors(
      new Response(JSON.stringify({ error: "Message too long (max 2000 chars)" }), { status: 400 }),
      request
    );
  }

  // Load shop with all fields needed by generateChatReply.
  // dailySpendLimitCents, addOnCredits etc. are added by the Phase 4 migration.
  const shop = await prisma.shop.findFirst({
    where: { shopDomain, status: "active" },
    select: {
      id: true,
      shopDomain: true,
      plan: true,
      status: true,
      // Phase 4/5 billing fields (null if migration not yet run)
      addOnCredits: true,
      dailySpendLimitCents: true,
      emailAlertOnMinBalance: true,
      minBalanceForAlertReplies: true,
      lastLowBalanceAlertAt: true,
    },
  });
  if (!shop) {
    return cors(
      new Response(JSON.stringify({ error: "Shop not found" }), { status: 404 }),
      request
    );
  }

  try {
    const result = await generateChatReply({ shop, visitorId, sessionId, message });
    return cors(
      new Response(JSON.stringify({ reply: result.reply, sessionId: result.sessionId }), {
        status: 200,
        headers: { "Content-Type": "application/json" },
      }),
      request
    );
  } catch (err) {
    if (err.code === "REPLY_LIMIT_EXCEEDED") {
      return cors(
        new Response(
          JSON.stringify({ error: "Monthly reply limit reached. Please upgrade your plan." }),
          { status: 429 }
        ),
        request
      );
    }
    if (err.code === "DAILY_SPEND_LIMIT_EXCEEDED") {
      return cors(
        new Response(
          JSON.stringify({ error: "Daily spend limit reached. The store owner can increase it in Settings or try again tomorrow." }),
          { status: 429 }
        ),
        request
      );
    }
    console.error("[api/chat] error:", err);
    return cors(new Response(JSON.stringify({ error: "Something went wrong" }), { status: 500 }), request);
  }
};

// Allow GET for health check
export const loader = async ({ request }) => {
  if (request.method === "OPTIONS") return cors(new Response(null, { status: 204 }), request);
  return cors(new Response(JSON.stringify({ status: "ok" }), { status: 200 }), request);
};

Create app/lib/cors.server.js:

export function cors(response, request) {
  const origin = request.headers.get("Origin") || "*";
  const headers = new Headers(response.headers);
  headers.set("Access-Control-Allow-Origin", origin);
  headers.set("Access-Control-Allow-Methods", "GET, POST, OPTIONS");
  headers.set("Access-Control-Allow-Headers", "Content-Type, Authorization");
  return new Response(response.body, { status: response.status, headers });
}

2. Reply limit and spend limit enforcement


Create app/lib/limits.server.js with two guards called at the start of generateChatReply.

// Free: 50 replies/mo. Paid: 500 from subscription + add-on credits.
const PLAN_LIMITS = {
  free: 50,
  paid: 500, // base included; add addOnCredits for total allowance
};

export async function checkReplyLimit(prisma, shopId, plan, addOnCredits = 0) {
  const baseLimit = PLAN_LIMITS[plan] ?? 50;
  const limit = plan === "paid" ? baseLimit + addOnCredits : baseLimit;
  if (limit === -1) return; // -1 = unlimited (reserved for future plans)

  const startOfMonth = new Date();
  startOfMonth.setDate(1);
  startOfMonth.setHours(0, 0, 0, 0);

  const count = await prisma.chatMessage.count({
    where: {
      role: "assistant",
      createdAt: { gte: startOfMonth },
      session: { shopId },
    },
  });
  if (count >= limit) {
    const err = new Error("Monthly reply limit exceeded");
    err.code = "REPLY_LIMIT_EXCEEDED";
    throw err;
  }
}

// Daily spend limit (from Settings → Credits & spending). Call before generating reply.
// estimatedCostCents: worst-case estimate for this reply (default ~$0.02).
export async function checkDailySpendLimit(prisma, shopId, dailySpendLimitCents, estimatedCostCents = 2) {
  if (dailySpendLimitCents == null || dailySpendLimitCents <= 0) return;

  const startOfToday = new Date();
  startOfToday.setHours(0, 0, 0, 0);

  const todaySum = await prisma.openRouterCall.aggregate({
    where: { shopId, createdAt: { gte: startOfToday } },
    _sum: { chargedCost: true },
  });
  // Number() coerces Prisma's Decimal result before the cents conversion
  const todayCents = Math.round(Number(todaySum._sum.chargedCost ?? 0) * 100);
  if (todayCents + estimatedCostCents > dailySpendLimitCents) {
    const err = new Error("Daily spend limit reached. Try again tomorrow or increase the limit in Settings.");
    err.code = "DAILY_SPEND_LIMIT_EXCEEDED";
    throw err;
  }
}
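The comparison at the heart of checkDailySpendLimit can be isolated as a pure function for unit testing. The helper name below is hypothetical (it is not one of the plan's files); it just mirrors the dollars-to-cents rounding the guard performs:

```javascript
// Pure version of the spend check: prior spend in dollars, estimate and limit in cents.
function wouldExceedDailyLimit(todaySpendDollars, estimatedCostCents, dailySpendLimitCents) {
  if (dailySpendLimitCents == null || dailySpendLimitCents <= 0) return false; // no limit set
  const todayCents = Math.round(todaySpendDollars * 100);
  return todayCents + estimatedCostCents > dailySpendLimitCents;
}

console.log(wouldExceedDailyLimit(0.0, 2, 1));   // true  — even the first reply exceeds a 1-cent cap
console.log(wouldExceedDailyLimit(0.10, 2, 50)); // false — 10c spent + 2c estimate stays under 50c
console.log(wouldExceedDailyLimit(0.49, 2, 50)); // true  — 49c + 2c crosses the 50c cap
console.log(wouldExceedDailyLimit(0.10, 2, 0));  // false — 0 means "no limit configured"
```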

Both functions are imported into rag.server.js and called at the very start of generateChatReply before any AI calls.


Use getEmbeddingsBatched (from app/lib/embeddings.server.js) instead of getEmbeddings. getEmbeddingsBatched returns { vectors, totalTokens, totalCost } so the embedding cost can be logged to openrouter_calls with purpose: "Customer Chat", consistent with how product ingestion logs embeddings as purpose: "Product Knowledge".

// In rag.server.js — step 6 (see full code in section 7)
const { vectors: [queryVector], totalTokens: embedTokens, totalCost: embedCost } =
  await getEmbeddingsBatched([message]);

Immediately after, create an openRouterCall row for the query embedding (section 7 shows this in context).


Create app/lib/vector-search.server.js:

import prisma from "../db.server.js";

export async function searchSimilarChunks({ shopId, queryVector, topK = 5 }) {
  // Use raw SQL because Prisma does not support pgvector operators
  const rows = await prisma.$queryRaw`
    SELECT
      kc.id,
      kc.chunk_text,
      kc.product_id,
      1 - (e.embedding <=> ${JSON.stringify(queryVector)}::vector) AS similarity
    FROM embeddings e
    JOIN knowledge_chunks kc ON kc.id = e.chunk_id
    WHERE kc.shop_id = ${shopId}::uuid
    ORDER BY e.embedding <=> ${JSON.stringify(queryVector)}::vector
    LIMIT ${topK}
  `;
  return rows;
}

The <=> operator is cosine distance (requires pgvector). Returns rows sorted by most-similar first.
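For intuition, the similarity column is plain cosine similarity: 1 minus the cosine distance pgvector computes. A small reference implementation for sanity-checking query results:

```javascript
// Cosine similarity — what `1 - (a <=> b)` evaluates to in pgvector.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([1, 0], [1, 0])); // 1 — identical direction
console.log(cosineSimilarity([1, 0], [0, 1])); // 0 — orthogonal
```

If retrieval slows down as catalogs grow, pgvector also supports approximate indexes (e.g. `CREATE INDEX ... USING hnsw (embedding vector_cosine_ops)`); whether one is worth adding is a later judgment call, not a Phase 3 requirement.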

Empty context fallback: If searchSimilarChunks returns 0 rows (e.g. no products ingested yet, or the query is completely unrelated), generateChatReply returns a canned message instead of making an LLM call. This prevents wasted spend and confusing generic replies (see section 7 for code).


Create app/lib/prompt-builder.server.js:

const SYSTEM_PROMPT = `You are a helpful AI shopping assistant for this Shopify store.
Answer customer questions using only the product information provided below.
If the answer is not in the context, say you're not sure and suggest they contact support.
Be concise, friendly, and helpful.`;

export function buildPrompt(chunks, message, history = []) {
  const context = chunks
    .map((c, i) => `[Product Info ${i + 1}]\n${c.chunk_text}`)
    .join("\n\n");
  const messages = [
    { role: "system", content: `${SYSTEM_PROMPT}\n\n--- Product Context ---\n${context}` },
    ...history.slice(-6), // last 3 exchanges (6 messages)
    { role: "user", content: message },
  ];
  return messages;
}
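As a quick sanity check of the array this produces (buildPrompt is repeated inline, with a shortened system prompt, so the snippet runs standalone):

```javascript
// Inline copy of buildPrompt with an abbreviated system prompt for the demo.
const SYSTEM_PROMPT = "You are a helpful AI shopping assistant for this Shopify store.";
function buildPrompt(chunks, message, history = []) {
  const context = chunks
    .map((c, i) => `[Product Info ${i + 1}]\n${c.chunk_text}`)
    .join("\n\n");
  return [
    { role: "system", content: `${SYSTEM_PROMPT}\n\n--- Product Context ---\n${context}` },
    ...history.slice(-6),
    { role: "user", content: message },
  ];
}

const messages = buildPrompt(
  [{ chunk_text: "Blue running shoes, sizes 7-12, $89." }],
  "What sizes do the blue shoes come in?",
  [
    { role: "user", content: "Do you have blue running shoes?" },
    { role: "assistant", content: "Yes! We carry blue running shoes." },
  ]
);
console.log(messages.length);  // 4: system + 2 history turns + current user turn
console.log(messages[0].role); // "system" — context chunks live inside this message
```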

History: Load the last N messages from chat_messages for the session so the LLM has conversation context.
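One subtlety when loading history: a findMany with `orderBy: { createdAt: "asc" }` plus `take` returns the *oldest* rows, not the latest. Fetch descending and then reverse to get the most recent N in chronological order. The query itself needs a live database, so only the post-processing is shown:

```javascript
// Rows as Prisma would return them with orderBy: { createdAt: "desc" }, take: 4
const newestFirst = [
  { role: "assistant", messageText: "They come in sizes 7-12." },
  { role: "user", messageText: "What sizes?" },
  { role: "assistant", messageText: "Yes, we carry them." },
  { role: "user", messageText: "Do you have blue shoes?" },
];

// Restore chronological order and map to the LLM message shape.
const historyForPrompt = newestFirst
  .slice()
  .reverse()
  .map((m) => ({ role: m.role, content: m.messageText }));

console.log(historyForPrompt[0].content); // "Do you have blue shoes?" — oldest first
```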


Create app/lib/chat.server.js.

Model names are read from .env so they can be changed without code deploys (set LLM_MODEL_FREE and LLM_MODEL_PAID).

const CHAT_API_URL = "https://openrouter.ai/api/v1/chat/completions";

// Read model names from .env with sensible defaults
function getPlanModels() {
  return {
    free: process.env.LLM_MODEL_FREE || "mistralai/mistral-small-3.2-24b:free",
    paid: process.env.LLM_MODEL_PAID || "openai/gpt-4o-mini",
  };
}

export async function callOpenRouterChat(messages, plan = "free") {
  const PLAN_MODELS = getPlanModels();
  const model = PLAN_MODELS[plan] ?? PLAN_MODELS.free;
  const res = await fetch(CHAT_API_URL, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
      "HTTP-Referer": process.env.SHOPIFY_APP_URL || "https://ai-chat.appifire.com",
      "X-Title": "AppiFire AI Chat",
    },
    body: JSON.stringify({
      model,
      messages,
      max_tokens: 512,
      temperature: 0.3,
      // Ask OpenRouter for usage accounting; without this, data.usage.cost is not returned.
      usage: { include: true },
    }),
  });
  if (!res.ok) {
    const err = await res.text();
    throw new Error(`OpenRouter chat failed: ${res.status} ${err}`);
  }
  const data = await res.json();
  const reply = data.choices?.[0]?.message?.content;
  if (!reply) {
    throw new Error("OpenRouter chat returned no message content");
  }
  const usage = data.usage; // { prompt_tokens, completion_tokens, total_tokens, cost }
  const cost = data.usage?.cost ?? 0; // cost in USD
  return { reply, model, usage, cost };
}
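The parsing at the end of callOpenRouterChat assumes the usual chat-completion payload shape. A worked example against a mock payload (the numbers are illustrative, not real OpenRouter output):

```javascript
// Mock of the fields callOpenRouterChat reads from the OpenRouter response.
const data = {
  choices: [{ message: { role: "assistant", content: "Yes! We carry blue running shoes." } }],
  usage: { prompt_tokens: 412, completion_tokens: 38, total_tokens: 450, cost: 0.00031 },
};

const reply = data.choices?.[0]?.message?.content;
const usage = data.usage;
const cost = data.usage?.cost ?? 0; // defaults to 0 if cost accounting was not requested

console.log(reply);              // "Yes! We carry blue running shoes."
console.log(usage.total_tokens); // 450
console.log(cost);               // 0.00031
```

If cost consistently comes back 0 in practice, check that the request enables OpenRouter's usage accounting; per their docs, cost is only included when it is requested.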

Create app/lib/rag.server.js — the main orchestrator.

Key additions vs. the original plan:

  • Both checkReplyLimit and checkDailySpendLimit are called before any AI work.
  • Query embedding uses getEmbeddingsBatched to capture cost/tokens.
  • A dedicated openRouterCall row is created for the query embedding (purpose: "Customer Chat"), in addition to the one for the chat completion.
  • If searchSimilarChunks returns 0 results, a canned fallback is returned without calling the LLM.
  • A // TODO Phase 5 stub marks where the low-balance email check will be wired in.
import prisma from "../db.server.js";
import { checkReplyLimit, checkDailySpendLimit } from "./limits.server.js";
import { getEmbeddingsBatched } from "./embeddings.server.js";
import { searchSimilarChunks } from "./vector-search.server.js";
import { buildPrompt } from "./prompt-builder.server.js";
import { callOpenRouterChat } from "./chat.server.js";

const EMBEDDING_MODEL = "openai/text-embedding-3-small";

// Read markup factors from .env (same pattern as ingestion.server.js)
function getChatMarkupFactor() {
  const v = process.env.CHAT_MARKUP_FACTOR;
  const n = v != null && v !== "" ? parseFloat(v) : 2;
  return Number.isFinite(n) && n > 0 ? n : 2;
}

function getEmbeddingMarkupFactor() {
  const v = process.env.EMBEDDING_MARKUP_FACTOR;
  const n = v != null && v !== "" ? parseFloat(v) : 2;
  return Number.isFinite(n) && n > 0 ? n : 2;
}

export async function generateChatReply({ shop, visitorId, sessionId, message }) {
  // 1. Enforce monthly reply limit
  await checkReplyLimit(prisma, shop.id, shop.plan, shop.addOnCredits ?? 0);

  // 2. Enforce daily spend limit (null = no limit; provided by Phase 4 settings)
  await checkDailySpendLimit(prisma, shop.id, shop.dailySpendLimitCents ?? null);

  // 3. Get or create chat session
  let session;
  if (sessionId) {
    session = await prisma.chatSession.findFirst({ where: { id: sessionId, shopId: shop.id } });
  }
  if (!session) {
    session = await prisma.chatSession.create({
      data: { shopId: shop.id, visitorId: visitorId || "anonymous", startedAt: new Date() },
    });
  }

  // 4. Load conversation history BEFORE saving the current message, so the
  //    prompt does not contain the user's question twice. Fetch the newest
  //    6 messages (3 exchanges) and restore chronological order.
  const history = await prisma.chatMessage.findMany({
    where: { sessionId: session.id },
    orderBy: { createdAt: "desc" },
    take: 6,
    select: { role: true, messageText: true },
  });
  const historyForPrompt = history
    .reverse()
    .map((m) => ({ role: m.role, content: m.messageText }));

  // 5. Save user message
  await prisma.chatMessage.create({
    data: { sessionId: session.id, role: "user", messageText: message },
  });

  // 6. Embed query — use getEmbeddingsBatched to capture cost/tokens for billing
  const { vectors: [queryVector], totalTokens: embedTokens, totalCost: embedCost } =
    await getEmbeddingsBatched([message]);

  // Log query embedding cost to openrouter_calls (counts towards daily spend limit)
  const embedMarkup = getEmbeddingMarkupFactor();
  await prisma.openRouterCall.create({
    data: {
      shopId: shop.id,
      endpoint: "embeddings",
      model: EMBEDDING_MODEL,
      openrouterCost: embedCost,
      chargedCost: embedCost * embedMarkup,
      markupFactor: embedMarkup,
      tokens: embedTokens || null,
      purpose: "Customer Chat",
    },
  });

  // 7. Vector search
  const chunks = await searchSimilarChunks({ shopId: shop.id, queryVector, topK: 5 });

  // 8. Empty-context fallback — avoid an LLM call when no knowledge exists for this query
  if (chunks.length === 0) {
    return {
      reply: "I don't have enough information to answer that. Please contact the store's support team.",
      sessionId: session.id,
    };
  }

  // 9. Build prompt and call LLM
  const promptMessages = buildPrompt(chunks, message, historyForPrompt);
  const { reply, model, usage, cost: openrouterCost } = await callOpenRouterChat(promptMessages, shop.plan);
  const chatMarkup = getChatMarkupFactor();
  const chargedCost = openrouterCost * chatMarkup;

  // 10. Save assistant message (with cost fields)
  const assistantMessage = await prisma.chatMessage.create({
    data: {
      sessionId: session.id,
      role: "assistant",
      messageText: reply,
      modelUsed: model,
      openrouterCost,
      chargedCost,
      markupFactor: chatMarkup,
    },
  });

  // 11. Save retrieved chunks (audit trail — shows which context drove this reply)
  await prisma.retrievedChunk.createMany({
    data: chunks.map((c, i) => ({
      chatMessageId: assistantMessage.id,
      chunkId: c.id,
      similarityScore: c.similarity,
      rank: i + 1,
    })),
  });

  // 12. Log chat LLM call to openrouter_calls
  await prisma.openRouterCall.create({
    data: {
      shopId: shop.id,
      chatMessageId: assistantMessage.id,
      endpoint: "chat",
      model,
      openrouterCost,
      chargedCost,
      markupFactor: chatMarkup,
      tokens: usage?.total_tokens ?? null,
      purpose: "Customer Chat",
    },
  });

  // TODO Phase 5: check shop.emailAlertOnMinBalance and send low-balance alert if needed
  return { reply, sessionId: session.id };
}
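As a sanity check on the markup arithmetic: each reply is billed at raw provider cost times the env-driven markup factor, summed across the embedding call and the chat call. The helper name below is hypothetical, shown only to make the arithmetic concrete:

```javascript
// Per-reply charge under default markups: one embedding call plus one chat call.
function replyCharge({ embedCost, chatCost }, embedMarkup = 2, chatMarkup = 2) {
  return embedCost * embedMarkup + chatCost * chatMarkup;
}

// e.g. $0.000002 embedding + $0.00031 chat completion, both marked up 2x
const totalCharged = replyCharge({ embedCost: 0.000002, chatCost: 0.00031 });
console.log(totalCharged.toFixed(6)); // "0.000624"
```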

All cost fields on ChatMessage were added in Phase 1 and already exist in prisma/schema.prisma. No new migration is needed for Phase 3.

model ChatMessage {
  // ... existing fields ...
  modelUsed      String?  @map("model_used")
  openrouterCost Decimal? @map("openrouter_cost") @db.Decimal(12, 6)
  chargedCost    Decimal? @map("charged_cost") @db.Decimal(12, 6)
  markupFactor   Decimal? @map("markup_factor") @db.Decimal(5, 2)
}

Verify with:

npx prisma studio

Open ChatMessage and confirm model_used, openrouter_cost, charged_cost, markup_factor columns exist.


Phase 3 adds two new .env keys. Add them to .env.sample (and your real .env):

# LLM model per plan (change without code deploy)
LLM_MODEL_FREE=mistralai/mistral-small-3.2-24b:free
LLM_MODEL_PAID=openai/gpt-4o-mini

CHAT_MARKUP_FACTOR and EMBEDDING_MARKUP_FACTOR were already added in Phase 2.


Files created in this phase:

  • app/routes/api.chat.jsx: public POST /api/chat endpoint (input validation, shop lookup, error handling)
  • app/lib/cors.server.js: CORS headers for storefront requests
  • app/lib/limits.server.js: checkReplyLimit (monthly reply cap) and checkDailySpendLimit (daily spend cap)
  • app/lib/vector-search.server.js: searchSimilarChunks, pgvector cosine search via raw SQL
  • app/lib/prompt-builder.server.js: buildPrompt, system + context + history
  • app/lib/chat.server.js: callOpenRouterChat, LLM API call with model names read from .env
  • app/lib/rag.server.js: generateChatReply, the full RAG orchestrator

Start the server with npm run dev (or use your deployed URL).

curl -X POST https://ai-chat.appifire.com/api/chat \
-H "Content-Type: application/json" \
-d '{
"shopDomain": "your-store.myshopify.com",
"visitorId": "test-visitor-1",
"message": "Do you have any blue running shoes?"
}'

Expected:

{ "reply": "Yes! We carry ...", "sessionId": "uuid-of-session" }

Pass the sessionId from the first response to keep conversation history:

curl -X POST https://ai-chat.appifire.com/api/chat \
-H "Content-Type: application/json" \
-d '{
"shopDomain": "your-store.myshopify.com",
"visitorId": "test-visitor-1",
"sessionId": "uuid-of-session",
"message": "What sizes do they come in?"
}'

Expected: the reply references the previous shoe question.

Set the shop to plan: "free" in DB and manually insert 50 role: "assistant" rows for this month, then send another message. Expected HTTP 429 with "Monthly reply limit reached.".

Set daily_spend_limit_cents = 1 on the shop row (= $0.01), then send a message. Expected HTTP 429 with "Daily spend limit reached.".

curl -X POST https://ai-chat.appifire.com/api/chat \
-H "Content-Type: application/json" \
-d "{\"shopDomain\":\"your-store.myshopify.com\",\"message\":\"$(python3 -c 'print(\"a\"*2001)')\"}"

Expected HTTP 400 with "Message too long (max 2000 chars)".

Delete all embeddings rows for the shop (or test before any sync), then send a message. Expected: canned fallback reply — no LLM call made (verify in openrouter_calls: only one row for the query embedding, no endpoint: "chat" row).

-- One chat session
SELECT * FROM chat_sessions WHERE shop_id = '<uuid>';
-- User + assistant messages (assistant has cost fields set)
SELECT role, LEFT(message_text, 60), model_used, openrouter_cost, charged_cost
FROM chat_messages
WHERE session_id = '<session-uuid>'
ORDER BY created_at;
-- Retrieved chunks that drove the reply
SELECT rc.rank, rc.similarity_score, LEFT(kc.chunk_text, 80)
FROM retrieved_chunks rc
JOIN knowledge_chunks kc ON kc.id = rc.chunk_id
WHERE rc.chat_message_id = '<assistant-message-uuid>'
ORDER BY rc.rank;
-- Two openrouter_calls per reply: one "embeddings" + one "chat", both purpose = "Customer Chat"
SELECT endpoint, model, openrouter_cost, charged_cost, tokens, purpose
FROM openrouter_calls
WHERE shop_id = '<uuid>'
ORDER BY created_at DESC
LIMIT 5;

  • app/routes/api.chat.jsx created and accessible at /api/chat; GET returns {"status":"ok"}.
  • Input validation: missing fields return 400; message > 2000 chars returns 400.
  • CORS headers present on all responses (check with curl -I or browser console).
  • app/lib/limits.server.js created with checkReplyLimit and checkDailySpendLimit.
  • Both limits called at start of generateChatReply; over-limit returns 429.
  • app/lib/vector-search.server.js created; searchSimilarChunks returns rows filtered by shop_id.
  • Empty context (0 chunks) returns canned fallback without creating an openrouter_calls chat row.
  • app/lib/prompt-builder.server.js created; prompt includes system message, context chunks, and history.
  • app/lib/chat.server.js created; model names read from LLM_MODEL_FREE / LLM_MODEL_PAID in .env.
  • app/lib/rag.server.js created; full flow runs end-to-end.
  • Query embedding logged to openrouter_calls with purpose: "Customer Chat", endpoint: "embeddings".
  • Chat LLM call logged to openrouter_calls with purpose: "Customer Chat", endpoint: "chat", linked to chatMessageId.
  • chat_sessions, chat_messages (user + assistant), retrieved_chunks, openrouter_calls (×2) all written to DB per reply.
  • chatMessage.openrouterCost, chargedCost, markupFactor populated on assistant messages.
  • Session continuation works: passing sessionId from first reply keeps conversation history in prompt.
  • curl basic test returns a relevant product reply.
  • curl with over-limit shop returns 429 REPLY_LIMIT_EXCEEDED.
  • curl with daily_spend_limit_cents = 1 returns 429 DAILY_SPEND_LIMIT_EXCEEDED.
  • LLM_MODEL_FREE, LLM_MODEL_PAID, CHAT_MARKUP_FACTOR, EMBEDDING_MARKUP_FACTOR set in .env.

Next: Option-A-Phase-4-Admin-Settings-and-Chat-Widget.md