AppiFire AI Chat — Product Plan & Architecture Options (A, B, C)
This document is the single plan for building the AppiFire AI Chat Shopify app. It summarizes the product, requirements, and three architecture options for RAG + vector storage and LLM, so you can choose which path to follow.
1. Product Summary
What You Are Building
Section titled “What You Are Building”- AppiFire AI Chat — a Shopify app that merchants install from the Shopify App Store. Marketing site: appifire.com; app backend: ai-chat.appifire.com (one subdomain per app).
- Free (50 replies/month) and paid ($10/month subscription = $10 AI credits ≈ 500 messages; users can purchase add-on credits). Pricing is per message reply, not per product (see Section 8).
- Storefront: A chat widget on the merchant’s store (Theme App Extension) where shoppers ask questions and get AI answers based on that store’s catalog and knowledge.
- Shopify Admin: A configuration screen inside the Shopify admin panel where the merchant sets:
- Chat widget behavior (position, style, welcome message).
- Which knowledge sources to use (products only, or + FAQs, policies, etc.).
- Optional: model choice, rate limits, or feature toggles per plan.
- RAG (Retrieval-Augmented Generation):
- Ingest store data (products, optional FAQs/policies) into a vector store.
- At query time: embed the question → similarity search → retrieve relevant chunks → send to an LLM with context → return answer.
- LLM: You prefer using OpenRouter.ai for the chat model (access to multiple models, free and paid, single API).
- Vector storage: You want to evaluate building vectors on your server (if cheaper) vs using an OpenAI/cloud vector solution — hence the three options below.
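The RAG flow above (embed the question → similarity search → retrieve chunks → LLM with context) can be sketched with toy in-memory vectors. `cosine` and `retrieve` are illustrative helpers, not part of any plan's stack; real embeddings (e.g. text-embedding-3-small) are 1536-dimensional and the search runs in the vector store:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve(question_vec: list[float], chunks: list[dict], top_k: int = 2) -> list[dict]:
    """Rank stored chunks by similarity to the question embedding, keep the best top_k."""
    ranked = sorted(chunks, key=lambda c: cosine(question_vec, c["embedding"]), reverse=True)
    return ranked[:top_k]

# Toy 3-dimensional vectors for illustration only.
chunks = [
    {"text": "Red T-shirt, 100% cotton, $19", "embedding": [0.9, 0.1, 0.0]},
    {"text": "Returns accepted within 30 days", "embedding": [0.0, 0.2, 0.9]},
]
question_vec = [0.8, 0.2, 0.1]  # pretend embedding of "do you have cotton shirts?"

context = "\n".join(c["text"] for c in retrieve(question_vec, chunks, top_k=1))
prompt = f"Answer using only this store data:\n{context}\n\nQuestion: do you have cotton shirts?"
```

The same three steps apply in every plan; only where `retrieve` runs (pgvector, managed DB, or OpenAI retrieval) differs.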
References (from docs/my-rnd)
| Topic | Document |
|---|---|
| High-level dev plan, roadmap, tech stack | my-rnd/readme.md |
| Shopify setup, CLI, Theme Extension, webhooks, scopes | my-rnd/Setup Shopify & App.md |
| Multi-tenant RAG schema (shops, products, chunks, embeddings, chat) | Option-A-Phase-01-Design database schema for RAG.md |
| Free/paid tiers, positioning, GTM | my-rnd/Marketing Plan.md |
| Model comparison (cost tiers, OpenRouter free models) | my-rnd/AI Chat Bot Comparions.md |
| Competitors & Shopify docs links | my-rnd/Docs & Competitors.md |
2. Requirements (Recap)
- Distribution: Install from Shopify App Store (public app).
- Monetization: Free tier + paid tier(s); configuration and/or limits can differ by plan.
- Admin UX: Configuration screen in Shopify admin (embedded app, App Bridge).
- RAG: OpenAI-compatible embeddings + vector search; context injected into prompts.
- LLM: Prefer OpenRouter.ai for the chat model (flexibility, free/paid models).
- Vector: Choose between self-hosted vectors on your server (if cheaper) vs managed/OpenAI vector — options below.
3. Plan Options (A, B, C)
Three ways to combine vector storage and embeddings with OpenRouter as the LLM. Your backend always does product sync, chunking, and RAG orchestration; only where vectors live and how embeddings/retrieval are called change.
Plan A — Self-Hosted Vectors (Lowest Cost, Most Control)
Idea: Store embeddings on your own infrastructure. Use OpenRouter for both embeddings and chat so all AI billing is through OpenRouter; vectors live in your DB.
| Component | Choice |
|---|---|
| Vector storage | pgvector (PostgreSQL extension) on your server or managed Postgres (e.g. Supabase, Neon, RDS). Same DB can hold shops, products, knowledge_chunks from Option-A-Phase-01-Design database schema for RAG.md. |
| Embeddings | OpenRouter.ai — e.g. openai/text-embedding-3-small or openai/text-embedding-3-large via OpenRouter’s embeddings API (POST /api/v1/embeddings). Same API shape as OpenAI; one bill with chat. |
| LLM (chat) | OpenRouter.ai — e.g. free/low-cost model for free tier (e.g. x-ai/grok-4.1-fast:free, mistralai/mistral-small-3.2-24B:free), and a better model for paid (e.g. Claude Haiku, GPT-4o-mini). |
| Backend | Single backend (Node/Remix or FastAPI) on your server: OAuth, product sync, webhooks, chunking, OpenRouter embeddings for ingestion, vector search (pgvector), and OpenRouter chat API. |
Pros
- Cheapest at scale: No per-embedding or per-index fee; you pay for DB and compute only.
- Full control: Indexing, re-embedding on product update, and tenant isolation (filter by shop_id) all live in your schema.
- Aligns with existing schema: Option-A-Phase-01-Design database schema for RAG.md already has an embeddings table with vector(1536) and an ivfflat index — fits pgvector.
Cons
- You operate and scale the database (or use managed Postgres).
- You implement incremental re-embedding and index tuning (e.g. ivfflat lists).
Best for: You want minimal ongoing cost and are fine managing a Postgres DB. Best long-term if you have many stores and large catalogs.
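Under Plan A, the per-tenant similarity query could look like the following parameterized SQL. Table and column names (knowledge_chunks, embeddings, chunk_id) are assumptions based on the schema document; `<=>` is pgvector's cosine-distance operator:

```python
# Hypothetical Plan A retrieval query (psycopg-style named parameters).
# Filtering by shop_id before ordering enforces tenant isolation in one place.
PGVECTOR_SEARCH_SQL = """
SELECT kc.id, kc.content
FROM knowledge_chunks kc
JOIN embeddings e ON e.chunk_id = kc.id
WHERE kc.shop_id = %(shop_id)s                  -- tenant isolation
ORDER BY e.vector <=> %(query_vec)s::vector      -- pgvector cosine distance
LIMIT %(limit)s;
"""
```

The backend embeds the shopper's question via OpenRouter, then runs this query with the resulting vector as `query_vec`.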
Plan B — Managed Vector DB (Balanced: No Vector Ops)
Idea: Use a managed vector database (Pinecone, Weaviate Cloud, or similar). Use OpenRouter for both embeddings and chat so all AI billing is through OpenRouter.
| Component | Choice |
|---|---|
| Vector storage | Pinecone or Weaviate Cloud (or Qdrant, Zilliz). One index per tenant or one index with shop_id in metadata; filter by shop_id on query. |
| Embeddings | OpenRouter.ai — same as A: e.g. openai/text-embedding-3-small via OpenRouter’s embeddings API. One bill with chat. |
| LLM (chat) | OpenRouter.ai — same strategy as A (free model for free tier, better model for paid). |
| Backend | Your server: product sync, chunking, OpenRouter embeddings, upsert to vector DB; at query time: vector search in managed DB, then build context and call OpenRouter chat. Relational DB (Postgres) for shops, products, knowledge_chunks metadata, chat sessions (no vector column). |
Pros
- No vector DB to run or tune; managed scaling and availability.
- Fast to ship; many providers have free tiers for early stage.
Cons
- Cost grows with vectors and queries (per index, per query, or per pod).
- Less control over index type and exact consistency model.
Best for: You want to avoid any vector DB ops and are okay with per-usage cost. Good for MVP and early traction.
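With the single-shared-index approach in Plan B, upsert and query payloads might look like the following. Field names are Pinecone-style assumptions, not any specific vendor's exact API; the point is that shop_id travels as metadata and is applied as a filter at query time:

```python
# Hypothetical request payloads for a managed vector index shared by all tenants.
def build_upsert(chunk_id: str, shop_id: str, embedding: list[float]) -> dict:
    """One vector per chunk; shop_id stored as metadata for later filtering."""
    return {"id": chunk_id, "values": embedding, "metadata": {"shop_id": shop_id}}

def build_query(embedding: list[float], shop_id: str, top_k: int = 5) -> dict:
    """Query restricted to one tenant via a metadata filter."""
    return {
        "vector": embedding,
        "top_k": top_k,
        "filter": {"shop_id": {"$eq": shop_id}},  # tenant isolation at query time
        "include_metadata": True,
    }
```

The alternative (one index per tenant) avoids the filter but multiplies index cost on most providers.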
Plan C — OpenAI-Centric (Fastest to Ship, Least Control)
Idea: Use OpenAI for both embeddings and vector storage (e.g. Assistants API with file-based retrieval, or embeddings + a very simple store). LLM for chat remains OpenRouter as requested.
| Component | Choice |
|---|---|
| Vector storage | OpenAI Assistants API “file” + vector store: upload per-store documents (or pre-built context files), use built-in retrieval. Or: OpenAI embeddings + OpenAI’s own vector store (if/when offered as a standalone product). |
| Embeddings | Handled inside OpenAI (Assistants ingest and embed). |
| LLM (chat) | OpenRouter.ai — you call OpenRouter for the final reply; context can be assembled from Assistants retrieval results, or you use embeddings + your own small store and only use OpenAI for embeddings. |
C variant (hybrid): Use OpenAI embeddings + pgvector (like A) but keep the rest of the stack simple (single backend + OpenRouter). Then “Plan C” is “minimal moving parts” rather than “Assistants-only.”
Pros
- Fastest path if you fully adopt Assistants (less code for ingestion and retrieval).
- Single vendor for embeddings + retrieval (if you use Assistants end-to-end for RAG).
Cons
- Assistants API is not designed for multi-tenant SaaS (per-store isolation and per-store re-sync need careful design).
- Cost can be high (files + retrieval); less control over chunking and indexing.
- You still want OpenRouter for the LLM, so you’d “retrieve with OpenAI, answer with OpenRouter” — two vendors anyway.
Best for: Quick prototype or if you explicitly want to try Assistants API. For a multi-tenant, plan-based product, A or B is usually better.
4. Recommendation Summary
| If you want… | Prefer |
|---|---|
| Lowest cost and full control | Plan A — pgvector on your server (or managed Postgres), embeddings via OpenRouter, OpenRouter for chat. |
| No vector DB ops, faster launch | Plan B — Managed vector DB (Pinecone/Weaviate), embeddings via OpenRouter, OpenRouter for chat. |
| Fastest experiment with OpenAI retrieval | Plan C — Only if you’re okay with multi-tenant and cost constraints of Assistants; otherwise use A or B and keep OpenRouter for the model. |
Practical suggestion: Start with Plan A (pgvector) if you already plan to run Postgres for shops/products and the RAG schema — it keeps cost low and matches your existing schema. If you prefer zero vector DB management at the beginning, choose Plan B and migrate to Plan A later if cost becomes an issue.
5. Infrastructure: RAM, CPU, and Will Plan A Slow Down the Server?
Plan A (pgvector on your server): what you need
Your server runs Postgres + pgvector and your app backend (Node/Remix or FastAPI). Vector search (similarity over embeddings) runs inside Postgres and uses CPU and RAM.
| Scale (rough) | RAM | CPU | Notes |
|---|---|---|---|
| Small (MVP) | 2–4 GB | 2 vCPU | Few stores, < ~50K chunks total. Postgres + app on one box or small VPS (e.g. 2 GB RAM, 1–2 vCPU). |
| Medium | 4–8 GB | 2–4 vCPU | Dozens of stores, hundreds of thousands of chunks. Dedicated or larger Postgres; consider separating DB and app if needed. |
| Large | 8–16+ GB | 4+ vCPU | Many stores, millions of chunks. Postgres tuned for memory (shared_buffers, work_mem); app on separate instance. |
- Vector index (ivfflat): Build time and search use CPU; search is approximate so it stays relatively fast. HNSW (if your pgvector version supports it) often gives better query speed and can reduce CPU per query.
- Rule of thumb: Reserve ~1–2 GB RAM for Postgres at minimum; the rest for OS and app. More vectors → more index in memory for fast search, so scale RAM as total vectors grow.
Will Plan A slow down the server?
- It can, if the box is undersized or shared. Vector similarity search is more CPU- and memory-intensive than simple key lookups. Many concurrent searches or a large index with too little RAM will increase latency and can slow other work (e.g. app requests, webhooks) if everything runs on the same machine.
- It doesn’t have to. With proper sizing:
- Use an index (ivfflat or HNSW) so you’re not doing a full table scan.
- Give Postgres enough RAM (and tune work_mem for sorts/index builds).
- Run vector search only when needed (per chat message), not on every request.
- Best practice: Put Postgres (with pgvector) on its own instance (or managed Postgres), and run the app on a separate instance. Then vector search load doesn’t compete with app logic; you scale DB and app independently. Many managed Postgres providers (Supabase, Neon, AWS RDS, etc.) support pgvector — you don’t have to run Postgres on the same box as the app.
So: Plan A can slow down the server if you run DB + app on one small machine and grow usage. With separate DB and app and adequate RAM/CPU, it’s fine.
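For a dedicated Postgres box in the medium tier above, the memory settings mentioned (shared_buffers, work_mem) might start out like this; the values are illustrative starting points for a 4 GB machine, not tuned recommendations:

```ini
# postgresql.conf sketch for a dedicated 4 GB Postgres instance running pgvector
shared_buffers = 1GB           # ~25% of RAM is a common starting point
work_mem = 32MB                # per sort/hash operation; affects vector query sorts
maintenance_work_mem = 256MB   # speeds up CREATE INDEX on the embeddings table
```

Re-check these whenever the vector count grows; index builds in particular benefit from more maintenance_work_mem.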
Plan B (managed vector DB): what you need
Your server runs only the app + relational Postgres (shops, products, sessions — no vectors). The vector store is Pinecone, Weaviate Cloud, etc., so vector CPU/RAM live on their side.
| Scale (rough) | Your server RAM | Your server CPU | Notes |
|---|---|---|---|
| Small (MVP) | 1–2 GB | 1–2 vCPU | App + small Postgres only; vector traffic is HTTP to managed service. |
| Medium / Large | 2–4+ GB | 2+ vCPU | Scale for app and relational DB only; vector load is handled by the managed DB. |
- Your server stays light: No vector index, no embedding storage. You only pay the managed vector DB for storage and queries (their pricing), and your app stays responsive.
Plan A vs Plan B: infrastructure pros and cons
| | Plan A (pgvector on your side) | Plan B (managed vector DB) |
|---|---|---|
| RAM / CPU on your server | Higher. Postgres + pgvector + app need 2–4+ GB RAM and 2+ vCPU for real use; more as vectors grow. | Lower. App + relational DB only; 1–2 GB RAM and 1–2 vCPU can be enough for MVP. |
| Can it slow down the server? | Yes, if undersized or shared. Vector search is heavier than normal CRUD. Mitigate by: separate DB instance, enough RAM, proper index. | No. Vector work runs in the managed service; your server only does app logic and API calls. |
| Ops | You (or managed Postgres) handle backups, scaling, index tuning, pgvector upgrades. | No vector DB ops; provider handles scaling and availability. |
| Cost | Cheaper at scale: you pay for one (or two) servers or managed Postgres; no per-query or per-vector fee. | Usage-based: cost grows with vectors and queries (e.g. per pod, per 1K queries). |
| Scaling | Scale by upgrading DB instance (more RAM/CPU) or read replicas; you manage it. | Scale by provider plan or auto-scaling; you just use the API. |
| Latency | Vector search on same region as app is typically < 50–100 ms; no extra network hop to external vector DB. | One extra network call to Pinecone/Weaviate; usually still < 100–200 ms if same region. |
Summary: Plan A needs more RAM/CPU on your side and can slow the server if the DB is undersized or shared with the app. Plan B keeps your server light (no vector load) but costs more as usage grows. For Plan A, avoid slowdown by sizing properly and, when possible, running Postgres (with pgvector) on a separate instance from the app.
Switching between Plan A and Plan B (you can go either way)
Yes — you can switch in both directions.
| Direction | What you do | Effort |
|---|---|---|
| A → B | Export embeddings (and chunk IDs / metadata) from pgvector, then upsert them into Pinecone/Weaviate. Same vectors, same dimensions; only the storage and query API change. Point your backend at the managed vector API instead of Postgres. Relational data (shops, products, knowledge_chunks) can stay in Postgres. | One-time migration script + swap the vector client in code. |
| B → A | Export vectors (and IDs) from the managed DB (most offer export or a read API), insert into your embeddings table in pgvector, build the index (ivfflat/HNSW). Point your backend at pgvector instead of the managed API. | One-time migration script + swap the vector client in code. |
- Embeddings are reusable: You do not re-call the embedding API for the migration. You move the same vectors from one store to the other; dimension (e.g. 1536) and chunk IDs stay the same.
- Design for it: Add a vector store abstraction in your backend (e.g. VectorStore.upsert(chunks) and VectorStore.search(embedding, shopId, limit)). Implement it once for pgvector and once for Pinecone/Weaviate. Then switching is: run a one-time data copy + change config to use the other implementation. No rewrite of app logic.
So: either option gives you the option to go to the other later. Start with A or B; if cost or ops change, you can migrate with a one-time move of vectors and a swap of the vector-store implementation.
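The vector store abstraction described above could be sketched like this; the method names mirror the `VectorStore.upsert` / `VectorStore.search` examples in the text, and the in-memory implementation is a toy stand-in for real pgvector and Pinecone/Weaviate clients:

```python
from typing import Protocol

class VectorStore(Protocol):
    """Implement once for pgvector, once for the managed DB; app logic stays the same."""
    def upsert(self, chunks: list[dict]) -> None: ...
    def search(self, embedding: list[float], shop_id: str, limit: int) -> list[dict]: ...

class InMemoryVectorStore:
    """Toy implementation used here only to show the interface in action."""
    def __init__(self) -> None:
        self.chunks: list[dict] = []

    def upsert(self, chunks: list[dict]) -> None:
        self.chunks.extend(chunks)

    def search(self, embedding: list[float], shop_id: str, limit: int) -> list[dict]:
        def dot(v: list[float]) -> float:
            return sum(a * b for a, b in zip(embedding, v))
        mine = [c for c in self.chunks if c["shop_id"] == shop_id]  # tenant filter
        return sorted(mine, key=lambda c: dot(c["embedding"]), reverse=True)[:limit]

store: VectorStore = InMemoryVectorStore()
store.upsert([
    {"id": "c1", "shop_id": "s1", "embedding": [1.0, 0.0]},
    {"id": "c2", "shop_id": "s2", "embedding": [1.0, 0.0]},
])
```

Switching plans then means copying vectors once and pointing `store` at the other implementation via config.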
6. What Stays the Same in All Plans
- Shopify app: Public app, install from App Store; Theme App Extension for storefront chat widget; embedded admin app for configuration screen.
- Data model: Multi-tenant (per-shop), product sync via Admin API + webhooks, chunk-based RAG, audit trail (e.g. retrieved_chunks) as in Option-A-Phase-01-Design database schema for RAG.md.
- LLM: OpenRouter.ai for chat (model choice by plan: free vs paid models as in my-rnd/AI Chat Bot Comparions.md).
- Free vs paid: Enforced in your backend by reply count (included replies per month, then per-reply charge or upgrade). No product-based pricing; optional product limits for scope only (e.g. free tier catalog size) if you want.
7. API Key Strategy: Your Key vs Customer Key
Option 1: You use your own OpenAI / OpenRouter key (recommended)
- How it works: Your backend holds one (or a few) API keys. Every store’s chat and embeddings go through your server; you call OpenAI (embeddings) and OpenRouter (LLM) with your key.
- Pros
- Zero friction: Merchant installs the app and it works. No sign-up for OpenAI/OpenRouter, no key to copy.
- Control: You choose models per plan (e.g. free model for free tier, better model for paid), set rate limits, and keep quality consistent.
- Standard for Shopify apps: Most chat/AI apps (e.g. Tidio, Chatty) work this way; merchants expect “install and go.”
- Easier support: When something breaks, it’s your stack, not “your OpenAI account hit a limit.”
- Cons
- You pay for all usage. You must price to cover cost + margin (see pricing model below).
- You need usage monitoring, caps per plan, and optional overage handling.
Use this as the default: Merchants get AI included; you absorb cost and recover it via subscription (and optional usage tiers).
Option 2: Customer adds their own key (BYOK — Bring Your Own Key)
- How it works: In the admin config screen, merchant optionally enters their OpenAI or OpenRouter API key. Your backend uses their key for that store’s requests (embeddings and/or chat).
- Pros
- You don’t pay for AI usage for those stores; no usage-based cost risk.
- Some power users or cost-sensitive merchants prefer to use their own key and billing.
- Cons
- Friction: Many merchants don’t want to create an OpenAI/OpenRouter account or manage keys; installs drop.
- Support: “Chat stopped working” often means their key expired, hit limits, or ran out of credits; you’ll get more support load.
- Inconsistent experience: Different keys → different rate limits and reliability; harder to guarantee quality.
Use BYOK only as an optional advanced setting (e.g. “Use my own OpenRouter key”), not as the only way to use the app.
Recommendation
- Primary: Use your OpenAI and OpenRouter keys for all stores. Price plans so that subscription revenue covers embedding + LLM cost + margin.
- Optional: Offer “Bring your own key” in admin for stores that want to use their own OpenRouter (and optionally OpenAI) key; treat it as a power-user option and still charge a (lower) app subscription for the product value (widget, RAG pipeline, admin, analytics).
Use OpenRouter for embeddings too (single bill, one vendor)
- OpenRouter supports embeddings. Use their API (POST https://openrouter.ai/api/v1/embeddings) with models like openai/text-embedding-3-small so all AI spend (embeddings + chat) goes through OpenRouter. One invoice, one place to manage credits and limits.
- Per-client cost: OpenRouter does not split the invoice by client, but every response includes cost and token usage. In your backend, when you call OpenRouter (embeddings or chat), pass shop_id in your own context and log (shop_id, cost, tokens) from the response into your DB (e.g. a usage or openrouter_calls table). Then you know exactly what each store cost you for billing and margin analysis, even though you pay OpenRouter as one bill.
- Plan A and Plan B in this doc are updated to use OpenRouter for both embeddings and chat.
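The per-shop cost logging described above might look like this. The exact shape of OpenRouter's `usage` object (and whether dollar cost is returned directly) should be verified against their current docs; the field names here are assumptions, and the in-memory list stands in for a usage table:

```python
def log_openrouter_usage(usage_table: list, shop_id: str, response: dict) -> None:
    """Append one row per OpenRouter call so AI spend can be attributed per store.
    Assumed response shape: {"model": ..., "usage": {"prompt_tokens": ...,
    "completion_tokens": ..., "cost": ...}} — check the current OpenRouter docs."""
    usage = response.get("usage", {})
    usage_table.append({
        "shop_id": shop_id,
        "model": response.get("model"),
        "prompt_tokens": usage.get("prompt_tokens", 0),
        "completion_tokens": usage.get("completion_tokens", 0),
        "cost_usd": usage.get("cost", 0.0),
    })

def cost_for_shop(usage_table: list, shop_id: str) -> float:
    """Total attributed cost for one store, even though OpenRouter bills you once."""
    return sum(row["cost_usd"] for row in usage_table if row["shop_id"] == shop_id)

usage_table: list = []
fake_response = {"model": "openai/gpt-4o-mini",
                 "usage": {"prompt_tokens": 1200, "completion_tokens": 300, "cost": 0.0004}}
log_openrouter_usage(usage_table, "shop-1", fake_response)
```

In production the rows would go into the usage / openrouter_calls table mentioned above rather than a list.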
8. Pricing Model (When You Use Your Key)
You are not charging for products. You are charging per message reply — i.e. per AI reply sent to the end user in the chat. Each time your backend returns one assistant message to a store visitor, that counts as one billable unit for the merchant.
Because you pay for embeddings (amortized) and LLM usage per reply, your pricing should:
- Cover cost: Your cost per reply = embedding lookup (negligible once indexed) + LLM (input + output tokens). Price per reply above that cost + margin.
- Leave margin: After cost per reply, leave room for hosting, support, and profit.
- Keep it clear for merchants: “Pay per AI reply” is easy to explain; you can combine with a free tier (e.g. N replies included per month) and/or subscription + per-reply.
Chosen structure: free → subscription + add-on credits
| Tier | How it works | Example |
|---|---|---|
| Free | 50 replies per month included; use a free/cheap OpenRouter model. After 50, prompt to subscribe. | Free: 50 replies/month, then subscribe or buy add-on credits. |
| Paid (subscription) | $10/month = $10 AI credits (≈500 messages at $0.02/message). Users can purchase add-on AI credits when they need more. | $10/mo includes ~500 messages; add-on credit packs (e.g. one-time charge) for extra. |
| Add-on credits | Optional one-time charges for extra AI credits (e.g. “500 more messages — $10”), or usage-based overage on the subscription. | Sell credit packs; track balance and deduct per reply. |
How to size the numbers
- Your cost per reply:
- LLM: (input tokens + output tokens) × (OpenRouter $ per 1K tokens). Input = system prompt + RAG context + user message; output = one assistant reply.
- Embeddings: already done at ingest; retrieval is cheap. Amortize embedding cost across replies (no direct “per reply” embedding charge).
- Price per reply: Set so (price per reply − cost per reply) gives you margin (e.g. 2–3× cost or 50–70% margin). Example: if your cost is ~$0.005 per reply, charge $0.01–0.02 per reply.
- Free tier: Give a small number of included replies (e.g. 50/month) and use a free or very cheap model so free users don’t burn your margin.
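A worked example of the sizing math above. The token counts and per-1K prices are illustrative placeholders, not real OpenRouter rates; plug in actual model pricing before setting the final per-reply price:

```python
def cost_per_reply(input_tokens: int, output_tokens: int,
                   in_price_per_1k: float, out_price_per_1k: float) -> float:
    """LLM cost of one assistant reply; embedding lookup cost is amortized and ignored."""
    return (input_tokens / 1000) * in_price_per_1k + (output_tokens / 1000) * out_price_per_1k

# Illustrative: 1,200 input tokens (system prompt + RAG context + question),
# 300 output tokens, $0.00015 / 1K input and $0.0006 / 1K output.
cost = cost_per_reply(1200, 300, 0.00015, 0.0006)   # 0.00036 USD per reply
price = 0.02                                         # charged per reply on the paid plan
margin = (price - cost) / price                      # fraction of the price kept as margin
```

At these assumed rates one reply costs well under a cent, so the $0.02 price leaves a wide margin even after hosting and support.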
Implementation notes
- Count “replies” only: One billable unit = one assistant message returned to the user (do not charge for the user’s message). Store reply count per shop (e.g. in shops or a usage table) and increment on each successful AI response.
- Shopify Billing API: For usage-based billing you can use recurring application charges and report usage (e.g. usage-based billing if available in your region), or sell prepaid reply packs (one-time or recurring charge for “500 replies”) and decrement as they’re used. Check Shopify’s current docs for usage-based and metered billing.
- Admin: Show “Replies used this month” and “Remaining replies” (or “Pay per reply: X used”) in the config screen so merchants see usage clearly.
Summary
- Use your key as the default; optional BYOK for advanced users.
- Pricing model: Free = 50 replies/month. Paid = $10/month subscription (includes $10 AI credits ≈ 500 messages at $0.02/message). Users can purchase add-on AI credits as needed. Product limits are not the basis of pricing.
9. Next Steps After Choosing a Plan
- Lock plan: Decide A, B, or C and document it (e.g. in this file or in my-rnd/readme.md).
- Backend: Implement product sync and webhooks (see my-rnd/Setup Shopify & App.md), then ingestion pipeline and RAG (schema from Option-A-Phase-01-Design database schema for RAG.md).
- Admin config screen: Build settings UI in the embedded app (widget style, welcome message, knowledge sources, plan limits).
- Storefront: Theme App Extension loads chat UI; frontend calls your backend; backend does vector search + OpenRouter call and returns the reply.
- Billing: Integrate Shopify Billing API; free = 50 replies/mo, then $10/mo subscription ($10 credits ≈ 500 messages) + add-on credit purchases. Track reply count and credits per shop; show usage in admin.
Once you pick A, B, or C, the rest of the product plan in docs/my-rnd (setup, schema, marketing) stays as-is; only the vector storage and, if needed, the exact embedding/retrieval calls change.