Option A — Phase 12: RAG Data Sync (Products, Collections, Inventory, Blog Posts) + Order Lookup
This document is the plan for extending RAG data sources beyond products: collections, inventory, and blog posts — all synced and stored in the DB for RAG. Orders are handled differently: not stored (to avoid unbounded DB growth); instead, when a customer asks about a specific order (e.g. by order number), we fetch that order on the fly from Shopify, inject it into the chat prompt, and let the AI answer. The plan also defines a Data Sync admin screen with per-resource fetch buttons and incremental sync.
Prerequisites: Phase 2 (product sync & ingestion) complete; existing Product, KnowledgeSource, KnowledgeDocument, KnowledgeChunk, Embedding schema and syncAllProducts / ingestProduct flow.
1. Objective
Section titled “1. Objective”- Fetch and store in DB (for RAG): products (existing), collections, inventory, blog posts. Incremental sync:
last_synced_atper resource type; only new/updated records fetched and upserted; ingested into knowledge chunks and embeddings. - Orders — on-the-fly only (no storage): When the customer asks about an order (e.g. “What’s the status of order #1001?”), the chat flow detects or receives the order identifier, calls Shopify Admin API to fetch that order, injects the order details (status, line items, fulfillment, etc.) into the prompt, and the AI answers. No order table, no order sync, no order webhooks. Data is always current and the database does not grow with orders.
- New admin screen “Data Sync”: Buttons to sync products, collections, inventory, blog posts (individually + “Fetch all”); display last sync datetime per type. No “Fetch orders” — orders are not synced.
2. Current Implementation (Reference)
Section titled “2. Current Implementation (Reference)”2.1 Products (existing)
Section titled “2.1 Products (existing)”- Trigger:
afterAuthinapp/shopify.server.jscallssyncAllProducts({ shopId, admin })fire-and-forget on install. - Flow:
app/lib/product-sync.server.js:fetchAllShopifyProducts(admin)(GraphQL, cursor pagination) →upsertProduct(shopId, node)→ingestProduct(shopId, product, variants). - DB:
products(withlastSyncedAtper row),product_variants; ingestion writes toknowledge_sources(typeproduct),knowledge_documents,knowledge_chunks,embeddings. - Scopes:
read_products,read_inventory(already inshopify.app.toml).
2.2 Ingestion pattern
Section titled “2.2 Ingestion pattern”- One KnowledgeSource per shop per
type(e.g.product,collection,article). Orders are not ingested; they are fetched on demand in chat. - KnowledgeDocument per logical document (e.g. one product, one collection, one article) with
externalRefId= Shopify ID;rawText= concatenated chunkable text;hashfor skip-if-unchanged. - KnowledgeChunk + Embedding per chunk; hash check avoids re-embedding unchanged content.
- RAG-readable labels: Chunk text is explicitly labeled so the AI knows what each piece of context is: products use a “Product: [title]” line; collections use “Collection: [title]”; articles use “Article: [title]”. Tables (products, collections, inventory_levels, articles) store human-readable fields (titles, descriptions, item_label) so any future use (e.g. inventory ingestion) also has meaningful data.
- Storefront URLs for AI to share: Products, collections, and articles each have a url field (built from shop domain + handle at sync time). KnowledgeDocument has source_url (set during ingestion from the entity’s url). Vector search returns
source_urlwith each chunk; the prompt builder includes “Link: {url}” in the store context so the AI can provide product/collection/article links to the user in its reply.
3. Data Types: APIs and Permissions
Section titled “3. Data Types: APIs and Permissions”| Resource | Shopify API (GraphQL) | Required scope(s) | Stored in DB? | Notes |
|---|---|---|---|---|
| Products | products query | read_products | Yes | Already implemented; add incremental via query filter updated_at:>X. |
| Collections | collections query | read_products | Yes | Collections and their products; same scope as products. |
| Inventory | inventoryItems (with variant.product.title), inventoryLevels | read_inventory, read_products | Yes (optional) | Store levels with item_label (product/variant name or SKU) so AI knows “quantity of what”; “what’s in stock” RAG. |
| Orders | orders query (single by name/id) | read_orders | No | Fetched on the fly when customer asks about an order; injected into chat prompt. No DB storage. |
| Blog posts | blogs, articles | read_content | Yes | Already have scope; Article, Blog. |
Scopes to add in shopify.app.toml (and .env SCOPES):
read_orders— for on-the-fly order lookup in chat only (no sync, no storage).
Already have: read_products, read_inventory, read_content. (Draft orders are not in scope for this plan; see note below.)
Note — What Shopify stores in draft orders: Draft orders are manually created orders in Shopify Admin (not from normal checkout). Merchants use them for: phone/email/in-person sales, custom or wholesale/B2B orders, pre-orders, discounted or wholesale pricing, and sending an invoice with a checkout link. Each draft can contain products, customer, taxes, shipping, discounts, tags; once the customer pays (or payment terms are set), it becomes a regular order. We are not syncing draft orders in this plan.
4. Database Design
Section titled “4. Database Design”4.1 Sync state (last sync per resource type per shop)
Section titled “4.1 Sync state (last sync per resource type per shop)”Store when each resource type was last synced so we can do incremental fetches (only records updated after that time).
Option A — New table shop_sync_state (recommended):
| Column | Type | Purpose |
|---|---|---|
id | UUID, PK | |
shop_id | UUID, FK | |
resource_type | String | products | collections | inventory | blogs (no orders — orders are not synced) |
last_synced_at | DateTime | When we last successfully finished a sync for this type. |
last_sync_status | String? | e.g. completed, failed, partial |
last_sync_error | Text? | Last error message if failed. |
updated_at | DateTime |
- Unique:
(shop_id, resource_type)so one row per shop per type. - Usage: Before sync, read
last_synced_at; call Shopify with filterupdated_at:>last_synced_at(or equivalent). After sync, updatelast_synced_atand status.
Option B — Columns on shops: e.g. last_products_sync_at, last_collections_sync_at, … Works but clutters Shop; Option A scales better.
4.2 New tables for raw data (for RAG and display)
Section titled “4.2 New tables for raw data (for RAG and display)”All synced tables store meaningful, human-readable data so RAG and the AI can interpret context without relying on IDs alone:
- Products (existing):
products(title, description, product_type, vendor, tags, status, url),product_variants(title, price, sku, inventory_quantity). Sync fetcheshandlefrom Shopify and builds url ashttps://{shop_domain}/products/{handle}. Ingestion chunks are prefixed with “Product: [title]”; ingestion also sets knowledge_documents.source_url from product.url so the AI can share the link. - Collections:
collections— shop_id, shopify_collection_id, title, description_html, url, last_synced_at. Sync fetcheshandleand builds url ashttps://{shop_domain}/collections/{handle}. Ingested text prefixed with “Collection: [title]”; document source_url set from collection.url. - Inventory: Dedicated
inventory_levelstable: shop_id, shopify_inventory_item_id, location_id, item_label (human-readable name, e.g. “Product Title – Variant” or SKU), quantity, last_synced_at. The item_label is required so the AI/RAG knows what each quantity refers to (“quantity of what”). Sync fetches variant/product title (or SKU) from Shopify and stores it initem_label; optional ingestion for “in stock” RAG. - Orders: Not stored. Fetched on demand in the chat flow (see § 3.1 below).
- Blog posts:
blogs— shop_id, shopify_blog_id, handle (for building article URLs), title.articles— shop_id, blog_id, shopify_article_id, title, body_html, url, published_at, last_synced_at. Article url is built ashttps://{shop_domain}/blogs/{blog_handle}/{article_handle}. Ingested text prefixed with “Article: [title]”; document source_url set from article.url.
4.3 Orders: on-the-fly lookup (no DB storage)
Section titled “4.3 Orders: on-the-fly lookup (no DB storage)”To avoid storing orders (which can grow very large), order status is provided to the chat agent on demand:
- When the customer asks about an order: The chat endpoint (e.g.
api.chator the RAG flow) must know which order they mean. Options:- A) Customer includes order name/number in the message (e.g. “What’s the status of order #1001?”). Parse the message (or use the LLM in a lightweight way) to extract the order identifier, then call Shopify Admin API to fetch that order by
nameorid. - B) Storefront can pass
orderIdororderNamein the chat request when the customer is on an order status page or has entered their order number in a form.
- A) Customer includes order name/number in the message (e.g. “What’s the status of order #1001?”). Parse the message (or use the LLM in a lightweight way) to extract the order identifier, then call Shopify Admin API to fetch that order by
- Fetch from Shopify: Use GraphQL
order(id: ...)ororders(first: 1, query: "name:ORDER_NAME")with the shop’s access token. Requireread_ordersscope. - Inject into prompt: Add the order details (status, fulfillments, line items, dates) as plain text or structured snippet to the context passed to the LLM (in addition to any RAG chunks). The AI then answers based on this current data.
- No persistence: Do not write the order to any table. Each request gets fresh data from Shopify, so status is always current and the database stays small.
All of these should have a last_synced_at (or equivalent) at row level if we want row-level “only new/updated” from Shopify; alternatively we only use shop_sync_state.last_synced_at and request “all records updated since” from Shopify (GraphQL query filter on updated_at).
5. Incremental Sync Mechanism
Section titled “5. Incremental Sync Mechanism”- First run (no
last_synced_at): Full fetch (all pages) for that resource type; then setshop_sync_state.last_synced_at = now()(and statuscompleted). - Subsequent runs: Read
last_synced_atfromshop_sync_statefor that resource type. Call Shopify with a filter for “updated_at >= last_synced_at” (or “updated_at > last_synced_at” to avoid duplicates). GraphQL examples:- Products:
products(first: 50, query: "updated_at:>=LAST_SYNC_ISO")(and paginate). - Collections:
collections(first: 50, query: "updated_at:>=LAST_SYNC_ISO"). - Articles:
blog.articles(first: 50, query: "updated_at:>=LAST_SYNC_ISO")per blog.
(Orders are not synced; they are fetched on demand in chat.)
- Products:
- Upsert in DB: For each returned record, upsert by (shop_id, shopify_*_id); set
last_synced_aton the row to the sync run time. Only new or updated documents are then passed to ingestion (hash check in ingestion can skip unchanged content). - After sync: Update
shop_sync_statefor this resource type:last_synced_at = now(),last_sync_status = 'completed'or'failed',last_sync_errorif failed.
6. New Admin Screen: “Data Sync” (or “RAG Data”)
Section titled “6. New Admin Screen: “Data Sync” (or “RAG Data”)”- Route: e.g.
app/routes/app.data-sync.jsx(orapp.sync.jsx). - Nav: Add a link in
app/routes/app.jsx(e.g. “Data Sync” or “RAG Data”). - Loader: Load
shop_sync_statefor the current shop (all synced resource types); optionally counts per table (products, collections, articles, etc.) for display. - UI:
- One card per synced resource type: Products, Collections, Inventory, Blog posts. (No card for Orders — orders are fetched on demand in chat, not synced.)
- Each card: “Last synced: <datetime>” (or “Never”), optional status badge; “Fetch [Products]” (or “Sync products”) button.
- One “Fetch all” button that triggers sync for all stored types in sequence (or in parallel with care for rate limits).
- Actions: POST actions (or fetcher) to trigger sync for a single type or “all”. Server-side: call the corresponding sync function (e.g.
syncProducts,syncCollections, …); updateshop_sync_statewhen done. Return success/error so the UI can refresh and show new last_synced_at.
7. Sync Functions (Server-Side)
Section titled “7. Sync Functions (Server-Side)”- Products: Reuse and extend
syncAllProductsto support “incremental” (accept optionalsincedate; passqueryto GraphQL). New route action can call “full” or “incremental” based onshop_sync_state. - Collections: New
app/lib/collection-sync.server.js:fetchAllShopifyCollections(admin, since?),upsertCollection(shopId, node),syncAllCollections({ shopId, admin }); ingest each collection into KnowledgeSource typecollection(new ingestion helper or generic “ingestDocument(shopId, type, id, title, rawText)”). - Inventory: New
app/lib/inventory-sync.server.js: fetch inventory items with variant/product title (or SKU), store ininventory_levelswith item_label and quantity; optional ingestion so RAG can answer “Product X has Y in stock” (item_label tells the AI what each row’s quantity is for). - Orders: No sync. Instead, implement on-the-fly lookup in the chat flow: e.g.
app/lib/order-lookup.server.jswithfetchOrderByName(admin, orderName)orfetchOrderById(admin, orderId). Called from the chat API when the user message (or request params) indicates a specific order; result is injected into the prompt. Requiresread_ordersscope. - Blog posts: New
app/lib/blog-sync.server.js: fetch blogs and their articles (withread_content), upsert toarticles(and maybeblogs) table, ingest typearticle.
Each sync function should:
- Read
shop_sync_state.last_synced_atfor that resource type (if incremental). - Fetch from Shopify (full or filtered by updated_at).
- Upsert into the relevant DB table(s).
- Optionally run ingestion for new/updated documents only (hash check in ingestion).
- Update
shop_sync_state(last_synced_at, status, error).
8. RAG Ingestion for New Types
Section titled “8. RAG Ingestion for New Types”- Reuse existing pipeline: KnowledgeSource (one per shop per type:
collection,inventory,article), KnowledgeDocument (one per collection/article/product with source_url for the storefront link), KnowledgeChunk + Embedding. - Add generic or type-specific ingest helpers (e.g.
ingestCollection(shopId, collection),ingestArticle(shopId, article)) that buildrawText, set document.source_url from the entity’s url (product.url, collection.url, article.url), compute hash, upsert document, delete old chunks/embeddings, create new chunks and callgetEmbeddingsBatched, insert embeddings (same asingestProductpattern). When ingestion skips (unchanged hash), still update source_url if the entity has a url so existing documents get links. - Chunking: Reuse or extend
chunker.server.jsfor non-product text (e.g. by character or sentence for articles). Orders are not ingested; they are added to the prompt on the fly when the user asks about a specific order. - Links in chat: Vector search joins knowledge_documents and returns source_url with each chunk. The prompt builder (
prompt-builder.server.js) includes a “Link: {source_url}” line for each chunk when present, and the system prompt instructs the AI to include the link when mentioning a product, collection, or blog post so the user can click through.
9. Implementation Checklist
Section titled “9. Implementation Checklist”- Scopes: Add
read_orderstoshopify.app.tomland.envSCOPES (for on-the-fly order lookup only). - DB: shop_sync_state: Add Prisma model and migration; one row per (shop_id, resource_type). Types: products, collections, inventory, blogs (no orders, no draft_orders).
- DB: New tables: Add
collections,articles(noorderstable, nodraft_orders) (and optionallyblogswith handle,inventory_levelswith item_label) with shop_id, shopify_*_id, url (products, collections, articles), last_synced_at, and fields needed for RAG text. Add source_url toknowledge_documents. - Products incremental: Extend product sync to accept
sinceand use GraphQL query filter; read/writeshop_sync_stateforproducts. - Collection sync: Implement
collection-sync.server.js(fetch, upsert, ingest); wire toshop_sync_stateforcollections. - Inventory sync: Implement
inventory-sync.server.js(and optional table); wire toshop_sync_stateforinventory. - Order lookup (chat): Implement on-the-fly order fetch (e.g.
order-lookup.server.js); from chat API, when message/request includes order identifier, fetch order from Shopify and inject into prompt. No DB, no webhooks. - Blog sync: Implement
blog-sync.server.js(blogs + articles); wire toshop_sync_stateforblogs. - Data Sync screen: New route
app.data-sync.jsxwith loader (sync state + counts), UI (cards for products, collections, inventory, blogs — no Orders card), actions to trigger each sync and “Fetch all”. - Ingestion: Add ingest helpers for collection, article; register under corresponding KnowledgeSource types. No order ingestion.
- RAG retrieval: Ensure vector search and context building can use chunks from all source types (already per-shop; filter by source type if needed).
- Storefront URLs: Add url to products, collections, articles (build from shop domain + handle at sync); add handle to blogs (for article URL). Add source_url to knowledge_documents; set during ingestion from entity url; vector search returns source_url; prompt builder includes “Link: …” in context so the AI can share product/collection/article links with the user.
10. File Summary
Section titled “10. File Summary”| Item | Action |
|---|---|
shopify.app.toml | Add scope: read_orders (for on-the-fly order lookup only). |
.env / .env.sample | Document SCOPES including new scope. |
prisma/schema.prisma | Add ShopSyncState; add Collection, Article (and optional Blog with handle, InventoryLevel with itemLabel). Add url to Product, Collection, Article; add source_url to KnowledgeDocument. No Order or DraftOrder model. |
prisma/migrations/ | New migration(s) for above; plus migration for url/source_url and blog handle. |
app/lib/product-sync.server.js | Extend for incremental (since, query); fetch handle, build and store product url; read/write shop_sync_state for products. |
app/lib/collection-sync.server.js | Fetch handle, build and store collection url; upsert, sync state, ingest (with source_url). |
app/lib/inventory-sync.server.js | New: fetch inventory with item_label, optional store, sync state. |
app/lib/order-lookup.server.js | New: fetch single order from Shopify by name/id; used in chat flow to inject order into prompt. No DB write. |
app/lib/blog-sync.server.js | Fetch blogs (with handle) and articles (with handle); build and store article url; upsert, sync state, ingest (with source_url). |
app/lib/ingestion.server.js | Ingest helpers for product, collection, article; set knowledge_documents.source_url from entity url; when skipping (hash match), still update source_url if present. |
app/lib/vector-search.server.js | Join knowledge_documents; return source_url with each chunk for prompt context. |
app/lib/prompt-builder.server.js | For each chunk, include “Link: {source_url}” in store context when present; system prompt instructs AI to share links when mentioning products/collections/articles. |
app/lib/rag.server.js (or chat route) | When message/request includes order identifier, call order lookup and add order details to prompt context. |
app/routes/app.data-sync.jsx | New: Data Sync screen with loader, UI (cards for products, collections, inventory, blogs — no Orders), actions. |
app/routes/app.jsx | Add nav link to Data Sync. |
11. Summary
Section titled “11. Summary”| Topic | Plan |
|---|---|
| Stored in DB (RAG) | Products (existing), collections, inventory, blog posts. |
| Orders | Not stored. Fetched on the fly when the customer asks about a specific order (by number/name); order details injected into chat prompt; always current, no DB growth. |
| Permissions | Add read_orders (order lookup only); already have read_products, read_inventory, read_content. |
| Sync state | Table shop_sync_state (shop_id, resource_type, last_synced_at, status, error) for incremental sync; resource_type does not include “orders”. |
| Incremental | First run: full fetch; later: fetch only records with updated_at > last_synced_at; upsert in DB; then update last_synced_at. |
| UI | New “Data Sync” screen: one button per synced type (products, collections, inventory, blogs) + “Fetch all”; no “Fetch orders”. |
| RAG | Synced types flow into KnowledgeSource → KnowledgeDocument (with source_url) → KnowledgeChunk → Embedding. Vector search returns source_url; prompt includes “Link: …” so the AI can share product/collection/article URLs with the user. Orders are added to prompt on demand via order lookup. |
| Storefront links | Products, collections, and articles have url (built from shop domain + handle). knowledge_documents.source_url is set at ingestion and passed into the chat context so the AI can provide clickable links. |
Next: implement migrations and sync state table; add new scopes; implement each sync module and ingestion; then build the Data Sync screen and “Fetch all” action.