Option A — Phase 12: RAG Data Sync (Products, Collections, Inventory, Blog Posts) + Order Lookup

This document is the plan for extending RAG data sources beyond products: collections, inventory, and blog posts — all synced and stored in the DB for RAG. Orders are handled differently: not stored (to avoid unbounded DB growth); instead, when a customer asks about a specific order (e.g. by order number), we fetch that order on the fly from Shopify, inject it into the chat prompt, and let the AI answer. The plan also defines a Data Sync admin screen with per-resource fetch buttons and incremental sync.

Prerequisites: Phase 2 (product sync & ingestion) complete; existing Product, KnowledgeSource, KnowledgeDocument, KnowledgeChunk, Embedding schema and syncAllProducts / ingestProduct flow.

1. Objective

Fetch and store in DB (for RAG): products (existing), collections, inventory, blog posts. Incremental sync: last_synced_at per resource type; only new/updated records fetched and upserted; ingested into knowledge chunks and embeddings.
Orders — on-the-fly only (no storage): When the customer asks about an order (e.g. “What’s the status of order #1001?”), the chat flow detects or receives the order identifier, calls Shopify Admin API to fetch that order, injects the order details (status, line items, fulfillment, etc.) into the prompt, and the AI answers. No order table, no order sync, no order webhooks. Data is always current and the database does not grow with orders.
New admin screen “Data Sync”: Buttons to sync products, collections, inventory, blog posts (individually + “Fetch all”); display last sync datetime per type. No “Fetch orders” — orders are not synced.

2. Current Implementation (Reference)

2.1 Products (existing)

Trigger: afterAuth in app/shopify.server.js calls syncAllProducts({ shopId, admin }) fire-and-forget on install.
Flow: app/lib/product-sync.server.js: fetchAllShopifyProducts(admin) (GraphQL, cursor pagination) → upsertProduct(shopId, node) → ingestProduct(shopId, product, variants).
DB: products (with lastSyncedAt per row), product_variants; ingestion writes to knowledge_sources (type product), knowledge_documents, knowledge_chunks, embeddings.
Scopes: read_products, read_inventory (already in shopify.app.toml).

2.2 Ingestion pattern

One KnowledgeSource per shop per type (e.g. product, collection, article). Orders are not ingested; they are fetched on demand in chat.
KnowledgeDocument per logical document (e.g. one product, one collection, one article) with externalRefId = Shopify ID; rawText = concatenated chunkable text; hash for skip-if-unchanged.
KnowledgeChunk + Embedding per chunk; hash check avoids re-embedding unchanged content.
RAG-readable labels: Chunk text is explicitly labeled so the AI knows what each piece of context is: products use a “Product: [title]” line; collections use “Collection: [title]”; articles use “Article: [title]”. Tables (products, collections, inventory_levels, articles) store human-readable fields (titles, descriptions, item_label) so any future use (e.g. inventory ingestion) also has meaningful data.
Storefront URLs for AI to share: Products, collections, and articles each have a url field (built from shop domain + handle at sync time). KnowledgeDocument has source_url (set during ingestion from the entity’s url). Vector search returns source_url with each chunk; the prompt builder includes “Link: {url}” in the store context so the AI can provide product/collection/article links to the user in its reply.

3. Data Types: APIs and Permissions

Resource	Shopify API (GraphQL)	Required scope(s)	Stored in DB?	Notes
Products	`products` query	`read_products`	Yes	Already implemented; add incremental via `query` filter `updated_at:>X`.
Collections	`collections` query	`read_products`	Yes	Collections and their products; same scope as products.
Inventory	`inventoryItems` (with `variant.product.title`), `inventoryLevels`	`read_inventory`, `read_products`	Yes (optional)	Store levels with item_label (product/variant name or SKU) so AI knows “quantity of what”; “what’s in stock” RAG.
Orders	`orders` query (single by name/id)	`read_orders`	No	Fetched on the fly when customer asks about an order; injected into chat prompt. No DB storage.
Blog posts	`blogs`, `articles`	`read_content`	Yes	Already have scope; Article, Blog.

Scopes to add in shopify.app.toml (and .env SCOPES):

read_orders — for on-the-fly order lookup in chat only (no sync, no storage).

Already have: read_products, read_inventory, read_content. (Draft orders are not in scope for this plan; see note below.)

Note — What Shopify stores in draft orders: Draft orders are manually created orders in Shopify Admin (not from normal checkout). Merchants use them for: phone/email/in-person sales, custom or wholesale/B2B orders, pre-orders, discounted or wholesale pricing, and sending an invoice with a checkout link. Each draft can contain products, customer, taxes, shipping, discounts, tags; once the customer pays (or payment terms are set), it becomes a regular order. We are not syncing draft orders in this plan.

4. Database Design

4.1 Sync state (last sync per resource type per shop)

Store when each resource type was last synced so we can do incremental fetches (only records updated after that time).

Option A — New table shop_sync_state (recommended):

Column	Type	Purpose
`id`	UUID, PK
`shop_id`	UUID, FK
`resource_type`	String	`products` \| `collections` \| `inventory` \| `blogs` (no `orders` — orders are not synced)
`last_synced_at`	DateTime	When we last successfully finished a sync for this type.
`last_sync_status`	String?	e.g. `completed`, `failed`, `partial`
`last_sync_error`	Text?	Last error message if failed.
`updated_at`	DateTime

Unique: (shop_id, resource_type) so one row per shop per type.
Usage: Before sync, read last_synced_at; call Shopify with filter updated_at:>last_synced_at (or equivalent). After sync, update last_synced_at and status.

Option B — Columns on shops: e.g. last_products_sync_at, last_collections_sync_at, … Works but clutters Shop; Option A scales better.

4.2 New tables for raw data (for RAG and display)

All synced tables store meaningful, human-readable data so RAG and the AI can interpret context without relying on IDs alone:

Products (existing): products (title, description, product_type, vendor, tags, status, url), product_variants (title, price, sku, inventory_quantity). Sync fetches handle from Shopify and builds url as https://{shop_domain}/products/{handle}. Ingestion chunks are prefixed with “Product: [title]”; ingestion also sets knowledge_documents.source_url from product.url so the AI can share the link.
Collections: collections — shop_id, shopify_collection_id, title, description_html, url, last_synced_at. Sync fetches handle and builds url as https://{shop_domain}/collections/{handle}. Ingested text prefixed with “Collection: [title]”; document source_url set from collection.url.
Inventory: Dedicated inventory_levels table: shop_id, shopify_inventory_item_id, location_id, item_label (human-readable name, e.g. “Product Title – Variant” or SKU), quantity, last_synced_at. The item_label is required so the AI/RAG knows what each quantity refers to (“quantity of what”). Sync fetches variant/product title (or SKU) from Shopify and stores it in item_label; optional ingestion for “in stock” RAG.
Orders: Not stored. Fetched on demand in the chat flow (see § 3.1 below).
Blog posts: blogs — shop_id, shopify_blog_id, handle (for building article URLs), title. articles — shop_id, blog_id, shopify_article_id, title, body_html, url, published_at, last_synced_at. Article url is built as https://{shop_domain}/blogs/{blog_handle}/{article_handle}. Ingested text prefixed with “Article: [title]”; document source_url set from article.url.

4.3 Orders: on-the-fly lookup (no DB storage)

To avoid storing orders (which can grow very large), order status is provided to the chat agent on demand:

When the customer asks about an order: The chat endpoint (e.g. api.chat or the RAG flow) must know which order they mean. Options:
- A) Customer includes order name/number in the message (e.g. “What’s the status of order #1001?”). Parse the message (or use the LLM in a lightweight way) to extract the order identifier, then call Shopify Admin API to fetch that order by name or id.
- B) Storefront can pass orderId or orderName in the chat request when the customer is on an order status page or has entered their order number in a form.
Fetch from Shopify: Use GraphQL order(id: ...) or orders(first: 1, query: "name:ORDER_NAME") with the shop’s access token. Require read_orders scope.
Inject into prompt: Add the order details (status, fulfillments, line items, dates) as plain text or structured snippet to the context passed to the LLM (in addition to any RAG chunks). The AI then answers based on this current data.
No persistence: Do not write the order to any table. Each request gets fresh data from Shopify, so status is always current and the database stays small.

All of these should have a last_synced_at (or equivalent) at row level if we want row-level “only new/updated” from Shopify; alternatively we only use shop_sync_state.last_synced_at and request “all records updated since” from Shopify (GraphQL query filter on updated_at).

5. Incremental Sync Mechanism

First run (no last_synced_at): Full fetch (all pages) for that resource type; then set shop_sync_state.last_synced_at = now() (and status completed).
Subsequent runs: Read last_synced_at from shop_sync_state for that resource type. Call Shopify with a filter for “updated_at >= last_synced_at” (or “updated_at > last_synced_at” to avoid duplicates). GraphQL examples:
- Products: products(first: 50, query: "updated_at:>=LAST_SYNC_ISO") (and paginate).
- Collections: collections(first: 50, query: "updated_at:>=LAST_SYNC_ISO").
- Articles: blog.articles(first: 50, query: "updated_at:>=LAST_SYNC_ISO") per blog.
  (Orders are not synced; they are fetched on demand in chat.)
Upsert in DB: For each returned record, upsert by (shop_id, shopify_*_id); set last_synced_at on the row to the sync run time. Only new or updated documents are then passed to ingestion (hash check in ingestion can skip unchanged content).
After sync: Update shop_sync_state for this resource type: last_synced_at = now(), last_sync_status = 'completed' or 'failed', last_sync_error if failed.

6. New Admin Screen: “Data Sync” (or “RAG Data”)

Route: e.g. app/routes/app.data-sync.jsx (or app.sync.jsx).
Nav: Add a link in app/routes/app.jsx (e.g. “Data Sync” or “RAG Data”).
Loader: Load shop_sync_state for the current shop (all synced resource types); optionally counts per table (products, collections, articles, etc.) for display.
UI:
- One card per synced resource type: Products, Collections, Inventory, Blog posts. (No card for Orders — orders are fetched on demand in chat, not synced.)
- Each card: “Last synced: <datetime>” (or “Never”), optional status badge; “Fetch [Products]” (or “Sync products”) button.
- One “Fetch all” button that triggers sync for all stored types in sequence (or in parallel with care for rate limits).
Actions: POST actions (or fetcher) to trigger sync for a single type or “all”. Server-side: call the corresponding sync function (e.g. syncProducts, syncCollections, …); update shop_sync_state when done. Return success/error so the UI can refresh and show new last_synced_at.

7. Sync Functions (Server-Side)

Products: Reuse and extend syncAllProducts to support “incremental” (accept optional since date; pass query to GraphQL). New route action can call “full” or “incremental” based on shop_sync_state.
Collections: New app/lib/collection-sync.server.js: fetchAllShopifyCollections(admin, since?), upsertCollection(shopId, node), syncAllCollections({ shopId, admin }); ingest each collection into KnowledgeSource type collection (new ingestion helper or generic “ingestDocument(shopId, type, id, title, rawText)”).
Inventory: New app/lib/inventory-sync.server.js: fetch inventory items with variant/product title (or SKU), store in inventory_levels with item_label and quantity; optional ingestion so RAG can answer “Product X has Y in stock” (item_label tells the AI what each row’s quantity is for).
Orders: No sync. Instead, implement on-the-fly lookup in the chat flow: e.g. app/lib/order-lookup.server.js with fetchOrderByName(admin, orderName) or fetchOrderById(admin, orderId). Called from the chat API when the user message (or request params) indicates a specific order; result is injected into the prompt. Requires read_orders scope.
Blog posts: New app/lib/blog-sync.server.js: fetch blogs and their articles (with read_content), upsert to articles (and maybe blogs) table, ingest type article.

Each sync function should:

Read shop_sync_state.last_synced_at for that resource type (if incremental).
Fetch from Shopify (full or filtered by updated_at).
Upsert into the relevant DB table(s).
Optionally run ingestion for new/updated documents only (hash check in ingestion).
Update shop_sync_state (last_synced_at, status, error).

8. RAG Ingestion for New Types

Reuse existing pipeline: KnowledgeSource (one per shop per type: collection, inventory, article), KnowledgeDocument (one per collection/article/product with source_url for the storefront link), KnowledgeChunk + Embedding.
Add generic or type-specific ingest helpers (e.g. ingestCollection(shopId, collection), ingestArticle(shopId, article)) that build rawText, set document.source_url from the entity’s url (product.url, collection.url, article.url), compute hash, upsert document, delete old chunks/embeddings, create new chunks and call getEmbeddingsBatched, insert embeddings (same as ingestProduct pattern). When ingestion skips (unchanged hash), still update source_url if the entity has a url so existing documents get links.
Chunking: Reuse or extend chunker.server.js for non-product text (e.g. by character or sentence for articles). Orders are not ingested; they are added to the prompt on the fly when the user asks about a specific order.
Links in chat: Vector search joins knowledge_documents and returns source_url with each chunk. The prompt builder (prompt-builder.server.js) includes a “Link: {source_url}” line for each chunk when present, and the system prompt instructs the AI to include the link when mentioning a product, collection, or blog post so the user can click through.

9. Implementation Checklist

10. File Summary

Item	Action
`shopify.app.toml`	Add scope: `read_orders` (for on-the-fly order lookup only).
`.env` / `.env.sample`	Document SCOPES including new scope.
`prisma/schema.prisma`	Add `ShopSyncState`; add `Collection`, `Article` (and optional `Blog` with handle, `InventoryLevel` with `itemLabel`). Add url to Product, Collection, Article; add source_url to KnowledgeDocument. No Order or DraftOrder model.
`prisma/migrations/`	New migration(s) for above; plus migration for url/source_url and blog handle.
`app/lib/product-sync.server.js`	Extend for incremental (since, query); fetch handle, build and store product url; read/write `shop_sync_state` for products.
`app/lib/collection-sync.server.js`	Fetch handle, build and store collection url; upsert, sync state, ingest (with source_url).
`app/lib/inventory-sync.server.js`	New: fetch inventory with item_label, optional store, sync state.
`app/lib/order-lookup.server.js`	New: fetch single order from Shopify by name/id; used in chat flow to inject order into prompt. No DB write.
`app/lib/blog-sync.server.js`	Fetch blogs (with handle) and articles (with handle); build and store article url; upsert, sync state, ingest (with source_url).
`app/lib/ingestion.server.js`	Ingest helpers for product, collection, article; set knowledge_documents.source_url from entity url; when skipping (hash match), still update source_url if present.
`app/lib/vector-search.server.js`	Join knowledge_documents; return source_url with each chunk for prompt context.
`app/lib/prompt-builder.server.js`	For each chunk, include “Link: {source_url}” in store context when present; system prompt instructs AI to share links when mentioning products/collections/articles.
`app/lib/rag.server.js` (or chat route)	When message/request includes order identifier, call order lookup and add order details to prompt context.
`app/routes/app.data-sync.jsx`	New: Data Sync screen with loader, UI (cards for products, collections, inventory, blogs — no Orders), actions.
`app/routes/app.jsx`	Add nav link to Data Sync.

11. Summary

Topic	Plan
Stored in DB (RAG)	Products (existing), collections, inventory, blog posts.
Orders	Not stored. Fetched on the fly when the customer asks about a specific order (by number/name); order details injected into chat prompt; always current, no DB growth.
Permissions	Add `read_orders` (order lookup only); already have `read_products`, `read_inventory`, `read_content`.
Sync state	Table `shop_sync_state` (shop_id, resource_type, last_synced_at, status, error) for incremental sync; resource_type does not include “orders”.
Incremental	First run: full fetch; later: fetch only records with `updated_at > last_synced_at`; upsert in DB; then update last_synced_at.
UI	New “Data Sync” screen: one button per synced type (products, collections, inventory, blogs) + “Fetch all”; no “Fetch orders”.
RAG	Synced types flow into KnowledgeSource → KnowledgeDocument (with source_url) → KnowledgeChunk → Embedding. Vector search returns source_url; prompt includes “Link: …” so the AI can share product/collection/article URLs with the user. Orders are added to prompt on demand via order lookup.
Storefront links	Products, collections, and articles have url (built from shop domain + handle). knowledge_documents.source_url is set at ingestion and passed into the chat context so the AI can provide clickable links.

Next: implement migrations and sync state table; add new scopes; implement each sync module and ingestion; then build the Data Sync screen and “Fetch all” action.