Skip to content

Option A — Phase 12: RAG Data Sync (Products, Collections, Inventory, Blog Posts) + Order Lookup

This document is the plan for extending RAG data sources beyond products: collections, inventory, and blog posts — all synced and stored in the DB for RAG. Orders are handled differently: not stored (to avoid unbounded DB growth); instead, when a customer asks about a specific order (e.g. by order number), we fetch that order on the fly from Shopify, inject it into the chat prompt, and let the AI answer. The plan also defines a Data Sync admin screen with per-resource fetch buttons and incremental sync.

Prerequisites: Phase 2 (product sync & ingestion) complete; existing Product, KnowledgeSource, KnowledgeDocument, KnowledgeChunk, Embedding schema and syncAllProducts / ingestProduct flow.


  • Fetch and store in DB (for RAG): products (existing), collections, inventory, blog posts. Incremental sync: last_synced_at per resource type; only new/updated records fetched and upserted; ingested into knowledge chunks and embeddings.
  • Orders — on-the-fly only (no storage): When the customer asks about an order (e.g. “What’s the status of order #1001?”), the chat flow detects or receives the order identifier, calls Shopify Admin API to fetch that order, injects the order details (status, line items, fulfillment, etc.) into the prompt, and the AI answers. No order table, no order sync, no order webhooks. Data is always current and the database does not grow with orders.
  • New admin screen “Data Sync”: Buttons to sync products, collections, inventory, blog posts (individually + “Fetch all”); display last sync datetime per type. No “Fetch orders” — orders are not synced.

  • Trigger: afterAuth in app/shopify.server.js calls syncAllProducts({ shopId, admin }) fire-and-forget on install.
  • Flow: app/lib/product-sync.server.js: fetchAllShopifyProducts(admin) (GraphQL, cursor pagination) → upsertProduct(shopId, node)ingestProduct(shopId, product, variants).
  • DB: products (with lastSyncedAt per row), product_variants; ingestion writes to knowledge_sources (type product), knowledge_documents, knowledge_chunks, embeddings.
  • Scopes: read_products, read_inventory (already in shopify.app.toml).
  • One KnowledgeSource per shop per type (e.g. product, collection, article). Orders are not ingested; they are fetched on demand in chat.
  • KnowledgeDocument per logical document (e.g. one product, one collection, one article) with externalRefId = Shopify ID; rawText = concatenated chunkable text; hash for skip-if-unchanged.
  • KnowledgeChunk + Embedding per chunk; hash check avoids re-embedding unchanged content.
  • RAG-readable labels: Chunk text is explicitly labeled so the AI knows what each piece of context is: products use a “Product: [title]” line; collections use “Collection: [title]”; articles use “Article: [title]”. Tables (products, collections, inventory_levels, articles) store human-readable fields (titles, descriptions, item_label) so any future use (e.g. inventory ingestion) also has meaningful data.
  • Storefront URLs for AI to share: Products, collections, and articles each have a url field (built from shop domain + handle at sync time). KnowledgeDocument has source_url (set during ingestion from the entity’s url). Vector search returns source_url with each chunk; the prompt builder includes “Link: {url}” in the store context so the AI can provide product/collection/article links to the user in its reply.

ResourceShopify API (GraphQL)Required scope(s)Stored in DB?Notes
Productsproducts queryread_productsYesAlready implemented; add incremental via query filter updated_at:>X.
Collectionscollections queryread_productsYesCollections and their products; same scope as products.
InventoryinventoryItems (with variant.product.title), inventoryLevelsread_inventory, read_productsYes (optional)Store levels with item_label (product/variant name or SKU) so AI knows “quantity of what”; “what’s in stock” RAG.
Ordersorders query (single by name/id)read_ordersNoFetched on the fly when customer asks about an order; injected into chat prompt. No DB storage.
Blog postsblogs, articlesread_contentYesAlready have scope; Article, Blog.

Scopes to add in shopify.app.toml (and .env SCOPES):

  • read_orders — for on-the-fly order lookup in chat only (no sync, no storage).

Already have: read_products, read_inventory, read_content. (Draft orders are not in scope for this plan; see note below.)

Note — What Shopify stores in draft orders: Draft orders are manually created orders in Shopify Admin (not from normal checkout). Merchants use them for: phone/email/in-person sales, custom or wholesale/B2B orders, pre-orders, discounted or wholesale pricing, and sending an invoice with a checkout link. Each draft can contain products, customer, taxes, shipping, discounts, tags; once the customer pays (or payment terms are set), it becomes a regular order. We are not syncing draft orders in this plan.


4.1 Sync state (last sync per resource type per shop)

Section titled “4.1 Sync state (last sync per resource type per shop)”

Store when each resource type was last synced so we can do incremental fetches (only records updated after that time).

Option A — New table shop_sync_state (recommended):

ColumnTypePurpose
idUUID, PK
shop_idUUID, FK
resource_typeStringproducts | collections | inventory | blogs (no orders — orders are not synced)
last_synced_atDateTimeWhen we last successfully finished a sync for this type.
last_sync_statusString?e.g. completed, failed, partial
last_sync_errorText?Last error message if failed.
updated_atDateTime
  • Unique: (shop_id, resource_type) so one row per shop per type.
  • Usage: Before sync, read last_synced_at; call Shopify with filter updated_at:>last_synced_at (or equivalent). After sync, update last_synced_at and status.

Option B — Columns on shops: e.g. last_products_sync_at, last_collections_sync_at, … Works but clutters Shop; Option A scales better.

4.2 New tables for raw data (for RAG and display)

Section titled “4.2 New tables for raw data (for RAG and display)”

All synced tables store meaningful, human-readable data so RAG and the AI can interpret context without relying on IDs alone:

  • Products (existing): products (title, description, product_type, vendor, tags, status, url), product_variants (title, price, sku, inventory_quantity). Sync fetches handle from Shopify and builds url as https://{shop_domain}/products/{handle}. Ingestion chunks are prefixed with “Product: [title]”; ingestion also sets knowledge_documents.source_url from product.url so the AI can share the link.
  • Collections: collections — shop_id, shopify_collection_id, title, description_html, url, last_synced_at. Sync fetches handle and builds url as https://{shop_domain}/collections/{handle}. Ingested text prefixed with “Collection: [title]”; document source_url set from collection.url.
  • Inventory: Dedicated inventory_levels table: shop_id, shopify_inventory_item_id, location_id, item_label (human-readable name, e.g. “Product Title – Variant” or SKU), quantity, last_synced_at. The item_label is required so the AI/RAG knows what each quantity refers to (“quantity of what”). Sync fetches variant/product title (or SKU) from Shopify and stores it in item_label; optional ingestion for “in stock” RAG.
  • Orders: Not stored. Fetched on demand in the chat flow (see § 3.1 below).
  • Blog posts: blogs — shop_id, shopify_blog_id, handle (for building article URLs), title. articles — shop_id, blog_id, shopify_article_id, title, body_html, url, published_at, last_synced_at. Article url is built as https://{shop_domain}/blogs/{blog_handle}/{article_handle}. Ingested text prefixed with “Article: [title]”; document source_url set from article.url.

4.3 Orders: on-the-fly lookup (no DB storage)

Section titled “4.3 Orders: on-the-fly lookup (no DB storage)”

To avoid storing orders (which can grow very large), order status is provided to the chat agent on demand:

  1. When the customer asks about an order: The chat endpoint (e.g. api.chat or the RAG flow) must know which order they mean. Options:
    • A) Customer includes order name/number in the message (e.g. “What’s the status of order #1001?”). Parse the message (or use the LLM in a lightweight way) to extract the order identifier, then call Shopify Admin API to fetch that order by name or id.
    • B) Storefront can pass orderId or orderName in the chat request when the customer is on an order status page or has entered their order number in a form.
  2. Fetch from Shopify: Use GraphQL order(id: ...) or orders(first: 1, query: "name:ORDER_NAME") with the shop’s access token. Require read_orders scope.
  3. Inject into prompt: Add the order details (status, fulfillments, line items, dates) as plain text or structured snippet to the context passed to the LLM (in addition to any RAG chunks). The AI then answers based on this current data.
  4. No persistence: Do not write the order to any table. Each request gets fresh data from Shopify, so status is always current and the database stays small.

All of these should have a last_synced_at (or equivalent) at row level if we want row-level “only new/updated” from Shopify; alternatively we only use shop_sync_state.last_synced_at and request “all records updated since” from Shopify (GraphQL query filter on updated_at).


  • First run (no last_synced_at): Full fetch (all pages) for that resource type; then set shop_sync_state.last_synced_at = now() (and status completed).
  • Subsequent runs: Read last_synced_at from shop_sync_state for that resource type. Call Shopify with a filter for “updated_at >= last_synced_at” (or “updated_at > last_synced_at” to avoid duplicates). GraphQL examples:
    • Products: products(first: 50, query: "updated_at:>=LAST_SYNC_ISO") (and paginate).
    • Collections: collections(first: 50, query: "updated_at:>=LAST_SYNC_ISO").
    • Articles: blog.articles(first: 50, query: "updated_at:>=LAST_SYNC_ISO") per blog.
      (Orders are not synced; they are fetched on demand in chat.)
  • Upsert in DB: For each returned record, upsert by (shop_id, shopify_*_id); set last_synced_at on the row to the sync run time. Only new or updated documents are then passed to ingestion (hash check in ingestion can skip unchanged content).
  • After sync: Update shop_sync_state for this resource type: last_synced_at = now(), last_sync_status = 'completed' or 'failed', last_sync_error if failed.

6. New Admin Screen: “Data Sync” (or “RAG Data”)

Section titled “6. New Admin Screen: “Data Sync” (or “RAG Data”)”
  • Route: e.g. app/routes/app.data-sync.jsx (or app.sync.jsx).
  • Nav: Add a link in app/routes/app.jsx (e.g. “Data Sync” or “RAG Data”).
  • Loader: Load shop_sync_state for the current shop (all synced resource types); optionally counts per table (products, collections, articles, etc.) for display.
  • UI:
    • One card per synced resource type: Products, Collections, Inventory, Blog posts. (No card for Orders — orders are fetched on demand in chat, not synced.)
    • Each card: “Last synced: <datetime>” (or “Never”), optional status badge; “Fetch [Products]” (or “Sync products”) button.
    • One “Fetch all” button that triggers sync for all stored types in sequence (or in parallel with care for rate limits).
  • Actions: POST actions (or fetcher) to trigger sync for a single type or “all”. Server-side: call the corresponding sync function (e.g. syncProducts, syncCollections, …); update shop_sync_state when done. Return success/error so the UI can refresh and show new last_synced_at.

  • Products: Reuse and extend syncAllProducts to support “incremental” (accept optional since date; pass query to GraphQL). New route action can call “full” or “incremental” based on shop_sync_state.
  • Collections: New app/lib/collection-sync.server.js: fetchAllShopifyCollections(admin, since?), upsertCollection(shopId, node), syncAllCollections({ shopId, admin }); ingest each collection into KnowledgeSource type collection (new ingestion helper or generic “ingestDocument(shopId, type, id, title, rawText)”).
  • Inventory: New app/lib/inventory-sync.server.js: fetch inventory items with variant/product title (or SKU), store in inventory_levels with item_label and quantity; optional ingestion so RAG can answer “Product X has Y in stock” (item_label tells the AI what each row’s quantity is for).
  • Orders: No sync. Instead, implement on-the-fly lookup in the chat flow: e.g. app/lib/order-lookup.server.js with fetchOrderByName(admin, orderName) or fetchOrderById(admin, orderId). Called from the chat API when the user message (or request params) indicates a specific order; result is injected into the prompt. Requires read_orders scope.
  • Blog posts: New app/lib/blog-sync.server.js: fetch blogs and their articles (with read_content), upsert to articles (and maybe blogs) table, ingest type article.

Each sync function should:

  1. Read shop_sync_state.last_synced_at for that resource type (if incremental).
  2. Fetch from Shopify (full or filtered by updated_at).
  3. Upsert into the relevant DB table(s).
  4. Optionally run ingestion for new/updated documents only (hash check in ingestion).
  5. Update shop_sync_state (last_synced_at, status, error).

  • Reuse existing pipeline: KnowledgeSource (one per shop per type: collection, inventory, article), KnowledgeDocument (one per collection/article/product with source_url for the storefront link), KnowledgeChunk + Embedding.
  • Add generic or type-specific ingest helpers (e.g. ingestCollection(shopId, collection), ingestArticle(shopId, article)) that build rawText, set document.source_url from the entity’s url (product.url, collection.url, article.url), compute hash, upsert document, delete old chunks/embeddings, create new chunks and call getEmbeddingsBatched, insert embeddings (same as ingestProduct pattern). When ingestion skips (unchanged hash), still update source_url if the entity has a url so existing documents get links.
  • Chunking: Reuse or extend chunker.server.js for non-product text (e.g. by character or sentence for articles). Orders are not ingested; they are added to the prompt on the fly when the user asks about a specific order.
  • Links in chat: Vector search joins knowledge_documents and returns source_url with each chunk. The prompt builder (prompt-builder.server.js) includes a “Link: {source_url}” line for each chunk when present, and the system prompt instructs the AI to include the link when mentioning a product, collection, or blog post so the user can click through.

  • Scopes: Add read_orders to shopify.app.toml and .env SCOPES (for on-the-fly order lookup only).
  • DB: shop_sync_state: Add Prisma model and migration; one row per (shop_id, resource_type). Types: products, collections, inventory, blogs (no orders, no draft_orders).
  • DB: New tables: Add collections, articles (no orders table, no draft_orders) (and optionally blogs with handle, inventory_levels with item_label) with shop_id, shopify_*_id, url (products, collections, articles), last_synced_at, and fields needed for RAG text. Add source_url to knowledge_documents.
  • Products incremental: Extend product sync to accept since and use GraphQL query filter; read/write shop_sync_state for products.
  • Collection sync: Implement collection-sync.server.js (fetch, upsert, ingest); wire to shop_sync_state for collections.
  • Inventory sync: Implement inventory-sync.server.js (and optional table); wire to shop_sync_state for inventory.
  • Order lookup (chat): Implement on-the-fly order fetch (e.g. order-lookup.server.js); from chat API, when message/request includes order identifier, fetch order from Shopify and inject into prompt. No DB, no webhooks.
  • Blog sync: Implement blog-sync.server.js (blogs + articles); wire to shop_sync_state for blogs.
  • Data Sync screen: New route app.data-sync.jsx with loader (sync state + counts), UI (cards for products, collections, inventory, blogs — no Orders card), actions to trigger each sync and “Fetch all”.
  • Ingestion: Add ingest helpers for collection, article; register under corresponding KnowledgeSource types. No order ingestion.
  • RAG retrieval: Ensure vector search and context building can use chunks from all source types (already per-shop; filter by source type if needed).
  • Storefront URLs: Add url to products, collections, articles (build from shop domain + handle at sync); add handle to blogs (for article URL). Add source_url to knowledge_documents; set during ingestion from entity url; vector search returns source_url; prompt builder includes “Link: …” in context so the AI can share product/collection/article links with the user.

ItemAction
shopify.app.tomlAdd scope: read_orders (for on-the-fly order lookup only).
.env / .env.sampleDocument SCOPES including new scope.
prisma/schema.prismaAdd ShopSyncState; add Collection, Article (and optional Blog with handle, InventoryLevel with itemLabel). Add url to Product, Collection, Article; add source_url to KnowledgeDocument. No Order or DraftOrder model.
prisma/migrations/New migration(s) for above; plus migration for url/source_url and blog handle.
app/lib/product-sync.server.jsExtend for incremental (since, query); fetch handle, build and store product url; read/write shop_sync_state for products.
app/lib/collection-sync.server.jsFetch handle, build and store collection url; upsert, sync state, ingest (with source_url).
app/lib/inventory-sync.server.jsNew: fetch inventory with item_label, optional store, sync state.
app/lib/order-lookup.server.jsNew: fetch single order from Shopify by name/id; used in chat flow to inject order into prompt. No DB write.
app/lib/blog-sync.server.jsFetch blogs (with handle) and articles (with handle); build and store article url; upsert, sync state, ingest (with source_url).
app/lib/ingestion.server.jsIngest helpers for product, collection, article; set knowledge_documents.source_url from entity url; when skipping (hash match), still update source_url if present.
app/lib/vector-search.server.jsJoin knowledge_documents; return source_url with each chunk for prompt context.
app/lib/prompt-builder.server.jsFor each chunk, include “Link: {source_url}” in store context when present; system prompt instructs AI to share links when mentioning products/collections/articles.
app/lib/rag.server.js (or chat route)When message/request includes order identifier, call order lookup and add order details to prompt context.
app/routes/app.data-sync.jsxNew: Data Sync screen with loader, UI (cards for products, collections, inventory, blogs — no Orders), actions.
app/routes/app.jsxAdd nav link to Data Sync.

TopicPlan
Stored in DB (RAG)Products (existing), collections, inventory, blog posts.
OrdersNot stored. Fetched on the fly when the customer asks about a specific order (by number/name); order details injected into chat prompt; always current, no DB growth.
PermissionsAdd read_orders (order lookup only); already have read_products, read_inventory, read_content.
Sync stateTable shop_sync_state (shop_id, resource_type, last_synced_at, status, error) for incremental sync; resource_type does not include “orders”.
IncrementalFirst run: full fetch; later: fetch only records with updated_at > last_synced_at; upsert in DB; then update last_synced_at.
UINew “Data Sync” screen: one button per synced type (products, collections, inventory, blogs) + “Fetch all”; no “Fetch orders”.
RAGSynced types flow into KnowledgeSource → KnowledgeDocument (with source_url) → KnowledgeChunk → Embedding. Vector search returns source_url; prompt includes “Link: …” so the AI can share product/collection/article URLs with the user. Orders are added to prompt on demand via order lookup.
Storefront linksProducts, collections, and articles have url (built from shop domain + handle). knowledge_documents.source_url is set at ingestion and passed into the chat context so the AI can provide clickable links.

Next: implement migrations and sync state table; add new scopes; implement each sync module and ingestion; then build the Data Sync screen and “Fetch all” action.