
Order Lookup Reliability Guide

This guide explains the current order-lookup reliability logic in plain language, including the retry + token-refresh behavior and the main edge cases to watch.


When a shopper asks for order status:

  1. api.chat resolves the shop and picks a token source (Session table first, then Shop table).
  2. generateChatReply in app/lib/rag.server.js decides whether this turn is:
    • order intent without number -> asks for order number
    • order lookup turn with parsed reference -> fetches live order
    • regular RAG chat
  3. If order lookup is needed, it calls fetchOrderByShopResilient in app/lib/order-lookup.server.js.
  4. fetchOrderByShopResilient:
    • retries auth failures with backoff
    • re-reads latest token candidates each attempt
    • can trigger refresh-token exchange on 401/403
  5. On success: the order context is formatted and included in the assistant reply.
  6. On failure: the shopper gets a safe fallback message (auth issue vs generic API issue vs not found).
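
The branching in step 2 can be sketched roughly as follows. This is a minimal illustration, not the code in app/lib/rag.server.js: the function names (parseOrderRef, classifyTurn, buildOrderQueries) and both regular expressions are assumptions made for the example. Only the three branches and the query variants (name:#1001, name:1001, #1001) come from this guide.

```javascript
// Hypothetical helpers; the real logic lives in app/lib/rag.server.js.
const ORDER_INTENT = /\b(order|tracking|shipment|delivery)\b/i; // assumed keyword list
const ORDER_REF = /#?(\d{3,})\b/; // assumed: "#1001" or "1001"

// Pull an order number out of free text, without the leading "#".
function parseOrderRef(text) {
  const match = ORDER_REF.exec(text);
  return match ? match[1] : null;
}

// Decide which of the three branches a chat turn belongs to.
function classifyTurn(text, awaitingOrderNumber = false) {
  const ref = parseOrderRef(text);
  const hasIntent = ORDER_INTENT.test(text);
  if (ref && (hasIntent || awaitingOrderNumber)) return { kind: "order_lookup", ref };
  if (hasIntent) return { kind: "ask_order_number" };
  return { kind: "rag" };
}

// Query variants tried for an order reference, in order.
function buildOrderQueries(ref) {
  return [`name:#${ref}`, `name:${ref}`, `#${ref}`];
}
```

Requiring either order intent or the follow-up flag before treating a number as an order reference keeps messages that merely contain digits on the regular RAG path.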

State is tracked in ChatSession.metadata:

  • awaitingOrderNumber
  • orderAskRetries

Behavior:

  • If the shopper says “where is my order” (no order number), the assistant asks for the order number.
  • If the shopper replies with #1001 or 1001, it is accepted as a direct follow-up order reference.
  • If order API auth fails, the session stays in order-follow-up context, so the next #1001 is still handled as an order lookup turn.
  • On terminal outcomes (success or not found), the order-follow-up state is cleared.

This prevents the flow from drifting into generic RAG after a transient auth issue.
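
The state rules above can be pictured as a small transition function. This is a sketch under assumptions: the function name nextOrderState and the outcome labels are invented for illustration, and incrementing orderAskRetries each time the assistant asks is a guess at how that counter is used. Only the two metadata fields and the keep/clear rules come from this guide.

```javascript
// Hypothetical transition function for ChatSession.metadata.
// Outcome labels are illustrative, not taken from the codebase.
function nextOrderState(metadata, outcome) {
  const state = { awaitingOrderNumber: false, orderAskRetries: 0, ...metadata };
  switch (outcome) {
    case "asked_for_number":
      // Assumption: the counter tracks how often we asked for the number.
      return { ...state, awaitingOrderNumber: true, orderAskRetries: state.orderAskRetries + 1 };
    case "auth_error":
      // Stay in follow-up context so the next "#1001" is still an order lookup.
      return { ...state, awaitingOrderNumber: true };
    case "success":
    case "not_found":
      // Terminal outcomes clear the follow-up state.
      return { ...state, awaitingOrderNumber: false, orderAskRetries: 0 };
    default:
      return state;
  }
}
```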


fetchOrderByShopResilient currently does the following:

  • Backoff attempts: 0ms, 300ms, 900ms, 1800ms
  • Candidate token order per attempt:
    1. current request token
    2. request fallback token
    3. latest offline Session token
    4. latest Shop token
  • For each candidate, it calls fetchOrderByShop.
  • If every candidate in an attempt fails with an auth error, it can call the refreshAccessToken callback and try the refreshed token.
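
A simplified sketch of that loop is shown below. It is not the implementation in app/lib/order-lookup.server.js: the function name lookupWithRetries, the injected getCandidateTokens, fetchOrder and wait parameters, and the { ok, status } result shape are all assumptions chosen to keep the example self-contained. The backoff schedule, the per-attempt re-read of candidates, and the refresh-on-auth-failure step follow the description above.

```javascript
const BACKOFF_MS = [0, 300, 900, 1800];
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
const isAuthError = (status) => status === 401 || status === 403;

// Hypothetical shape: fetchOrder(token) resolves to { ok, status, ... }.
async function lookupWithRetries({ getCandidateTokens, fetchOrder, refreshAccessToken, wait = sleep }) {
  let last = { ok: false, status: 0 };
  for (const delay of BACKOFF_MS) {
    if (delay > 0) await wait(delay);
    // Re-read candidates on every attempt so a token refreshed elsewhere is picked up.
    const tokens = [...new Set((await getCandidateTokens()).filter(Boolean))];
    for (const token of tokens) {
      last = await fetchOrder(token);
      // Only auth failures are retried; success and other errors return at once.
      if (last.ok || !isAuthError(last.status)) return last;
    }
    // Every candidate failed with an auth error: try a refreshed token.
    if (refreshAccessToken) {
      const fresh = await refreshAccessToken();
      if (fresh) {
        last = await fetchOrder(fresh);
        if (last.ok || !isAuthError(last.status)) return last;
      }
    }
  }
  return last;
}
```

Deduplicating the candidates avoids spending a request on the same stale token twice when the request token and the stored Session token are identical.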

Implemented in:

  • rag.server.js (chat path)
  • api.order-lookup-test.jsx (API test path)
  • scripts/test-order-lookup.mjs (CLI script path)

Refresh logic:

  1. Read latest offline Session row.
  2. Use refreshToken (if present and not expired).
  3. Call Shopify token endpoint:
    • POST https://{shopDomain}/admin/oauth/access_token
    • grant_type: refresh_token
  4. Persist new access token to:
    • Session row (accessToken, optional rotated refresh fields)
    • Shop row (accessToken)
  5. Retry order lookup immediately.
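
The exchange in steps 2 and 3 might look roughly like the sketch below. The function names (canRefresh, exchangeRefreshToken), the refreshTokenExpires field, the JSON request body fields, and the response field names are assumptions for illustration and should be checked against Shopify's token documentation and the actual Session schema. Only the endpoint and grant_type: refresh_token come from this guide. Persisting the result (step 4) is left out.

```javascript
// Hypothetical check for step 2; refreshTokenExpires is an assumed field name.
function canRefresh(session, now = new Date()) {
  if (!session || !session.refreshToken) return false;
  return !session.refreshTokenExpires || session.refreshTokenExpires > now;
}

// Hypothetical exchange for step 3. fetchImpl is injectable for testing.
async function exchangeRefreshToken({ shopDomain, clientId, clientSecret, refreshToken, fetchImpl = fetch }) {
  const response = await fetchImpl(`https://${shopDomain}/admin/oauth/access_token`, {
    method: "POST",
    headers: { "Content-Type": "application/json", Accept: "application/json" },
    body: JSON.stringify({
      client_id: clientId,
      client_secret: clientSecret,
      grant_type: "refresh_token",
      refresh_token: refreshToken,
    }),
  });
  if (!response.ok) return null; // caller falls back to the safe error path
  const data = await response.json();
  return {
    accessToken: data.access_token,
    // Keep the old refresh token if the response does not rotate it.
    refreshToken: data.refresh_token ?? refreshToken,
  };
}
```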

These are the main scenarios where it can still fail:

  • Missing/expired refreshToken in offline Session
    Why it fails: auto-refresh cannot run without a valid refresh token.
    Fix:

    1. Re-auth the app once from Shopify Admin to regenerate offline Session credentials.
    2. Add a daily health check job: flag shops where offline Session has missing/expired refreshToken.
    3. In-app admin banner: if order lookup hits this condition, prompt merchant to “Reconnect Shopify”.
    4. Keep fallback behavior user-safe: customer sees temporary order lookup error, not raw auth details.
  • Revoked token / reauth required by merchant
    Why it fails: Shopify rejects both access token and refresh token after revocation/scope change/uninstall-reinstall paths.
    Fix:

    1. Detect repeated 401 after refresh attempt and mark shop auth status as degraded (internal flag or log signal).
    2. Show merchant-facing reconnect CTA in app admin.
    3. Block repeated blind retries after hard-auth failure (avoid useless load and noisy logs).
    4. After successful reauth, immediately test one known order via script or test endpoint.
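  • App uninstalled or scopes changed
    Why it fails: the token stays invalid even after retries, so lookups fail until the merchant re-authorizes the app.
    Fix: follow the reconnect steps listed for revoked tokens above.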

  • Older orders outside default read_orders window
    Why it fails: by default read_orders only exposes orders from the last 60 days, so older orders can come back as not found even though they exist.
    Fix:

    1. Confirm order age in Shopify Admin when lookup returns not found with no auth error.
    2. If business use-case truly needs historical orders, request read_all_orders with clear justification for App Review.
    3. Add user-facing fallback copy: “I can only access recent orders right now. Please contact support for older orders.”
    4. Keep current query strategy (name:#1001, name:1001, #1001) to maximize match quality for accessible orders.
  • Shopify-side transient API issues / throttling
    Why it fails: temporary 5xx, 429, or network instability can outlive local retries.
    Fix:

    1. Add jitter to retry backoff (for example 300-500ms, 900-1300ms) to reduce synchronized spikes.
    2. Treat 429 separately: read throttle hints/headers where available and delay accordingly.
    3. Log per-attempt status code and elapsed time when ORDER_LOOKUP_DEBUG=true.
    4. Keep graceful shopper fallback and avoid exposing “rate limit” internals in widget replies.
  • Concurrent refresh races (rare under load)
    Why it fails: two requests may refresh simultaneously and temporarily use mixed token states.
    Fix:

    1. Add a short-lived per-shop refresh lock (DB/advisory lock or in-memory mutex in single-node env).
    2. If lock exists, second request waits briefly and re-reads Session token before retrying.
    3. Use idempotent update policy: always persist latest refreshed token to both Session and Shop.
    4. Keep retry loop in place even with lock, because cross-instance races can still occur.
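
The jitter and 429 suggestions from the throttling scenario above could be sketched like this. Both helper names are hypothetical, and treating Retry-After as a number of seconds is an assumption to verify against the API responses you actually receive.

```javascript
// Spread a base delay over a window, e.g. 300 -> 300-500ms, 900 -> 900-1300ms.
// random is injectable so the spread can be tested deterministically.
function jitteredDelay(baseMs, spreadMs, random = Math.random) {
  return baseMs + Math.floor(random() * (spreadMs + 1));
}

// For 429 responses, prefer the server's Retry-After hint (assumed to be seconds);
// for anything else, or when the header is missing or malformed, use the fallback.
function delayForResponse(status, retryAfterHeader, fallbackMs) {
  if (status !== 429) return fallbackMs;
  const seconds = Number.parseFloat(retryAfterHeader);
  return Number.isFinite(seconds) && seconds >= 0 ? Math.ceil(seconds * 1000) : fallbackMs;
}
```

A caller would compute the fallback with jitteredDelay(300, 200) or jitteredDelay(900, 400) and pass it to delayForResponse together with the response status and header.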

Troubleshoot in this order:

  1. Run script:
    • node scripts/test-order-lookup.mjs --shop=<shop> --orderRef=#1001
  2. If error is 401:
    • confirm Session has valid refreshToken
    • confirm app API key/secret env vars exist
  3. If found: false with no auth error:
    • verify order ref format and existence in Shopify Admin
    • verify scope window (read_orders vs old order age)
  4. If intermittent:
    • retry once and inspect server logs for refresh success
    • consider adding temporary debug logs around refresh attempts

To harden this for production at scale, implement the following in order:

  1. Per-shop refresh lock around refresh-token exchange.
  2. Auth degradation signal (internal status/event) after repeated 401 + failed refresh.
  3. Ops telemetry: attempt count, token source, refresh success/failure reason.
  4. Merchant reconnect UX in admin for degraded auth state.
  5. Optional scope expansion to read_all_orders only if product requirements require old-order access.
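
For item 1, a single-node version of the per-shop refresh lock can be as small as the sketch below. The name refreshOncePerShop and the module-level Map are assumptions for illustration. This in-memory approach does not coordinate across instances, so a multi-instance deployment would still need a DB or advisory lock, plus the existing retry loop.

```javascript
// Hypothetical in-memory single-flight guard: one refresh per shop at a time.
const inFlightRefreshes = new Map();

function refreshOncePerShop(shop, doRefresh) {
  const existing = inFlightRefreshes.get(shop);
  // A refresh is already running for this shop: share its result.
  if (existing) return existing;
  const pending = Promise.resolve()
    .then(() => doRefresh())
    // Release the slot whether the refresh succeeded or failed.
    .finally(() => inFlightRefreshes.delete(shop));
  inFlightRefreshes.set(shop, pending);
  return pending;
}
```

Concurrent callers for the same shop receive the same promise, so only one token exchange is sent and everyone continues with the same refreshed token.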

  • Core fetch + resilience:
    • app/lib/order-lookup.server.js
  • Conversation orchestration:
    • app/lib/rag.server.js
  • API test endpoint:
    • app/routes/api.order-lookup-test.jsx
  • CLI test script:
    • scripts/test-order-lookup.mjs