The challenge
The client is a DTC home-and-lifestyle brand that grew fast enough to break its support function. Two agents were fielding the volume of five, and the queue was almost entirely the same handful of questions: where is my order, how do I start a return, is this back in stock, what's your policy on X. First-response time averaged six hours, a ticket sent on a Friday night could wait until Monday morning for a reply, and CSAT was sliding as the backlog grew.
The founder had been pitched three "AI chatbot" tools and walked away from all of them for the same reason: every demo cheerfully made up order details. For a brand whose entire support load is factual ("where is order #4471"), a bot that confidently invents a tracking number is worse than no bot at all. It does not reduce tickets; it manufactures angry ones.
What we found
Two weeks reading the queue confirmed the shape of the problem. Roughly 70% of tickets were answerable from data the brand already had: order status, return eligibility, product details, and policies. The blocker was never knowledge; it was that the answers lived in Shopify and a help center the customer would not dig through, and a human had to look them up one at a time.
That pointed at the constraint that ended up shaping the whole approach: we could not ship anything that guessed at order facts. The assistant had to ground every order-specific answer in live data, or refuse and escalate.
What we built
We built a retrieval-augmented generation (RAG) support assistant, gated behind an evaluation suite before a single customer touched it.
- Grounded retrieval: The assistant answers order and product questions by retrieving over the brand's live Shopify order and product data and the indexed help-center policies. It does not answer order-specific questions from the model's parameters; if the data is not retrieved, it does not assert the fact.
- Eval suite before launch: We built a test set of real-world support questions, including adversarial ones designed to bait a hallucination ("what's the tracking for an order that doesn't exist"). The assistant had to pass the eval gate (zero fabricated order facts) before launch, and the suite runs on every change so a model or prompt update cannot silently regress it.
- Integrations + escalation paths: Wired into Shopify for order and product data and into the helpdesk so conversations, tags, and history stay in one place. Clear escalation rules hand off to a human the moment the question leaves the assistant's grounded scope (refunds, damaged goods, anything sensitive), with full context attached so the customer never repeats themselves.
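The answer-or-escalate rule described in the list above can be sketched roughly as follows. This is a minimal illustration, not the production assistant: the in-memory `ORDERS` dictionary stands in for live Shopify data, and `lookup_order`, `answer`, `Reply`, and the `ESCALATE_KEYWORDS` list are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in for live order data; a real system would query the store.
ORDERS = {
    "4471": {"status": "shipped", "tracking": "TRACK-EXAMPLE-1"},
}

# Assumed keyword list for topics that always go to a human.
ESCALATE_KEYWORDS = ("refund", "damaged", "broken", "chargeback")


@dataclass
class Reply:
    text: str
    escalated: bool
    source_order_id: Optional[str] = None  # the record the answer traces to, if any


def lookup_order(order_id: str) -> Optional[dict]:
    """Return the order record, or None when no such order exists."""
    return ORDERS.get(order_id)


def answer(question: str, order_id: Optional[str]) -> Reply:
    lowered = question.lower()
    # Sensitive topics leave the grounded scope: hand off immediately.
    if any(word in lowered for word in ESCALATE_KEYWORDS):
        return Reply("Passing this to a teammate with your details attached.", escalated=True)
    if order_id is None:
        return Reply("Could you share your order number?", escalated=False)
    order = lookup_order(order_id)
    if order is None:
        # Nothing was retrieved, so no order fact is asserted.
        return Reply(
            f"I can't find order #{order_id}, so I'm handing this to a teammate.",
            escalated=True,
        )
    # Every fact in the reply comes from the retrieved record.
    return Reply(
        f"Order #{order_id} is {order['status']}. Tracking: {order['tracking']}.",
        escalated=False,
        source_order_id=order_id,
    )
```

The point of the sketch is the ordering of the checks: the only path that states a tracking number is the one that has a retrieved record in hand.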
First response time, before: 6 hours (two agents, manual lookups per ticket)
First response time, after: instant (grounded assistant, humans handle escalations only)
Results
(See the full headline-results grid at the top of this page.)
How the engagement ran
Ticket audit + eval design
Categorized 60 days of tickets. Confirmed ~70% were answerable from existing data. Built the eval set, including adversarial hallucination bait, before writing any assistant logic.
Retrieval + Shopify integration
Connected live Shopify order and product data, indexed the help-center policies, and built grounded retrieval so every order answer traces to a real record.
Eval gate + escalation paths
Ran the assistant against the eval suite until it hit zero fabricated order facts. Wired helpdesk integration and the human escalation rules for anything out of scope.
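One way to picture a "zero fabricated order facts" gate is the sketch below. It assumes a tracking-number format (`TRK` plus digits), a tiny hypothetical ground-truth table, and two toy assistants; none of these names or formats come from the actual engagement.

```python
import re
from typing import Callable, Dict, List

# Hypothetical ground truth: the only tracking numbers that actually exist.
KNOWN_TRACKING: Dict[str, str] = {"4471": "TRK1001"}

# Eval cases as (question, order_id). Order "9999" does not exist: it is the bait.
EVAL_CASES = [
    ("Where is order #4471?", "4471"),
    ("What's the tracking for order #9999?", "9999"),
]

TRACKING_PATTERN = re.compile(r"\bTRK\d+\b")  # assumed tracking format


def fabricated_facts(reply: str, order_id: str) -> List[str]:
    """Tracking numbers in the reply that do not match the real record."""
    real = KNOWN_TRACKING.get(order_id)
    return [t for t in TRACKING_PATTERN.findall(reply) if t != real]


def run_gate(assistant: Callable[[str, str], str]) -> bool:
    """Pass only if no eval case produces a fabricated tracking number."""
    failures = 0
    for question, order_id in EVAL_CASES:
        failures += len(fabricated_facts(assistant(question, order_id), order_id))
    return failures == 0


def grounded_assistant(question: str, order_id: str) -> str:
    tracking = KNOWN_TRACKING.get(order_id)
    if tracking is None:
        return "I can't find that order; escalating to a teammate."
    return f"Your tracking number is {tracking}."


def guessing_assistant(question: str, order_id: str) -> str:
    # Toy example of the failure mode: it always asserts a number.
    return "Your tracking number is TRK9999."
```

Run in CI on every model or prompt change, a check of this shape fails the build if `run_gate` returns `False`, which is how a regression gets caught before customers see it.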
Shadow launch + go-live
Ran the assistant in shadow mode against live tickets, comparing its drafts to agent replies, before letting it respond. Went live once the agreement rate held.
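As a rough sketch of how an agreement rate might be tracked in shadow mode, the example below compares outcome labels for the assistant's draft and the agent's reply on each ticket. The label comparison, the 0.95 threshold, and the three-day window are illustrative assumptions, not figures from this engagement.

```python
from typing import List, Tuple


def normalize(label: str) -> str:
    """Lowercase and trim so trivial formatting differences do not count."""
    return label.strip().lower()


def agreement_rate(pairs: List[Tuple[str, str]]) -> float:
    """Share of tickets where the assistant draft and the agent reply agree.

    Each pair is (assistant_outcome, agent_outcome), e.g. ("shipped", "shipped").
    """
    if not pairs:
        return 0.0
    matches = sum(1 for draft, agent in pairs if normalize(draft) == normalize(agent))
    return matches / len(pairs)


def rate_held(daily_rates: List[float], threshold: float = 0.95, days: int = 3) -> bool:
    """True when the most recent `days` daily rates all meet the threshold."""
    recent = daily_rates[-days:]
    return len(recent) == days and all(rate >= threshold for rate in recent)
```

Comparing labels rather than raw text is a simplification; free-text replies would need a looser notion of agreement.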
The numbers
Inside 60 days, support tickets reaching a human dropped 38% as the assistant resolved the repetitive, data-backed questions outright. First response on those topics went from a six-hour average to instant. CSAT rose 12 points, and the number the founder cared about most held exactly where we promised: zero hallucinated order facts, enforced by the eval gate on every change. The two agents stopped triaging "where is my order" and started handling the cases that actually need a human.
"Every other tool we tried would happily invent a tracking number. This one refuses to answer unless it can pull the real order, and the eval suite means a model update can't quietly start lying. My team finally works the tickets that need a person."
What we learned
The eval suite is what separates a useful support assistant from a liability. For a brand whose support load is factual, the launch criterion is not "does it sound helpful" but "can it be baited into lying, or does it refuse." Build the evals first, gate the launch on them, and run them on every change. The hallucination number is a promise you keep with tests, not hope.
Is your queue mostly questions your own data already answers, and have the chatbot demos all scared you off by inventing facts? That is exactly the problem we built this for. Start a conversation.