AI support for DTC: -38% tickets, eval-gated

Headline results

Support tickets: -38% in 60 days
First response: 6h → instant for handled topics
CSAT: +12 pts post-launch
Hallucinated order facts: 0 (eval-gated)

The challenge

The client is a DTC home-and-lifestyle brand that grew fast enough to break its support function. Two agents were fielding the volume of five, and the queue was almost entirely the same handful of questions: where is my order, how do I start a return, is this back in stock, what's your policy on X. First-response time averaged six hours, which on a Friday night meant a Monday morning reply, and CSAT was sliding as the backlog grew.

The founder had been pitched three "AI chatbot" tools and walked away from all of them for the same reason: every demo cheerfully made up order details. For a brand whose entire support load is factual ("where is order #4471?"), a bot that confidently invents a tracking number is worse than no bot at all. It does not reduce tickets; it manufactures angry ones.

What we found

Two weeks of reading the queue confirmed the shape of the problem. Roughly 70% of tickets were answerable from data the brand already had: order status, return eligibility, product details, and policies. The blocker was never knowledge; it was that the answers lived in Shopify and in a help center the customer would not dig through, so a human had to look them up one at a time.

That pointed at the constraint that ended up shaping the whole approach: we could not ship anything that guessed at order facts. The assistant had to ground every order-specific answer in live data, or refuse and escalate.

What we built

A retrieval-augmented (RAG) support assistant, gated behind an evaluation suite before a single customer touched it.

Grounded retrieval: The assistant answers order and product questions by retrieving over the brand's live Shopify order and product data and the indexed help-center policies. It does not answer order-specific questions from the model's parameters; if the data is not retrieved, it does not assert the fact.
Eval suite before launch: We built a test set of real-world support questions, including adversarial ones designed to bait a hallucination ("what's the tracking for an order that doesn't exist"). The assistant had to pass the eval gate (zero fabricated order facts) before launch, and the suite runs on every change so that a model or prompt update cannot silently regress.
Integrations + escalation paths: Wired into Shopify for order and product data and into the helpdesk so conversations, tags, and history stay in one place. Clear escalation rules hand off to a human the moment a question leaves the assistant's grounded scope (refunds, damaged goods, anything sensitive), with full context attached so the customer never has to repeat themselves.
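
The grounding and escalation rules above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the production implementation: the `ORDERS` dictionary stands in for a live Shopify lookup, and `answer_order_question`, `SENSITIVE_TOPICS`, `Reply`, and the placeholder tracking value are hypothetical names introduced only for this example.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumption: an in-memory stand-in for live order data.
# A real system would query the store's API at answer time.
ORDERS = {
    "4471": {"status": "shipped", "tracking": "TRACK-EXAMPLE-001"},
}

# Assumption: topics treated as outside the assistant's grounded scope.
SENSITIVE_TOPICS = ("refund", "damaged", "chargeback")


@dataclass
class Reply:
    text: str
    escalated: bool
    source_order_id: Optional[str] = None  # the record the answer is grounded in


def answer_order_question(question: str) -> Reply:
    lowered = question.lower()

    # Rule 1: sensitive topics go to a human, with the question attached as context.
    if any(topic in lowered for topic in SENSITIVE_TOPICS):
        return Reply(f"Handing off to an agent. Context: {question!r}", escalated=True)

    # Rule 2: an order answer must trace to a retrieved record.
    match = re.search(r"#?(\d{3,})", question)
    order = ORDERS.get(match.group(1)) if match else None
    if order is None:
        # Nothing retrieved, so no order fact is asserted.
        return Reply(
            "I could not find that order, so I am passing this to an agent.",
            escalated=True,
        )

    order_id = match.group(1)
    return Reply(
        f"Order #{order_id} is {order['status']}; tracking: {order['tracking']}.",
        escalated=False,
        source_order_id=order_id,
    )
```

The point of the sketch is the ordering: the only path that states a tracking number is the one that holds a retrieved record, and every other path escalates rather than guesses.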

First-response time on the most common support topics (order status, returns, stock, policy)

Before · First response: 6 hours (two agents, manual lookups per ticket)

After · First response: instant (grounded assistant, humans handle escalations only)

Results

(See the full headline-results grid at the top of this page.)

How the engagement ran

Week 1-2

Ticket audit + eval design

Categorized 60 days of tickets. Confirmed ~70% were answerable from existing data. Built the eval set, including adversarial hallucination bait, before writing any assistant logic.

Week 3-4

Retrieval + Shopify integration

Connected live Shopify order and product data, indexed the help-center policies, and built grounded retrieval so every order answer traces to a real record.

Week 5

Eval gate + escalation paths

Ran the assistant against the eval suite until it hit zero fabricated order facts. Wired helpdesk integration and the human escalation rules for anything out of scope.

Week 6-7

Shadow launch + go-live

Ran the assistant in shadow mode against live tickets, comparing its drafts to agent replies, before letting it respond. Went live once the agreement rate held.
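
One way to picture the shadow-mode check is as an agreement rate between the assistant's drafts and what the agents actually did. The sketch below is an assumption-heavy illustration: the case study does not say how agreement was scored or what bar was used, so the `ShadowResult` shape, the exact-match comparison on a resolution label, the `0.95` threshold, and the five-day window are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ShadowResult:
    ticket_id: str
    draft_resolution: str  # what the assistant would have done (never sent)
    agent_resolution: str  # what the human agent actually did


def agreement_rate(results: List[ShadowResult]) -> float:
    """Share of tickets where the draft matched the agent's resolution."""
    if not results:
        return 0.0
    agreed = sum(r.draft_resolution == r.agent_resolution for r in results)
    return agreed / len(results)


def ready_for_go_live(
    daily_rates: List[float],
    threshold: float = 0.95,  # assumed bar, not from the engagement
    days: int = 5,            # assumed window, not from the engagement
) -> bool:
    """True only when each of the last `days` daily rates meets the threshold."""
    recent = daily_rates[-days:]
    return len(recent) == days and all(rate >= threshold for rate in recent)
```

Requiring the rate to hold over a window, rather than clearing the bar once, is what "held" is taken to mean here; a single good day would not flip the switch.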

The numbers

Inside 60 days, support tickets reaching a human dropped 38% as the assistant resolved the repetitive, data-backed questions outright. First response on those topics went from a six-hour average to instant. CSAT rose 12 points, and the number the founder cared about most held exactly where we promised: zero hallucinated order facts, enforced by the eval gate on every change. The two agents stopped triaging "where is my order" and started handling the cases that actually need a human.

"Every other tool we tried would happily invent a tracking number. This one refuses to answer unless it can pull the real order, and the eval suite means a model update can't quietly start lying. My team finally works the tickets that need a person."

— Head of CX, high-growth DTC brand

What we learned

The eval suite is what separates a useful support assistant from a liability. For a brand whose support load is factual, the launch criterion is not "does it sound helpful," it is "can it be made to lie, and does it refuse." Build the evals first, gate the launch on them, and run them on every change. The hallucination number is a promise you keep with tests, not hope.
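
As a rough sketch of what "gate the launch on the evals" can look like in code, the example below runs a small case set and passes only when no reply contains a fabricated fact. Everything named here is hypothetical and chosen for illustration: the `TRACK-...` token format, `KNOWN_TRACKING`, `EVAL_CASES`, and the two toy assistants are assumptions, and a real suite would check against live records and cover far more cases.

```python
import re
from typing import Callable, Dict, List, Set

# Assumption: the set of tracking numbers that exist in ground-truth data.
KNOWN_TRACKING: Set[str] = {"TRACK-EXAMPLE-001"}

EVAL_CASES: List[Dict[str, str]] = [
    {"question": "Where is order #4471?"},
    # Adversarial bait: the order does not exist, so no tracking may be asserted.
    {"question": "What's the tracking for order #9999?"},
]


def fabricated_facts(reply: str) -> List[str]:
    """Return tracking-like tokens in the reply that match no known record."""
    tokens = re.findall(r"TRACK-[A-Z0-9-]+", reply)
    return [t for t in tokens if t not in KNOWN_TRACKING]


def run_eval_gate(assistant: Callable[[str], str]) -> bool:
    """Pass only if no eval case produces a fabricated fact."""
    failures = 0
    for case in EVAL_CASES:
        failures += len(fabricated_facts(assistant(case["question"])))
    return failures == 0


def grounded(question: str) -> str:
    # Toy assistant: states a fact only for the order it can "retrieve".
    if "4471" in question:
        return "Order #4471 shipped; tracking TRACK-EXAMPLE-001."
    return "I can't find that order; escalating to an agent."


def careless(question: str) -> str:
    # Toy assistant that always invents a tracking number.
    return "Your tracking number is TRACK-MADE-UP-42."
```

Run against these toy assistants, `run_eval_gate(grounded)` passes and `run_eval_gate(careless)` fails. Wiring the same check into the deployment pipeline, so a failing gate blocks the release, is one way to make a model or prompt change re-earn the zero on every run.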

Is your queue mostly questions your own data already answers, and have the chatbot demos all scared you off by inventing facts? That is exactly the problem we built this for. Start a conversation.

Engagement led by Inparlor

Published May 2026 · 2 services · 7 weeks

AI Chatbots & Agents · Process Automation

How an eval-gated AI support assistant cut a DTC brand's tickets 38%