Most companies adding AI in 2026 are not AI companies. You sell HVAC software, or run a supplements brand, or operate an accounting practice. You don't need a research lab. You need to bolt a few capable models onto a product that already works, without setting money on fire or shipping something that hallucinates a customer's invoice.
This is a guide for that. No transformer math. Just where AI earns its keep, how to build it so it doesn't embarrass you, and what it actually costs.
Where AI pays off (and where it doesn't)
After shipping AI features into a dozen non-AI products, the pattern is consistent. AI earns its keep in four places.
30-50%
Support ticket deflection (well-scoped)
2-4x
Faster intake / data entry
60-80%
Search relevance lift over keyword
5-15 hrs/wk
Saved by internal copilots per ops person
Support. Answering repetitive questions from your own docs and ticket history. This is the highest-ROI, lowest-risk starting point for almost everyone.
Intake and data entry. Turning a messy email, PDF, or voicemail into structured fields. Insurance claims, job requests, invoices, lead forms. Boring, high-volume, and AI is very good at it.
Search. Semantic search over your own content, so "the thing that stops the AC from short-cycling" finds the right article even when the words don't match.
Internal copilots. Tools your own team uses to draft, summarize, or look things up. Low stakes because a human reviews everything before it leaves the building.
Where it does not pay off yet: anything where a wrong answer is expensive and unrecoverable, anything you can solve with a SQL query and an if-statement, and anything you're adding because a competitor put "AI" in their headline. If you can't name the metric it moves, you don't have a feature, you have a press release.
RAG vs fine-tuning, in plain terms
These are the two ways to make a general model behave like it knows your business. People agonize over the choice. They shouldn't.
RAG (retrieval-augmented generation) means: when a question comes in, you look up the relevant documents from your own data and paste them into the prompt, then ask the model to answer using only those. The model's knowledge of your business lives in a database you control.
Fine-tuning means: you retrain the model's weights on examples so the behavior is baked in. The knowledge lives in the model itself.
The honest rule for 2026:
Start with RAG. Ninety percent of "we need to fine-tune a model on our data" turns out to be a retrieval problem, and retrieval is cheaper, faster, and updatable. Consider fine-tuning later, and only for narrow behavioral problems like consistent output formatting or a specialized classifier.
Build eval-first or don't build at all
This is the part teams skip, and it's the part that separates a feature you trust from a demo that breaks in production.
An eval is a test suite for non-deterministic output. Before you ship, you write 50 to 200 real examples with the answer you'd accept, then you score every change against them. When you tweak the prompt, swap the model, or change retrieval, the eval tells you whether you made it better or worse. Without it, you're tuning by vibes, and vibes don't survive contact with real users.
The minimum viable eval pipeline:
- Collect real inputs. Pull actual support tickets, real intake forms, genuine search queries. Synthetic examples lie to you.
- Define acceptance per example. Sometimes exact match, more often a rubric scored by a second model plus spot-checks by a human.
- Run it on every change. Wire it into CI so a prompt edit can't ship if it drops the score.
- Track a confidence floor. Below a threshold, the system says "I'm not sure, here's a human" instead of guessing.
The feature isn't done when the demo works. It's done when the eval is green and you'd bet payroll on the score.
We treat this the way we treat tests on any other system at our AI Chatbots & Agents practice, because that's exactly what it is.
What it actually costs
Numbers, because everyone dances around them. These are 2026 ballparks for a single, well-scoped feature on a small or mid-size product.
$15K-$45K
Build a scoped RAG support assistant
$200-$2K/mo
Inference + vector store at SMB volume
$0.30-$3
Per 1K typical RAG interactions
$8K-$25K
Eval + monitoring setup (one-time)
The build cost is mostly engineering, not API bills. At SMB and mid-market volumes, the monthly inference cost is almost never the thing that hurts. What hurts is shipping without evals, then paying engineers to firefight regressions you can't measure. Spend the money on the eval harness. It's the cheapest insurance you'll buy.
If your quote is dominated by GPU rental or "training," ask hard questions. For most non-AI companies, that's a sign someone is solving a problem you don't have.
The failure modes nobody warns you about
A few specifics worth internalizing.
Hallucination is a UX problem, not just a model problem. You can't eliminate it, so design around it: cite sources, show confidence, and make the handoff to a human one click. A model that says "I'm not certain" beats one that's confidently wrong.
Retrieval rot is the silent killer. Your assistant works great on launch day, then six months later it's quoting a pricing page you deleted. Re-index on a schedule and alert when the index is stale.
Prompt injection is real once you ingest untrusted text. If your RAG pulls from user-submitted content or scraped pages, treat that text as hostile. Never let retrieved content override your system instructions, and never give the model write access it doesn't strictly need.
Cost creep hides in retries. A naive retry-on-failure plus a chatty agent loop can turn a $300 month into a $4,000 month without a single new user. Set hard spend caps and log token counts per request from day one.
A sane first 30 days
If you're starting from zero, here's the sequence that works:
- Week 1. Pick one feature with a measurable metric. Support deflection is the usual winner. Collect 100 real examples.
- Week 2. Build the eval harness against those examples first. Yes, before the feature.
- Week 3. Ship a RAG prototype behind a flag, internal users only, every answer logged.
- Week 4. Tune against the eval, set the confidence floor and human handoff, then expose it to a small slice of real users.
That's it. One feature, eval-first, RAG before fine-tuning, costs you can predict. Do that and your second AI feature is easy, because the hard parts, evals and monitoring, are already in place.
If you want a second opinion on where AI fits in your specific product, send us a brief with the one workflow you'd most like to automate and what a wrong answer would cost you. That second number tells us more than the first.
Read next:
- Build vs buy: a framework for custom software
- What custom software actually costs in 2026
- How to scope an MVP that ships in six weeks
Related: