We built a support bot that cut tickets by 34%. Nobody cared about the model.
A client wanted GPT-4 on everything. What actually moved the needle was boring routing, clean docs, and knowing when to hand off to a human.
Last spring a B2B SaaS team came to us with a familiar brief: "We need an AI support agent. Something like Intercom Fin, but ours." They had 400+ tickets a week, mostly the same twelve questions about billing, seat limits, and SSO setup. Their CEO had already picked a model tier in a slide deck. We didn't argue on the first call.
Week one wasn't about prompts. It was about reading 600 closed tickets and tagging what actually happened. Turns out 41% were answered by linking to two help articles. Another 22% needed a human because the user was angry, confused about a custom contract, or both. The remaining 37% were genuine product bugs misfiled as support. That last bucket wasn't an AI problem at all.
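For anyone repeating the tagging exercise, the tally itself is trivial once every ticket has a tag. The sketch below is illustrative only: the tag names are hypothetical, and the counts are back-calculated from the percentages above (41% and 22% of 600, plus the remainder), not exported from the client's helpdesk.

```python
from collections import Counter

# Hypothetical tag names. Counts are reconstructed from the reported
# percentages of 600 tickets, not taken from real ticket data.
tagged_tickets = (
    ["answered_by_help_article"] * 246
    + ["needs_human"] * 132
    + ["product_bug"] * 222
)


def tag_shares(tags):
    """Return each tag's share of all tickets as a rounded percentage."""
    counts = Counter(tags)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: round(100 * n / total) for tag, n in counts.items()}


shares = tag_shares(tagged_tickets)
for tag, pct in shares.items():
    print(f"{tag}: {pct}%")
```

The useful part is not the arithmetic but the forcing function: every ticket has to land in exactly one bucket before anyone talks about models.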
The unglamorous stuff that worked
We indexed their docs: not the whole marketing site, just the help centre and the internal runbooks engineers already trusted. We added a hard rule: if retrieval confidence was low, the bot said "I'm not sure" and opened a ticket with context attached. No hallucinated refund policies. You'd be surprised how many production bots skip that step because it looks weak in a demo.
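The hard rule is simple enough to sketch. This is a minimal illustration, not the code we shipped: the 0.75 threshold, the `RetrievedPassage` and `BotReply` types, and the assumption that retrieval returns a score between 0 and 1 are all hypothetical, and returning the passage text stands in for the real generation step.

```python
from dataclasses import dataclass
from typing import Optional

CONFIDENCE_THRESHOLD = 0.75  # assumed cutoff; tune it against reviewed conversations


@dataclass
class RetrievedPassage:
    text: str
    source: str
    score: float  # assumed retrieval similarity in [0, 1]


@dataclass
class BotReply:
    message: str
    citations: list
    ticket_opened: bool
    ticket_context: Optional[dict] = None


def answer_or_escalate(question, passages, threshold=CONFIDENCE_THRESHOLD):
    """Answer only when the best passage clears the threshold; otherwise open a ticket."""
    best = max(passages, key=lambda p: p.score, default=None)
    if best is None or best.score < threshold:
        return BotReply(
            message="I'm not sure. I've opened a ticket so a person can follow up.",
            citations=[],
            ticket_opened=True,
            # Context travels with the ticket so the human doesn't start from zero.
            ticket_context={
                "question": question,
                "sources_checked": [p.source for p in passages],
            },
        )
    # Stand-in for generation: a real bot would draft an answer from the passage.
    return BotReply(message=best.text, citations=[best.source], ticket_opened=False)
```

Note that an empty retrieval result takes the same path as a low score. Both mean the bot has nothing it can cite.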
- Intent routing before generation — billing vs technical vs "I want to talk to someone"
- Citations on every answer that wasn't a simple greeting
- A one-click escalate button that pre-filled what the user already tried
- Weekly review of conversations the bot abandoned (that's where the product gaps live)
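To make the first and third bullets concrete, here is a rough sketch of routing before generation and of the payload behind the escalate button. Treat it as an assumption-heavy illustration: the keyword lists, the `Intent` names, and the payload fields are invented for this post, and a production router would more plausibly be a trained classifier than substring matching.

```python
from enum import Enum


class Intent(Enum):
    BILLING = "billing"
    TECHNICAL = "technical"
    HUMAN = "human"
    UNKNOWN = "unknown"


# Hypothetical keyword lists. Order matters: a request for a person is
# checked first so it is never swallowed by a billing or technical match.
KEYWORDS = {
    Intent.HUMAN: ("talk to someone", "real person", "human", "agent"),
    Intent.BILLING: ("invoice", "billing", "refund", "seat", "proration"),
    Intent.TECHNICAL: ("sso", "saml", "error", "login"),
}


def route(message):
    """Pick an intent before any text is generated."""
    text = message.lower()
    for intent, phrases in KEYWORDS.items():
        if any(phrase in text for phrase in phrases):
            return intent
    return Intent.UNKNOWN


def escalation_payload(message, steps_tried):
    """Pre-fill the handoff so the user doesn't have to repeat themselves."""
    return {
        "intent": route(message).value,
        "message": message,
        "already_tried": list(steps_tried),
    }
```

Routing first means each intent can have its own document set and its own rules, and "I want to talk to someone" never reaches the generator at all.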
We shipped on a smaller model than they'd budgeted for. Nobody on their team noticed in blind tests. What they noticed was average first-response time dropping from four hours to under two minutes for the routable stuff.
What we'd do differently
We should have involved their finance lead earlier. The bot kept getting asked about proration on mid-cycle plan changes, and the help article was wrong. Fixing copy took one afternoon; fixing trust after three wrong answers took longer.
The model was maybe 15% of the outcome. The other 85% was permissions, retrieval, and being honest when we didn't know.
If you're scoping something similar, start with ticket taxonomy, not model benchmarks. Build the escalation path first. Demo the failure cases on purpose. Stakeholders remember the one wrong answer, not the ninety-nine correct ones.
We're not anti-ambition; we build agent systems for a living. But the wins we've seen in production look more like well-run operations with a language model attached than the other way around.