Building Multi-Agent Customer Service with CrewAI: What We Learned in Production

There's a version of this story I could tell where everything goes smoothly. Where the agents communicate flawlessly, tickets resolve themselves, and we ship metrics with clean charts. That version is not this story.

In mid-2023, when LangChain was still finding its footing and CrewAI was freshly out of the gate, we started building a multi-agent customer service system at HappyCredit — a fintech platform running at 10,000+ concurrent users with 99.9% uptime requirements. We were handling a volume of support tickets that the human team couldn't keep pace with, and I wanted to see whether agents could actually absorb that load in a meaningful way.

This is what we built, what broke, and what I'd do differently now.

Why Multi-Agent, Not a Single Chain

The first instinct when you're building something with LLMs for customer service is to write a single chain: take the ticket, classify it, route it, generate a response. One prompt pipeline, done.

We tried that. It didn't hold up.

The issue is that a single chain becomes fragile the moment the task complexity grows. When you ask one LLM call to simultaneously understand intent, look up account context, apply business rules, and draft a response — it starts to fail on edge cases in ways that are hard to debug. The outputs become inconsistent. You can't improve intent classification without inadvertently breaking response quality.

Multi-agent architecture forces separation of concerns in a way that's actually useful. Each agent has a narrow job and can be improved in isolation. That was the mental model that pushed us toward CrewAI.

The Crew: Three Agents, One Pipeline

The production setup had three agents working in sequence.

Agent 1: Intent Classifier. This agent's only job was to read the incoming support ticket and output a structured intent label. We kept this deliberately narrow: around 15 intent categories covering things like payment failures, reward redemption issues, account access, and general queries. The agent had access to a few-shot prompt with curated examples, and nothing else. No tools, no retrieval, just classification.

Agent 2: Router. The router received the classified intent and matched it against a routing ruleset: which team queue to assign, what priority level to apply, and whether this ticket could be auto-resolved or needed human review. It also queried a lightweight internal API to pull basic account metadata so the resolution agent had context.
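A ruleset like this can be expressed as a plain lookup table. The sketch below is illustrative only: the intent labels, queue names, and priority scale are hypothetical placeholders, not our actual rules, and the account metadata lookup is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingDecision:
    queue: str
    priority: int          # 1 = most urgent (hypothetical scale)
    auto_resolvable: bool


# Hypothetical ruleset: intent label -> routing decision.
ROUTING_RULES = {
    "payment_failure": RoutingDecision("payments", 1, False),
    "reward_redemption": RoutingDecision("rewards", 2, False),
    "password_reset": RoutingDecision("account_access", 3, True),
    "transaction_status": RoutingDecision("payments", 3, True),
}

# Unknown intents go to a human queue instead of being guessed at.
FALLBACK = RoutingDecision("general_review", 2, False)


def route(intent: str) -> RoutingDecision:
    """Return the routing decision for an intent label."""
    return ROUTING_RULES.get(intent, FALLBACK)
```

Keeping the rules in data rather than in the prompt means a routing change is a reviewable diff, and an unrecognized intent falls through to human review by default.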

Agent 3: Resolution Drafter. The resolution agent had the hardest job. It received the original ticket text, the intent label, the routing decision, and the account context, then drafted a response. For auto-resolvable categories, such as confirming a transaction status or sending a password reset link, it could produce the full response. For anything requiring human judgment (disputes, fraud flags, complex refund logic), it generated a draft that a human agent would review and send.

The three agents ran as a CrewAI crew with sequential task execution and explicit context passing between steps.
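To make the shape of sequential execution with explicit context passing concrete, here is a minimal sketch in plain Python. It deliberately does not use the CrewAI API; the three step functions are hypothetical stand-ins for the agents, with keyword checks where a real system would call an LLM.

```python
from typing import Callable

Context = dict
Step = Callable[[Context], Context]


def classify(ctx: Context) -> Context:
    # Stand-in for the classifier agent; a real system would call an LLM here.
    text = ctx["ticket_text"].lower()
    intent = "password_reset" if "password" in text else "general_query"
    return {**ctx, "intent": intent}


def route_ticket(ctx: Context) -> Context:
    # Stand-in for the router agent.
    return {**ctx, "auto_resolvable": ctx["intent"] == "password_reset"}


def draft(ctx: Context) -> Context:
    # Stand-in for the resolution drafter.
    status = "send" if ctx["auto_resolvable"] else "needs_human_review"
    return {**ctx, "draft_status": status}


def run_pipeline(ticket_text: str, steps: list[Step]) -> Context:
    """Run steps in order; each step sees everything produced before it."""
    ctx: Context = {"ticket_text": ticket_text}
    for step in steps:
        ctx = step(ctx)
    return ctx


result = run_pipeline("I forgot my password", [classify, route_ticket, draft])
```

Each step returns a new context rather than mutating the old one, which makes it easy to log exactly what every agent received and produced.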

What Actually Broke

I'll be direct: the first two months in production were humbling.

Hallucination on edge cases. The resolution drafter would occasionally confabulate account details — stating a transaction had been processed when it hadn't, or citing a policy that didn't exist. This was worst at the boundary of what the few-shot prompt had seen. We eventually mitigated this by adding a strict grounding instruction: the resolution agent was only permitted to use facts present in the context object passed to it, with explicit instructions not to infer or extrapolate. This helped significantly but never went to zero.
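A prompt-level grounding instruction can be paired with a cheap check after generation. The sketch below is one possible complement, not something described above: it flags any transaction ID that appears in a draft but not in the context object. The `TXN-` ID format and the shape of the context dictionary are assumptions made for illustration.

```python
import re

# Hypothetical ID format, used only for illustration.
TXN_PATTERN = re.compile(r"\bTXN-\d+\b")


def ungrounded_transaction_ids(draft: str, context: dict) -> set[str]:
    """Return transaction IDs mentioned in the draft but absent from context."""
    known = {t["id"] for t in context.get("transactions", [])}
    mentioned = set(TXN_PATTERN.findall(draft))
    return mentioned - known


context = {"transactions": [{"id": "TXN-1001", "status": "pending"}]}
ok_draft = "Your transaction TXN-1001 is still pending."
bad_draft = "Your transaction TXN-2002 has been processed."
```

A non-empty result would send the draft to human review instead of out to the customer. A check like this only catches identifiers it can pattern-match; it would not catch an invented policy.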

Inter-agent context degradation. When passing context between agents, we initially used free-form text summaries. By the time the resolution agent got its input, some nuance from the original ticket had been lost or reframed by upstream agents. The fix was moving to structured JSON context objects with explicit field contracts between agents. If the router didn't populate a required field, the resolution agent raised an error rather than guessing.
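A field contract of this kind might look like the following sketch. The field names and the `ContractError` class are hypothetical; the point is only that a missing field fails loudly at the boundary instead of being papered over downstream.

```python
import json
from dataclasses import dataclass, fields


class ContractError(ValueError):
    """Raised when an upstream agent omits a required field."""


@dataclass(frozen=True)
class ResolutionInput:
    # Hypothetical contract between the router and the resolution agent.
    ticket_text: str
    intent: str
    queue: str
    priority: int
    account_tier: str


def parse_resolution_input(payload: str) -> ResolutionInput:
    """Parse the router's JSON output, rejecting missing or empty fields."""
    data = json.loads(payload)
    required = [f.name for f in fields(ResolutionInput)]
    missing = [name for name in required if data.get(name) in (None, "")]
    if missing:
        raise ContractError(f"missing required fields: {missing}")
    return ResolutionInput(**{name: data[name] for name in required})
```

Extra keys in the payload are ignored here; a stricter contract could reject them as well.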

OpenAI rate limits. Running three LLM calls per ticket at production volume put us hard against API rate limits during peak hours. We had not sized this correctly. The solution involved two things: aggressive caching for the classifier (many tickets have highly similar intent patterns, so we cached ticket embeddings and ran a vector similarity lookup before making an API call), and a queuing layer that smoothed burst traffic. The queuing actually improved average resolution latency too, because it prevented the thundering-herd problem, where a burst of requests hits the limit at once and then retries at once.
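The caching idea can be sketched as follows. This is a toy version under stated assumptions: `embed` is a bag-of-words stand-in for a real embedding model, the 0.8 similarity threshold is an arbitrary placeholder, and the linear scan would be replaced by a vector index at production scale.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would call an embedding model.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class IntentCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold  # hypothetical cutoff; tune on real data
        self.entries: list[tuple[Counter, str]] = []

    def lookup(self, text: str):
        """Return the cached intent of the most similar ticket, or None."""
        query = embed(text)
        best_score, best_intent = 0.0, None
        for vector, intent in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_intent = score, intent
        return best_intent if best_score >= self.threshold else None

    def store(self, text: str, intent: str) -> None:
        self.entries.append((embed(text), intent))


def classify_with_cache(text, cache, llm_classify):
    cached = cache.lookup(text)
    if cached is not None:
        return cached            # cache hit: no API call
    intent = llm_classify(text)  # cache miss: one API call
    cache.store(text, intent)
    return intent
```

The threshold is the sensitive part: set it too low and near-miss tickets inherit the wrong intent, set it too high and the cache rarely hits.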

CrewAI orchestration overhead. In 2023, CrewAI's crew execution model introduced non-trivial latency between agent steps. For an async support pipeline, this was acceptable. But it's something to be aware of if you're building anything close to real-time.

Where the 25% Improvement Actually Came From

We measured a 25% reduction in average customer service resolution time. I want to be honest about where that came from, because it's not the story you'd expect.

The biggest contributor was triage speed, not response generation. Before the agents, tickets sat in a queue waiting for a human to read, classify, and assign them. The classifier and router collapsed that wait time significantly. Tickets were pre-classified and priority-sorted before a human ever touched them. That alone drove a large portion of the improvement.

The second contributor was auto-resolution coverage. For a meaningful subset of high-volume, low-complexity tickets — password resets, transaction confirmations, basic reward balance queries — the agents handled end-to-end resolution. Removing those from the human queue gave human agents more time per complex ticket, which improved quality on the hard cases.

What contributed less than expected? Draft quality on complex tickets. Human agents often rewrote the resolution agent's drafts substantially. The drafts weren't bad, but they were generic in ways that didn't match HappyCredit's voice. This is a prompt engineering and fine-tuning problem we didn't fully solve.

What I'd Do Differently Now

Invest earlier in evals. We were flying mostly blind on agent output quality for the first few months. Building a structured evaluation set — even 200–300 labeled examples across intent categories — would have let us catch regressions when we changed prompts. This is table stakes for any production LLM system and I underweighted it.
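Even a very small harness gets you most of the way. The sketch below assumes a labeled set of (ticket text, expected intent) pairs and a `classify` callable wrapping whatever prompt version is under test; the function names and the 0.02 tolerance are illustrative choices.

```python
from collections import defaultdict


def per_intent_accuracy(examples, classify):
    """examples: list of (ticket_text, expected_intent) pairs."""
    correct, total = defaultdict(int), defaultdict(int)
    for text, expected in examples:
        total[expected] += 1
        if classify(text) == expected:
            correct[expected] += 1
    return {intent: correct[intent] / total[intent] for intent in total}


def regressions(baseline, candidate, tolerance=0.02):
    """Intents whose accuracy dropped by more than `tolerance`."""
    return sorted(
        intent
        for intent, base in baseline.items()
        if base - candidate.get(intent, 0.0) > tolerance
    )
```

Running `per_intent_accuracy` against the current and the proposed prompt, then passing both results to `regressions`, turns "did this prompt change break anything?" into a check that can gate a deploy. Reporting per intent matters because a change can raise overall accuracy while quietly breaking one low-volume category.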

Use structured outputs from day one. OpenAI's function calling / structured output APIs were maturing rapidly in 2023-2024. Moving to strict output schemas earlier would have reduced the inter-agent context degradation problem significantly.

Design for human-in-the-loop explicitly. The handoff between agent output and human review was treated as an afterthought initially. It needed to be a first-class part of the design: a clear UI for human agents, easy overrides, and explicit escalation paths with an audit trail.

Consider fewer, smarter agents. The three-agent split made sense for our problem, but I've seen teams over-architect agent crews when a well-structured two-step pipeline would do. The overhead of each agent hop — latency, context serialization, potential information loss — is real. Don't add agents for elegance; add them when the separation of concerns genuinely improves the system.

What This Era Taught Me About Multi-Agent Systems

We built this before "agentic AI" became a buzzword. There was no playbook. The CrewAI docs were sparse, the community was small, and most of what we learned came from production failures.

The honest lesson is that multi-agent orchestration is a distributed systems problem more than it's an AI problem. The same things that make distributed systems hard — state management, failure modes, observability, latency accumulation — make multi-agent systems hard. The LLM component adds new failure modes (hallucination, prompt sensitivity), but the infrastructure challenges are fundamentally familiar.

If you're building this now, you have better tooling, better models, better structured output primitives. Use them. But the core discipline — narrow contracts between components, structured state passing, observable execution, explicit human escalation paths — that part doesn't change with the tooling.

The 25% improvement in resolution time was real. So was every broken thing we had to fix to get there.

Shubham Gupta is a Tech Lead and Junior Associate Architect at Simplilearn, formerly Lead Software Engineer at HappyCredit. He writes about building AI systems in production.
