Building Voice AI Agents for Sales: Our Voice AI Platform Production Story
We went live on April 30, 2026. The voice agents started picking up sales leads. The sales team was not in the office: it was a Thursday, and they had already signed off for the day. The agents handled it.
That moment felt like something, and not just because the demo worked. It felt like something because we'd been staring at this problem for months: Simplilearn, an edtech platform operating at scale, generates leads continuously. Sales teams operate on a schedule. The gap between lead arrival and first human contact was costing us conversions, and the loss was sharpest on weekends — Saturdays and Sundays, when the team was fully offline and leads were going cold.
The solution we shipped was a voice AI pipeline built on our voice AI platform, deployed serverless on AWS. This is the engineering story of how we got there.
Why Voice, and Why This Is Harder Than Text
The first question worth answering: why voice at all? Text bots for lead qualification exist. They're cheaper to run, easier to build, and simpler to debug.
The answer is conversion rate. Voice converts at a meaningfully different rate than text for this use case. A lead who just expressed intent — filled out a form, clicked an ad, inquired about a course — is in a short window of high intent. A phone call in that window feels like personal attention. A WhatsApp message feels like a queue. For an edtech sales motion where course costs are in the thousands, that perception gap matters.
So voice it was. But voice AI has constraints that text doesn't.
Latency is brutal. Human conversation has an implicit contract: when one person stops talking, the other responds within 300–500ms. Beyond that, the conversation starts to feel broken. For text, a 2-second LLM response is acceptable. For voice, it's disqualifying. Every component in the pipeline — speech-to-text, LLM inference, text-to-speech, network round trip — has to fit inside that budget. Collectively.
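To make the "collectively" part concrete, here is a minimal sketch of a per-turn budget check. The component names and every millisecond value are illustrative assumptions, not measurements from our system; the point is only that the components share one budget.

```python
# Hypothetical per-turn latency budget. All numbers are illustrative
# assumptions, not production measurements.
BUDGET_MS = 500

component_ms = {
    "speech_to_text": 120,
    "llm_first_token": 180,
    "text_to_speech_first_audio": 100,
    "network_round_trip": 60,
}


def remaining_budget(components: dict[str, int], budget_ms: int = BUDGET_MS) -> int:
    """Return the headroom left once all components are summed.

    A negative result means the turn is over budget.
    """
    return budget_ms - sum(components.values())


headroom = remaining_budget(component_ms)
print(f"total={sum(component_ms.values())}ms headroom={headroom}ms")
```

With numbers like these, a single slow component consumes the headroom for the whole turn, which is why no layer can be tuned in isolation.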
Silence and interruptions. Real phone calls have background noise, partial utterances, filler words, and interruptions. The agent needs to handle a caller who starts talking while the agent is mid-sentence, who says "um, wait" three times before forming a question, or who puts the call on hold and comes back. Text conversation is largely sequential; voice is not.
Conversation flow isn't a script. Early voice bots were literally scripted — if user says X, say Y. That breaks the moment a caller goes off-path, which is most of the time. A qualification conversation for an edtech lead might start with course interest and end with them asking about EMI options, refund policies, or a completely different course than they initially indicated. The agent needs to handle these pivots without losing the thread of what it was trying to accomplish.
These are not hypothetical problems. They were the actual engineering problems we worked through.
Why Our Voice AI Platform
The infrastructure decision was to build on our voice AI platform, which is open source. I want to explain the reasoning because "we used the open-source thing" sounds like a default choice, but it wasn't.
The managed voice AI services of 2025–2026 (the Synthflows, the Vapis, and various others) offer fast time-to-demo but limited control. Our voice AI platform is self-hosted, which means we own the pipeline. For a platform like Simplilearn, where every lead conversation is potentially sensitive sales data, having full control over where call transcripts are stored and how they're processed was a hard requirement.
Our voice AI platform also gave us the ability to tune every component in the pipeline independently. The STT model, the LLM, the TTS voice, and the VAD (voice activity detection) sensitivity are all exposed as configuration rather than black boxes. When we had a latency problem in production, we could instrument and optimize each layer. With a managed service, you file a support ticket.
The tradeoff is operational burden. We own the infrastructure, which means we own the failures. That's a reasonable tradeoff at our scale and with our team's capability, but it's a real tradeoff.
The Pipeline: Lead to CRM Handoff
Here's how a lead flows through the system.
Inbound trigger. A lead comes in — form submission, ad click, whatever the acquisition source. This fires a serverless function on AWS Lambda. The function enriches the lead with basic context (course interest, geography, source attribution) and queues a call.
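A minimal sketch of what such a trigger function could look like. The field names, the API Gateway style `event["body"]` shape, and the injected `enqueue` callable are all assumptions made for illustration; in a real deployment `enqueue` would wrap whatever queue the call scheduler reads from.

```python
import json
import time
from typing import Callable


def enrich_lead(raw: dict) -> dict:
    """Attach the basic context a call needs. Field names are hypothetical."""
    return {
        "lead_id": raw["lead_id"],
        "phone": raw["phone"],
        "course_interest": raw.get("course_interest", "unknown"),
        "geography": raw.get("country", "unknown"),
        "source": raw.get("utm_source", "unknown"),
        "created_at": raw.get("created_at", time.time()),
    }


def make_handler(enqueue: Callable[[str], None]):
    """Build a Lambda-style handler around an injected queue writer."""

    def handler(event: dict, context: object) -> dict:
        # Assumes the lead arrives as a JSON string in event["body"].
        lead = enrich_lead(json.loads(event["body"]))
        enqueue(json.dumps(lead))
        return {"statusCode": 202, "body": json.dumps({"queued": lead["lead_id"]})}

    return handler
```

Injecting the queue writer keeps the enrichment logic testable without any cloud dependency: in a local test, `make_handler(some_list.append)` is enough to see exactly what would have been queued.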
Call initiation. Our voice AI platform picks up the queued call intent and dials the lead's number. This happens within a target window of a few minutes from lead creation. The speed-to-call metric was one of the key design constraints we tracked.
Qualification flow. The agent conducts a structured but flexible qualification conversation. The core objectives: confirm course interest, understand timeline, surface any blockers (price, schedule constraints, competitive evaluation), and gauge intent strength. The conversation is not scripted — the agent operates on an objective and conversation guidelines, not a decision tree. This is the part that required the most iteration.
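One way to picture "objectives, not a decision tree" is a small state object that tracks which objectives have been covered and produces guidance for the next turn, regardless of the order in which the caller raises things. This is a simplified sketch; the objective names mirror the list above, but the class and its methods are hypothetical and not our actual prompt machinery.

```python
from dataclasses import dataclass, field

# Objective names taken from the qualification goals described above.
OBJECTIVES = ("course_interest", "timeline", "blockers", "intent_strength")


@dataclass
class QualificationState:
    answers: dict[str, str] = field(default_factory=dict)

    def record(self, objective: str, answer: str) -> None:
        """Store what the caller said about one objective, in any order."""
        if objective not in OBJECTIVES:
            raise ValueError(f"unknown objective: {objective}")
        self.answers[objective] = answer

    def remaining(self) -> list[str]:
        """Objectives not yet covered, in their preferred order."""
        return [o for o in OBJECTIVES if o not in self.answers]

    def guidance(self) -> str:
        """Text that could be appended to the agent prompt on each turn."""
        todo = self.remaining()
        if not todo:
            return "All objectives covered. Summarize and close the call."
        return "Answer the caller first, then steer toward: " + ", ".join(todo)
```

If a caller opens with a question about EMI options, the agent can record that under `blockers` and continue; nothing forces the conversation back to a fixed first step.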
Transcript logging. Every call is fully transcribed and logged. This isn't optional — it's how the sales team understands what happened before they pick up the conversation, and it's how we measure agent quality over time.
CRM handoff. At the end of a qualifying call, the agent creates or updates the CRM record with a structured summary: lead quality signal, key points from the conversation, recommended follow-up action, and a priority flag for the human sales rep. The rep gets context, not just a raw transcript.
For leads the agent couldn't qualify (disconnected, not interested, wrong number), the record is updated with that outcome and removed from the active queue.
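The handoff record described above can be sketched as a small typed structure. The field names, the outcome values, and the `to_crm_payload` shape are assumptions for illustration; the real format depends on the CRM and on what the sales team asked for.

```python
from dataclasses import asdict, dataclass, field
from enum import Enum


class Outcome(str, Enum):
    QUALIFIED = "qualified"
    NOT_INTERESTED = "not_interested"
    WRONG_NUMBER = "wrong_number"
    DISCONNECTED = "disconnected"


@dataclass
class HandoffSummary:
    lead_id: str
    outcome: Outcome
    lead_quality: str = "unknown"  # hypothetical scale, e.g. "hot" / "warm" / "cold"
    key_points: list[str] = field(default_factory=list)
    recommended_action: str = ""
    priority: bool = False

    def keep_in_active_queue(self) -> bool:
        """Only qualified leads stay in the queue for a human follow-up."""
        return self.outcome is Outcome.QUALIFIED

    def to_crm_payload(self) -> dict:
        """Flatten to a plain dict suitable for a JSON-based CRM API."""
        payload = asdict(self)
        payload["outcome"] = self.outcome.value
        payload["active"] = self.keep_in_active_queue()
        return payload
```

Keeping the summary structured, rather than handing over free text, is what lets the rep see quality, blockers, and next action at a glance.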
The Saturday Problem, and What It Taught Us About Handoff Design
The original use case was specifically Saturday and Sunday leads — the highest-loss period, when sales were offline and leads were going cold over 48-hour windows.
This turned out to be a rich design constraint, not just a scheduling gap.
When the sales team is offline, the agent isn't just buying time — it's the only contact. There's no human to escalate to mid-call, no sales manager to patch in if something gets complicated. The agent has to handle the full range of what a real first-contact call looks like, or gracefully close the call with a clear callback commitment that the human team will actually honor on Monday.
That "graceful close" path taught us something important about AI-human handoff design. The handoff isn't just a data transfer — it's a commitment the AI makes on behalf of the human. When the agent says "a course advisor will call you Monday morning," the sales team has to actually call Monday morning, and the CRM record has to make it easy for them to pick up the conversation with context.
If that handoff is broken — if the rep calls without context, or calls too late, or the record is vague — the AI's work is wasted and the lead experience is worse than if no one had called at all. This forced us to be very deliberate about the CRM integration. The structured handoff summary format went through several iterations before the sales team said it actually helped them.
The Latency Work
The <500ms turn-around requirement was the hardest engineering constraint on the project.
We broke it down by component and measured each independently. STT latency, LLM inference time, TTS synthesis, audio delivery. The sum of the median latencies was within budget; the tail latencies were not. P95 response time was over a second in early testing.
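The gap between "medians add up fine" and "the tail does not" is easy to reproduce with synthetic data. The sketch below uses made-up, randomly generated per-layer samples (not our measurements) and compares the sum of per-layer medians with the end-to-end p50 and p95 for the same turns.

```python
import random
import statistics


def percentile(samples: list[float], pct: int) -> float:
    """Return the pct-th percentile (1..99) via statistics.quantiles."""
    return statistics.quantiles(samples, n=100, method="inclusive")[pct - 1]


def summarize(layers: dict[str, list[float]]) -> dict[str, float]:
    """Compare the sum of per-layer medians with end-to-end p50 and p95.

    Index i in every layer's list is assumed to belong to the same turn.
    """
    turns = [sum(vals) for vals in zip(*layers.values())]
    return {
        "sum_of_medians": sum(statistics.median(v) for v in layers.values()),
        "end_to_end_p50": percentile(turns, 50),
        "end_to_end_p95": percentile(turns, 95),
    }


# Synthetic samples: a fixed base latency plus an exponential tail per layer.
# The base and tail values are invented for illustration only.
rng = random.Random(0)
layers = {
    name: [base + rng.expovariate(1 / tail) for _ in range(1000)]
    for name, base, tail in [("stt", 80, 40), ("llm", 150, 120), ("tts", 70, 40)]
}
report = summarize(layers)
for key, value in report.items():
    print(f"{key}: {value:.0f} ms")
```

Tracking the end-to-end percentiles alongside the per-layer ones shows which layer's tail is actually responsible for the slow turns.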
The interventions:
Streaming TTS. Rather than waiting for the full LLM response before starting TTS, we streamed TTS from the first sentence of the LLM output. This required the LLM prompt to be structured so that useful, speakable content appeared early in the response.
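The core of sentence-level streaming is a small buffer between the LLM token stream and the TTS call. The following is a simplified sketch of that idea, not our production code; the regex-based sentence boundary is an assumption and would need hardening for abbreviations, decimals, and similar cases.

```python
import re
from typing import Iterable, Iterator

# A sentence boundary is whitespace that follows ., ! or ?
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each complete sentence as soon as the token stream finishes it.

    A sentence is released when the whitespace after its closing
    punctuation arrives, so TTS can start before the LLM is done.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    # Flush whatever remains once the stream ends.
    if buffer.strip():
        yield buffer.strip()


llm_tokens = ["Sure", ".", " The", " course", " runs", " six", " months", ".", " Want", " details", "?"]
for sentence in sentences_from_tokens(llm_tokens):
    print(sentence)  # each sentence would be handed to TTS here
```

This is also why the prompt structure matters: if the first sentence is filler, the first audio the caller hears is filler.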
LLM prompt tuning for brevity. Long responses are slow to generate and slow to speak. We tuned the agent prompts heavily toward concise responses — one thought at a time, no preambles, no "great question" filler. This also made the conversation feel more natural.
VAD sensitivity tuning. Getting the voice activity detection calibrated correctly was more impactful than I expected. If it declares end-of-speech too eagerly, the agent interrupts the caller. If it waits too long, there are dead pauses after the caller finishes speaking. We spent a non-trivial amount of time on this.
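The tradeoff can be illustrated with a toy end-of-turn detector that works on per-frame speech/silence flags. This is a generic silence-timeout sketch, not the platform's actual VAD; the 20 ms frame size and the timeout values are assumptions chosen for the example.

```python
from typing import Iterable, Optional


def end_of_turn_frame(
    speech_flags: Iterable[bool],
    frame_ms: int = 20,
    silence_timeout_ms: int = 400,
) -> Optional[int]:
    """Return the index of the frame where the caller's turn is declared over.

    speech_flags holds one VAD decision per audio frame (True = speech).
    The turn ends once speech has been heard and the trailing silence
    reaches silence_timeout_ms. Returns None if that never happens.
    """
    frames_needed = -(-silence_timeout_ms // frame_ms)  # ceiling division
    heard_speech = False
    silent_run = 0
    for i, is_speech in enumerate(speech_flags):
        if is_speech:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= frames_needed:
                return i
    return None


# Caller speaks, pauses briefly (100 ms), speaks again, then stops.
flags = [True] * 10 + [False] * 5 + [True] * 10 + [False] * 30
patient = end_of_turn_frame(flags, silence_timeout_ms=400)
eager = end_of_turn_frame(flags, silence_timeout_ms=100)
print(patient, eager)
```

With the short timeout the detector fires inside the caller's brief pause, which is the "agent interrupts" failure; with the long one it waits through the pause but adds that much dead air after every real end of turn.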
Regional AWS deployment. The Lambda functions and our voice AI platform infrastructure were deployed in a region with low latency to our primary lead geography. This is obvious in hindsight but easy to overlook when you're focused on application-layer optimization.
Where It Sits in the Broader Picture
Our voice AI platform pipeline is one piece of a broader AI-first tooling effort at Simplilearn. The 12× ROI figure for the AI stack as a whole is a real number, and voice agents for weekend lead coverage contribute to that by addressing a specific, quantifiable loss: leads that would otherwise go cold.
The next phase is expanding coverage to Skillup leads — Simplilearn's free-tier product. This is a different qualification motion: the leads are higher volume, lower average intent, and the conversion economics are different. The agent conversation flow will need to adapt accordingly. What works for a high-intent paid course inquiry doesn't map directly to a free-tier user who's casually exploring.
That expansion work is in design now.
What I'd Tell Someone Starting This Today
Voice AI for sales is genuinely useful, but don't underestimate how much of the work is in the details that don't show up in demos.
The demo works when the caller is cooperative, the audio is clean, and the conversation stays on path. Production is when the caller asks about something the agent wasn't designed to handle, or the call drops mid-qualification, or the CRM integration fails silently and the sales rep calls without any context.
Build the handoff design as carefully as the agent itself. Measure latency at every layer, not just end-to-end. And talk to the sales team early and often — they'll tell you what actually matters in the first conversation with a lead, and that will shape the qualification flow more than any prompt engineering you'll do on your own.
We went live April 30. The agents are handling real calls. That's not the end of the story — it's closer to the end of the beginning.
Shubham Gupta is a Tech Lead and Junior Associate Architect at Simplilearn. He builds AI-first engineering systems at scale.