Building AI Video Synthesis Before It Was Cool

In mid-2022, I was a Lead Software Engineer at HappyCredit, a fintech-edtech startup, and we were trying to solve a very specific problem: how do you produce hundreds of personalized video testimonials at scale without a studio, a camera crew, or a budget that made finance uncomfortable?

The answer, it turned out, was to synthesize them.

This was before ChatGPT launched. Before "AI" became shorthand for large language models and chat interfaces. Before every software tool added an AI button to justify its pricing. We were working with DeepFaceLab, a Python-based face-swapping and face-reenactment toolkit that had built a cult following in the video production underground, and ElevenLabs, which was barely a product yet — they'd just opened early API access to their voice cloning technology.

What we built was janky, compute-hungry, inconsistent, and deeply impressive for the time. Here's the honest account of how it worked and what building it was actually like.

The Problem We Were Solving

HappyCredit worked with ed-tech clients who needed video content — specifically, the kind of social-proof testimonial content that converts browsers into buyers. Real student testimonials. Real faces, real voices, talking about real course outcomes.

Getting those videos at scale meant coordinating with actual students, scheduling recordings, dealing with lighting and audio quality, editing raw footage, and then going through the whole cycle again when a client needed a different language variant or a fresher face.

Nobody wanted to do this. It was expensive, slow, and the bottleneck was always human coordination.

Our hypothesis: if we had a library of consenting creator faces and could generate natural-sounding voice-overs programmatically, we could automate the bulk of this pipeline. A client tells us what claims they want the testimonial to make, we feed that through the synthesis pipeline, and they get a production-ready video in hours instead of weeks.

That was the pitch. The execution was considerably messier.

The Stack: DeepFaceLab + ElevenLabs + Python Glue

The architecture was three stages chained together:

Stage 1: Script generation and voice synthesis

We'd take the testimonial brief — usually a set of bullet points from the client about what they wanted communicated — and turn it into a natural-sounding script. In 2022, this was mostly manual plus some templating; GPT-3 existed but we didn't trust it enough for client-facing copy without heavy review.
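
The templating half of that step can be sketched roughly as follows. This is an illustration under stated assumptions, not the original code: the template wording, the field names, the sample brief, and the `render_script` helper are all hypothetical.

```python
from string import Template

# Hypothetical template; in practice the wording was written and reviewed by hand.
TESTIMONIAL_TEMPLATE = Template(
    "Hi, I'm $name. I took $course because $goal. $outcome"
)


def render_script(brief: dict) -> str:
    """Fill the template from a client brief, failing loudly on missing fields."""
    try:
        script = TESTIMONIAL_TEMPLATE.substitute(brief)
    except KeyError as missing:
        raise ValueError(f"brief is missing field {missing}") from None
    # Collapse stray whitespace so the text-to-speech input stays clean.
    return " ".join(script.split())


# Placeholder brief, invented for the example.
brief = {
    "name": "Asha",
    "course": "the data analytics course",
    "goal": "I wanted to move into an analyst role",
    "outcome": "Three months later I had the job.",
}
print(render_script(brief))
```

Using `substitute` rather than `safe_substitute` is deliberate here: a brief with a missing field should stop the run instead of producing a script with a dangling placeholder.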

Once we had the script, it went through the ElevenLabs API. The voice cloning worked by training a model on ~5 minutes of audio from the creator, which we collected once during onboarding. After that, any text you sent to the API would come back as audio in that creator's voice. The quality was genuinely remarkable — better than anything else available at the time. The main issues were prosody (emotional flatness on certain sentence structures) and occasional consonant mangling that required re-generation.

Output: an .mp3 of the creator, in their own voice, reading the testimonial script.
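
For readers who have not used a text-to-speech API, the request shape looks roughly like the sketch below. It assumes the present-day public ElevenLabs endpoint (`POST /v1/text-to-speech/{voice_id}` with an `xi-api-key` header); the 2022 early-access API may have differed, and the helper names `build_tts_request` and `synthesize_to_file` are hypothetical.

```python
import json
import urllib.request

# Assumption: the current public base URL, not necessarily the 2022 one.
API_BASE = "https://api.elevenlabs.io/v1"


def build_tts_request(voice_id: str, text: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a text-to-speech request for a cloned voice."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/text-to-speech/{voice_id}",
        data=body,
        headers={
            "xi-api-key": api_key,
            "Content-Type": "application/json",
            "Accept": "audio/mpeg",
        },
        method="POST",
    )


def synthesize_to_file(request: urllib.request.Request, out_path: str) -> None:
    """Send the request and write the returned audio bytes to disk."""
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        out.write(response.read())
```

Keeping request construction separate from sending makes the re-generation case cheap: the same request can be sent again when a take comes back with mangled consonants.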

Stage 2: Lip sync via DeepFaceLab

This is where things got complicated.

DeepFaceLab was not designed to be a production pipeline component. It was designed for manual, high-effort face work by video artists who would spend hours tweaking settings for a single clip. We were trying to run it programmatically, at volume, on commodity GPU hardware.

The pipeline took a source video (a "neutral talking head" clip we recorded once per creator — usually 60–90 seconds of them speaking naturally) and drove the lip movements to match the synthesized audio track. The result was the creator appearing to say the synthesized words.

What actually happened was more variable than that.

On good runs — correct lighting match, clean audio, familiar phoneme patterns — the output was nearly indistinguishable from a real recording. On bad runs, you'd get ghosting around the mouth, unnatural jaw movement, or what we internally called "the uncanny valley flicker" — a frame-rate artifact that made the face feel slightly wrong even if you couldn't place why.

We spent a significant chunk of the first three months just improving yield rate: the percentage of synthesis runs that produced usable output without manual intervention. We built an automated quality-check layer using basic frame-by-frame variance analysis to flag obviously broken outputs before they reached human review.
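
A check in that spirit can be sketched in a few lines. The version below is an assumption-heavy stand-in, not the original layer: it uses mean absolute difference between consecutive frames rather than any specific variance measure, it treats each frame as a flat list of grayscale values, and the `ratio` and `floor` thresholds are invented for illustration.

```python
from statistics import median


def frame_deltas(frames):
    """Mean absolute pixel difference between each pair of consecutive frames."""
    deltas = []
    for prev, curr in zip(frames, frames[1:]):
        total = sum(abs(a - b) for a, b in zip(prev, curr))
        deltas.append(total / len(curr))
    return deltas


def flag_flicker(frames, ratio=4.0, floor=1.0):
    """Return indices of frames whose change from the previous frame is an outlier.

    A frame is flagged when its delta exceeds `ratio` times the clip's median
    delta. `floor` keeps the baseline from collapsing to zero on static clips.
    """
    deltas = frame_deltas(frames)
    if not deltas:
        return []
    baseline = max(median(deltas), floor)
    return [i + 1 for i, d in enumerate(deltas) if d > ratio * baseline]


# Six tiny 4-pixel "frames" that brighten slowly, with one glitched frame.
frames = [[100 + i] * 4 for i in range(6)]
frames[3] = [160] * 4
print(flag_flicker(frames))  # [3, 4]
```

The glitch shows up twice, once on the way in and once on the way out, which is usually what you want: both transitions are visible to a viewer.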

Stage 3: Post-processing and delivery

The raw output from DeepFaceLab needed cleanup: color correction to match the original clip's grade, audio leveling, captioning for accessibility, and formatting for whatever platform the client was using. This stage was the most stable — standard FFmpeg tooling, nothing exotic.
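
As a rough idea of what that FFmpeg step can look like, the sketch below assembles a single command that applies a color tweak, burns in captions, and normalizes loudness. The file names, the `eq` values, and the `build_ffmpeg_command` helper are hypothetical; the `subtitles` filter also assumes an FFmpeg build with libass and a simple caption path that needs no escaping.

```python
import shlex


def build_ffmpeg_command(raw_video, captions, out_path, contrast=1.0, saturation=1.0):
    """Assemble an FFmpeg invocation for grade, captions, and audio leveling."""
    video_filters = (
        f"eq=contrast={contrast}:saturation={saturation},"  # simple color correction
        f"subtitles={captions}"  # burn captions into the picture
    )
    return [
        "ffmpeg", "-y",
        "-i", raw_video,
        "-vf", video_filters,
        "-af", "loudnorm",  # loudness normalization on the audio track
        "-c:v", "libx264",
        "-c:a", "aac",
        out_path,
    ]


cmd = build_ffmpeg_command(
    "raw.mp4", "captions.srt", "final.mp4", contrast=1.05, saturation=1.1
)
print(shlex.join(cmd))
```

Building the command as a list and passing it to `subprocess.run(cmd, check=True)` avoids shell quoting problems; `shlex.join` is only there to make the command readable in logs.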

The full pipeline, end-to-end, took between 2 and 4 hours of wall-clock time per video on our hardware setup. That sounds slow, but compared to the alternative — weeks of coordination to get a real recording — it was transformative for the clients who saw it working.

What Broke (The Honest List)

Compute was brutal. DeepFaceLab's inference was GPU-bound, and we were running on machines that were designed for development, not production ML workloads. We had a queue system that looked fine in testing and fell apart under real volume. We rebuilt the queue twice.

Lip sync timing drift. The audio synthesis and video generation were two separate processes that we stitched together. When the audio phrasing didn't match the phoneme timing in the source video, the sync would drift — sometimes subtly, sometimes in a way that made the output unusable. We built heuristics to detect this, but the real fix was being more disciplined about what source video we accepted from creators during onboarding.
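
One simple form such a heuristic can take is comparing paired event times from the two tracks. The sketch below is hypothetical throughout: it assumes some upstream detector already produces matching onset timestamps (in seconds) for the audio and for visible mouth movement, and the 80 ms tolerance is an illustrative number, not a measured one.

```python
def sync_report(audio_onsets, video_onsets, tolerance=0.08):
    """Summarize how far the video's mouth events stray from the audio's.

    `worst_offset` is the largest single misalignment; `drift` is how much the
    offset grows between the first and last event in the clip.
    """
    if len(audio_onsets) != len(video_onsets) or not audio_onsets:
        raise ValueError("need equal-length, non-empty onset lists")
    offsets = [v - a for a, v in zip(audio_onsets, video_onsets)]
    worst = max(offsets, key=abs)
    drift = offsets[-1] - offsets[0]
    return {
        "worst_offset": worst,
        "drift": drift,
        "in_sync": abs(worst) <= tolerance,
    }


# Video falls a further 50 ms behind on every event: steady drift.
report = sync_report([0.0, 1.0, 2.0, 3.0], [0.0, 1.05, 2.10, 3.15])
print(report["in_sync"])  # False
```

Reporting drift separately from the worst offset matters because the two call for different fixes: a constant offset can be corrected by shifting the audio, while growing drift usually means the clip needs to be regenerated.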

Creator diversity. The model performed differently depending on skin tone, facial structure, and recording conditions. Leaving that as-is wasn't acceptable, and fixing it required separate calibration work for each creator profile. We documented this honestly in our internal quality guide, and it became a structured part of the creator onboarding checklist.

ElevenLabs rate limits. In 2022, ElevenLabs was an early product with API limits that weren't designed for production batch workloads. We hit ceilings at awkward times and had to build request queuing and retry logic that we hadn't planned for.
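
Retry logic of this kind typically follows an exponential-backoff pattern. The sketch below is a generic version under stated assumptions, not the original implementation: `RateLimited` and `call_with_backoff` are hypothetical names, and the wrapped function is expected to raise `RateLimited` itself when the API returns a rate-limit response.

```python
import random
import time


class RateLimited(Exception):
    """Raised by the caller-supplied function when the API says to slow down."""


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call `fn`, retrying on RateLimited with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts; let the queue decide what happens next
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter keeps a batch of workers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))


# Usage with a fake API call that is rate limited twice, then succeeds.
calls = {"count": 0}


def flaky_synthesis():
    calls["count"] += 1
    if calls["count"] <= 2:
        raise RateLimited()
    return "audio-bytes"


print(call_with_backoff(flaky_synthesis, sleep=lambda seconds: None))  # audio-bytes
```

Passing `sleep` in as a parameter keeps the function testable without real waiting, which is useful when the same wrapper sits inside a batch queue.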

Client expectations vs. reality. The gap between "AI-generated video" as a concept (perfect, seamless, indistinguishable) and what we were actually producing (high-quality but not flawless) required active expectation management. We learned early to show clients examples before starting a batch, not after.

What Worked

The voice cloning genuinely worked. ElevenLabs in 2022 was already producing voice output that clients consistently found indistinguishable from the source creator. That was the foundation everything else relied on, and it held up.

The scale story worked. Once the pipeline was stable, the throughput was real. Producing a new video variant for a different language, tone, or talking point didn't require re-involving the creator. That flexibility was the thing clients valued most — the ability to iterate on messaging without a production cycle.

The related video creation platform, which grew out of some of this infrastructure, reached 1,000+ creator adoptions and reduced content creation cycle time by 50%. The AI video synthesis work was foundational to that adoption: it gave creators a way to produce volume that would have been impossible manually.

What This Looked Like, Building It in 2022

Doing this work in 2022 came with a context that gets lost when people talk about "early AI" from a 2025 or 2026 vantage point, and it's worth naming.

In 2022, the dominant mental model of AI-generated video was still tied to deepfakes in the negative sense — manipulated political footage, non-consensual face swaps, misinformation. We were doing something different — consented, creator-owned, transparently synthetic — but we were working in a space that had no established norms yet.

We made explicit decisions about how to handle consent (written agreement from every creator, with documented use-case scope), how to label synthetic content for clients, and what use cases we'd decline. None of this was legally required. We did it because it was the right framework for the technology, and because we were close enough to it to see clearly what misuse would look like.

The LLM wave that started in late 2022 with ChatGPT and accelerated through 2023 shifted most of the "AI content" conversation to text. Video synthesis got quieter for a while before it came roaring back with more sophisticated tools. But the foundational problems — compute costs, quality control, ethical consent frameworks, pipeline reliability — don't go away just because the underlying models get better.

Why I Still Think About This Work

Building this in 2022 was hard in a way that modern AI tooling tries to paper over. There were no well-documented best practices. There was no community of people who had done this in production before. Every problem was new and required first-principles thinking.

That experience of working close to the model — understanding where it fails, building quality checks because you've personally seen every failure mode, making explicit architectural decisions because nobody had made them for you — shaped how I think about AI integration work now.

The tools are better. The models are better. The engineering discipline required to use them well hasn't changed.

Shubham Gupta is a Tech Lead (→ Junior Associate Architect) at Simplilearn. Previously Lead Software Engineer at HappyCredit (2022–2024), where he worked on AI video synthesis, platform infrastructure, and content creator tooling.
