Rewriting Simplilearn's Architecture: Two Monorepos, One Direction

I joined Simplilearn as a senior engineer and was promoted to Tech Lead last year. In April 2026, I moved into a Junior Associate Architect role. The promotion came with a nicer title, but the work that earned it was less glamorous: years of archaeological excavation through two large legacy codebases, and the slow, careful work of migrating them toward something better.

This is the honest account of that migration — what we inherited, the decisions we made, how we run the new systems in production, and the considerable amount of work still in flight.

What We Inherited

Every large engineering org has legacy. Simplilearn's was not unusual in kind, only in scale.

The legacy frontend app is a React monolith: a customer-facing site that started out straightforward and accumulated 8+ years of features, experiments, and rewrites-that-weren't-quite-rewrites. The codebase had no clear module boundaries, inconsistent state management (Redux, Context API, and direct API calls all handling similar state in different parts of the same app), and a deployment process slow enough that engineers worked around it in ways that created more debt.

The legacy PHP backend is the server-side complement — cron jobs, internal APIs, integrations with third-party systems, and business logic touched by dozens of engineers over nearly a decade. The kind of codebase where you change one thing and discover three other things that were implicitly depending on the behavior you just modified.

Both codebases served a platform that handles 1M+ organic monthly visits. They worked. They kept working. That's the thing about legacy systems that makes migration hard — they're not broken. They just make every change more expensive than it should be, and the cost compounds quietly until you're spending 60% of your sprint on maintenance for a feature that should take a day.

The combined technical debt was large enough that patching wasn't a viable strategy. We needed a new direction.

The Decision: Two Monorepos, Turborepo, Next.js + NestJS

Architecture decisions at this scale are rarely clean. You're not choosing between a clearly right answer and a clearly wrong one. You're choosing between trade-offs you understand and trade-offs you can't yet see.

Two Monorepos, Not One

We could have put everything in a single monorepo. We didn't, for a reason that sounds organizational but is actually architectural: the customer-facing applications and the internal/backend applications have fundamentally different deployment cycles, dependency surfaces, and team ownership patterns.

Customer-facing apps have strict deployment gates. A failing test or broken build on a back-office tool should not block a hotfix to the course catalog page serving live traffic. In a single monorepo, pipeline coupling is almost inevitable — your turbo.json graph ends up with transitive dependencies that pull unrelated apps into your critical path.

The frontend monorepo holds the five Next.js apps that users interact with directly. They change frequently, and they have to be fast and reliable for external users.

The backend monorepo holds 10+ internal applications and services — NestJS APIs, internal tools, the business logic layer the customer-facing apps consume.

Separating them means teams can own their domain without coupling deployments. The frontend monorepo is owned by product engineering. The backend monorepo is owned by platform engineering. Each team gets clean CI ownership, separate remote cache namespaces, and the ability to tune pipeline configuration without negotiating with the other team.

The practical downside is managing two sets of shared tooling. We handle this with a shared internal package registry for common configs (ESLint, TypeScript base configs, design tokens). The duplication cost is lower than the coordination cost of a unified repo.

Turborepo as the Organizational Standard

Before this migration, we didn't have a consistent monorepo tooling story. We evaluated options and landed on Turborepo. The technical reasons are well-documented — remote caching, intelligent task orchestration, incremental builds. The organizational reason is equally important: it's opinionated enough that teams don't spend time making local tooling decisions, but flexible enough to accommodate real differences between apps.

Making Turborepo the org-wide standard means new apps get the caching and orchestration benefits by default. It means onboarding engineers already know what to expect. It means we're building on a shared foundation instead of a collection of bespoke build setups.

Next.js for Customer-Facing, NestJS for Internal

The Next.js choice for customer-facing apps was made partly for performance (the rendering flexibility of Next.js is valuable at our scale) and partly for the ecosystem and hiring market. We're building for 1M+ monthly visitors. Performance matters from day one.

The NestJS choice for internal services was made for structure. Express is flexible, which is another way of saying it's permissive, which is another way of saying teams make inconsistent architectural decisions that create maintenance debt. NestJS's module system and dependency injection enforce patterns that keep large service codebases navigable. The teams that switched from the legacy PHP backend to NestJS consistently reported that new feature development felt faster — not because NestJS is inherently faster, but because the structure made it possible to understand the codebase quickly.

The Frontend Monorepo: Running in Production

Five Next.js apps, each serving distinct product areas. Shared packages for UI components, API clients, analytics instrumentation, and config. The package graph is intentionally shallow — we resist putting business logic into shared packages because it creates exactly the coupling we split the monorepos to avoid.

Local vs. CI: Different Animals

Locally, Turborepo's caching is straightforward. Hashes computed from inputs, outputs stored, cache hits fast. Developers get the wins and rarely think about the mechanics.

CI is adversarial by comparison. Ephemeral runners with no persistent disk, multiple jobs running in parallel against the same remote cache, varied environment variables that affect hashes, and strict requirements around cache correctness — a stale cache in production is a bug, not an inconvenience.

The remote cache that works beautifully in local development becomes a distributed systems problem in CI. We had production incidents related to cache behavior before we understood the failure modes well enough to prevent them.

Remote Cache Setup

We use Turborepo's remote caching with a self-hosted cache server backed by S3. We evaluated Vercel's hosted option but needed audit logging and access controls that required self-hosting. The server runs on ECS, with a persistent EFS mount as the hot cache and S3 as the cold tier.

Two non-obvious requirements we learned the hard way:

Branch-namespaced cache keys. We namespace remote cache keys by branch — main caches are isolated from feature branch caches. This prevents a feature branch with a broken build from poisoning the main branch cache.

turbo run build --cache-dir=.turbo --team="branch-${GITHUB_REF_SLUG}"
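The namespacing only works if every branch name maps to a safe, stable string. A minimal TypeScript sketch of that mapping follows. The helper name cacheNamespace and the slug rules (lowercase, hyphens, 63-character cap) are assumptions for illustration, not something Turborepo requires and not necessarily what our pipeline runs.

```typescript
// Hypothetical helper: derive a remote-cache namespace from a git ref.
// The slug rules below are illustrative assumptions, not Turborepo requirements.
function cacheNamespace(gitRef: string): string {
  const branch = gitRef.replace(/^refs\/heads\//, "");
  const slug = branch
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse anything unsafe into a single hyphen
    .slice(0, 63)                // keep the namespace short
    .replace(/^-+|-+$/g, "");    // trim leading and trailing hyphens
  return `branch-${slug || "unknown"}`;
}

console.log(cacheNamespace("refs/heads/main"));
// branch-main
console.log(cacheNamespace("refs/heads/feature/ABC-123_checkout"));
// branch-feature-abc-123-checkout
```

Because the mapping is deterministic, every job on the same branch lands in the same namespace, and no job on a feature branch can write into the one used by main.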

Cache warming on main. We run a cache-warming job on every merge to main. This ensures the first CI run after a merge gets full cache benefit, not a cold start.

Cache Poisoning Incidents

Incident 1: Non-deterministic test. Our test suite had a flaky test — sometimes passed, sometimes failed based on timing. Turborepo cached the passing output and replayed it on subsequent runs, masking the flakiness for two weeks. We discovered it when the cached output was invalidated by an unrelated change and the test failed visibly.

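The fix: --no-cache on all test tasks in CI. Tests are fast enough in our setup that the cache benefit didn't justify the risk of masking failures. One caveat: --no-cache only stops new results from being written, so results already in the cache can still be replayed; setting "cache": false on the task in turbo.json turns off both reads and writes.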

Incident 2: Environment variable bleed. A CI step set NODE_ENV=test in a way that propagated to the build task's environment. Cached build artifacts were built with NODE_ENV=test but deployed as production. Turborepo's hash didn't include NODE_ENV because we hadn't listed it in the task's env configuration.

The fix: explicit env declarations in turbo.json for every variable that affects build output. The failure mode is silent — the cache gives you the wrong artifacts without warning.

{
  "tasks": {
    "build": {
      "dependsOn": ["^build"],
      "inputs": ["src/**", "package.json", "tsconfig.json"],
      "outputs": [".next/**"],
      "env": ["NODE_ENV", "NEXT_PUBLIC_API_URL", "NEXT_PUBLIC_CDN_URL"]
    }
  }
}
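Why an undeclared variable fails silently is easier to see in a toy model of the cache key. The sketch below is an illustration only: taskCacheKey and fnv1a are hypothetical names, and the FNV-1a hash is a stand-in for Turborepo's real hashing, which works differently. The one property it shares with the real thing is that only declared inputs and declared env vars reach the key.

```typescript
// Toy content hash (FNV-1a, 32-bit). A stand-in only: Turborepo's real hashing differs.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16);
}

// Hypothetical model of a task cache key: declared files plus declared env vars, nothing else.
function taskCacheKey(
  files: Record<string, string>,
  declaredEnv: string[],
  env: Record<string, string | undefined>,
): string {
  const fileParts = Object.keys(files).sort().map((path) => `${path}=${fnv1a(files[path])}`);
  const envParts = [...declaredEnv].sort().map((name) => `${name}=${env[name] ?? ""}`);
  return [...fileParts, ...envParts].join("|");
}

const sourceFiles = { "src/index.ts": "export const x = 1;" };
const undeclaredKey = (nodeEnv: string) => taskCacheKey(sourceFiles, [], { NODE_ENV: nodeEnv });
const declaredKey = (nodeEnv: string) => taskCacheKey(sourceFiles, ["NODE_ENV"], { NODE_ENV: nodeEnv });

// Undeclared: test and production builds share a key, so the wrong artifact is replayed.
console.log(undeclaredKey("test") === undeclaredKey("production")); // true
// Declared: the keys differ, so the production build is a cache miss and rebuilds.
console.log(declaredKey("test") === declaredKey("production")); // false
```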

Docker Layer Optimization

The biggest CI win we unlocked wasn't caching — it was Docker layer optimization with pruning. Turborepo's prune command generates a minimal monorepo subset for a single app, containing only the packages it actually depends on.

turbo prune @simpliturbo/course-catalog --docker

With --docker, this generates an out/ directory split into out/json (the pruned package.json files), out/full (the pruned source), and a pruned lockfile. In Docker, we copy the package.json files and lockfile first, run npm ci, then copy the pruned source. The result: Docker layer cache hits on the npm ci layer survive across deploys as long as dependencies don't change, which is most deploys.
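Conceptually, pruning is a transitive closure over the workspace graph: keep the target app and everything reachable from it. A rough TypeScript sketch of that idea follows. The package names and the pruneClosure helper are made up for illustration; turbo prune does considerably more, including rewriting the lockfile.

```typescript
// Hypothetical workspace graph: package name -> internal packages it depends on.
type WorkspaceGraph = Record<string, string[]>;

// Collect the target plus everything it transitively depends on.
function pruneClosure(graph: WorkspaceGraph, target: string): string[] {
  const keep = new Set<string>();
  const stack = [target];
  while (stack.length > 0) {
    const name = stack.pop() as string;
    if (keep.has(name)) continue; // already visited
    keep.add(name);
    for (const dep of graph[name] ?? []) stack.push(dep);
  }
  return Array.from(keep).sort();
}

const workspaceGraph: WorkspaceGraph = {
  "course-catalog": ["ui", "api-client"],
  "checkout": ["ui", "payments-sdk"],
  "ui": ["design-tokens"],
  "api-client": [],
  "payments-sdk": [],
  "design-tokens": [],
};

console.log(pruneClosure(workspaceGraph, "course-catalog"));
// [ 'api-client', 'course-catalog', 'design-tokens', 'ui' ]
```

Anything outside the closure (checkout and payments-sdk in this example) never enters the Docker build context, so changes to those packages cannot invalidate the image's layers.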

We went from a 12-minute Docker build to 3 minutes on builds where dependencies haven't changed. For customer-facing apps where hotfixes need to ship fast, this matters.

Parallelism Tuning

The GitHub Actions runners we use have 2 vCPUs. Running all five apps plus packages in parallel on 2 vCPUs doesn't parallelize; it context-switches. We set --concurrency=4 (above the vCPU count, to overlap I/O wait) and got better wall-clock times than with the default.

For resource-intensive tasks (type checking, linting), we run --concurrency=2 in a separate CI step. Overlapping two TypeScript compiler processes on a 2-core machine is about twice as fast as running them sequentially; more than two doesn't help.

The Backend Monorepo: Internal Systems at Scale

Before the backend monorepo, internal apps were scattered across separate repositories. Releasing a change to a shared internal component meant opening PRs in multiple repos, coordinating merges, bumping package versions, and hoping nothing went stale. It was a coordination tax that grew with every new tool we shipped.

The monorepo pitch was practical: one PR, one pipeline, atomic changes across the internal system. Turborepo's remote caching was what made it viable — we weren't interested in paying CI costs proportional to the number of apps when most of them hadn't changed.

The turbo.json We Learned to Write

Our turbo.json started simple and evolved through painful experience. The key insight we missed initially: cache keys are computed from inputs you declare. Declare them wrong and you either get too many cache misses or, worse, stale cache hits that silently break builds.

{
  "tasks": {
    "build": {
      "dependsOn": ["^build"],
      "inputs": ["src/**", "package.json", "tsconfig.json"],
      "outputs": ["dist/**"],
      "env": ["NODE_ENV", "API_BASE_URL", "DATABASE_URL"]
    },
    "test": {
      "dependsOn": ["build"],
      "inputs": ["src/**", "**/*.test.ts", "jest.config.ts"],
      "outputs": [],
      "cache": false
    }
  }
}

"cache": false on tests. After the non-deterministic test incident on the frontend monorepo, we applied the same rule here from day one.

Additional Caching Gotchas

Non-deterministic outputs. Some internal apps included build timestamps or git SHAs in their output. Turborepo can't cache these consistently because the outputs change every build. We stripped all non-deterministic data from build artifacts and moved it to runtime config injection.

Local cache accumulation. The .turbo directory accumulates state. On ephemeral CI runners this is fine. On developer machines it can grow to several gigabytes and start causing unexpected behavior — especially after package renames, which are common in a growing internal systems repo. .turbo belongs in your periodic cleanup scripts.

Pipeline Topology: Where Architecture Decisions Live

The dependsOn graph is where architecture decisions live. Get it wrong and you serialize work that could run in parallel. Get it too aggressive and you run tasks with stale dependencies.

We mapped our dependency graph early:

turbo run build --dry=json

The output lists every task Turborepo would schedule along with its dependencies, which shows what can run in parallel and what blocks on what. We found three package cycles nobody had documented: packages that depended on each other through transitive paths, forcing serial execution across large sections of the pipeline until we broke the cycles.

The rule we settled on: no circular dependencies, enforced with an ESLint import plugin at the root workspace level. Violations fail CI. In practice it just makes the dependency graph an explicit engineering decision instead of an accident.
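The check behind a no-cycles rule is a depth-first search that reports a cycle when it revisits a node still on the current path. Here is a small TypeScript sketch of that technique. The findCycle helper and the package names are hypothetical, and the ESLint plugin does this at the import level, not on a hand-built graph like this one.

```typescript
// Hypothetical package graph: package name -> packages it imports.
type DepGraph = Record<string, string[]>;

// Depth-first search. Returns the first cycle found as a path, or null if the graph is acyclic.
function findCycle(graph: DepGraph): string[] | null {
  const state: Record<string, "visiting" | "done"> = {};
  const path: string[] = [];

  function visit(node: string): string[] | null {
    if (state[node] === "done") return null;
    if (state[node] === "visiting") {
      // The node is already on the current path, so the path from it back to itself is a cycle.
      return [...path.slice(path.indexOf(node)), node];
    }
    state[node] = "visiting";
    path.push(node);
    for (const dep of graph[node] ?? []) {
      const cycle = visit(dep);
      if (cycle) return cycle;
    }
    path.pop();
    state[node] = "done";
    return null;
  }

  for (const node of Object.keys(graph)) {
    const cycle = visit(node);
    if (cycle) return cycle;
  }
  return null;
}

const cyclicGraph: DepGraph = {
  "auth": ["users"],
  "users": ["notifications"],
  "notifications": ["auth"], // closes the loop through a transitive path
  "logger": [],
};

console.log(findCycle(cyclicGraph));
// [ 'auth', 'users', 'notifications', 'auth' ]
```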

Flaky Test Quarantine

The hardest cultural change was handling flaky tests across a shared pipeline. Before the monorepo, each team's flaky tests were that team's problem. In a shared repo, a flaky test in any package can block every PR.

We implemented an aggressive quarantine policy: any test that fails intermittently three times in a week moves to a separate quarantine suite that runs outside the blocking pipeline. It still runs — you get visibility — but it doesn't block merges.
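The quarantine rule is simple enough to express as a pure function. The sketch below assumes a rolling seven-day window and a record per intermittent failure. Both are assumptions for illustration, and FlakyFailure and testsToQuarantine are hypothetical names, not our actual tooling.

```typescript
// Hypothetical record of one intermittent failure (for example, a failure that passed on retry).
interface FlakyFailure {
  testId: string;
  at: Date;
}

const WEEK_MS = 7 * 24 * 60 * 60 * 1000;

// Return the tests with at least `threshold` intermittent failures in the 7 days before `now`.
function testsToQuarantine(failures: FlakyFailure[], now: Date, threshold = 3): string[] {
  const counts = new Map<string, number>();
  for (const failure of failures) {
    const age = now.getTime() - failure.at.getTime();
    if (age >= 0 && age <= WEEK_MS) {
      counts.set(failure.testId, (counts.get(failure.testId) ?? 0) + 1);
    }
  }
  return Array.from(counts.entries())
    .filter(([, count]) => count >= threshold)
    .map(([testId]) => testId)
    .sort();
}

const sampleNow = new Date("2026-04-15T00:00:00Z");
const sampleFailures: FlakyFailure[] = [
  { testId: "checkout.spec", at: new Date("2026-04-09T10:00:00Z") },
  { testId: "checkout.spec", at: new Date("2026-04-11T10:00:00Z") },
  { testId: "checkout.spec", at: new Date("2026-04-14T10:00:00Z") },
  { testId: "search.spec", at: new Date("2026-04-10T10:00:00Z") },
  { testId: "search.spec", at: new Date("2026-04-13T10:00:00Z") },
  { testId: "search.spec", at: new Date("2026-03-20T10:00:00Z") }, // outside the window
];

console.log(testsToQuarantine(sampleFailures, sampleNow));
// [ 'checkout.spec' ]
```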

The threshold sounds generous. In practice it forced us to fix flakiness we'd been ignoring for months, because quarantined tests are visible to everyone and carry a team's name. Social pressure works.

Code Ownership at Scale

With 10+ apps across multiple teams, review assignment becomes ambiguous fast. We set up CODEOWNERS at the package level — not just the app level. Every shared internal package has a designated owner team.

CI enforces that every PR has an approved review from the relevant CODEOWNERS entry before merge. Without it, review assignments become informal, ownership becomes unclear, and the repo slowly becomes nobody's responsibility.

What's In Migration Right Now

We're not working from a clean slate. We're migrating four active systems in parallel while continuing to ship on the legacy systems they're replacing.

The payments platform is being migrated from the legacy stack to the frontend monorepo. This is one of our highest-stakes surfaces — payment flows require correctness guarantees that make a parallel migration (run old and new side-by-side, validate parity, cut over) the only responsible approach. We've been running that parallel validation for months.
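At its core, parallel validation means sending the same request to both stacks and diffing the answers field by field. A simplified TypeScript sketch follows. The Quote shape, the field names, and compareQuotes are all hypothetical, and a real parity layer also has to handle timeouts, retries, and fields that are allowed to differ.

```typescript
// Hypothetical shape of a payment quote returned by either stack.
interface Quote {
  orderId: string;
  amountMinor: number; // amount in minor units (paise, cents) to avoid float drift
  currency: string;
}

interface Mismatch {
  orderId: string;
  field: keyof Quote;
  legacy: string | number;
  candidate: string | number;
}

// Compare the legacy and new responses field by field and report every difference.
function compareQuotes(legacy: Quote, candidate: Quote): Mismatch[] {
  const fields: (keyof Quote)[] = ["orderId", "amountMinor", "currency"];
  return fields
    .filter((field) => legacy[field] !== candidate[field])
    .map((field) => ({
      orderId: legacy.orderId,
      field,
      legacy: legacy[field],
      candidate: candidate[field],
    }));
}

const legacyQuote: Quote = { orderId: "o-1", amountMinor: 49900, currency: "INR" };

console.log(compareQuotes(legacyQuote, { ...legacyQuote }).length);
// 0
console.log(compareQuotes(legacyQuote, { ...legacyQuote, amountMinor: 49000 }).map((m) => m.field));
// [ 'amountMinor' ]
```

Cutover is only on the table once the mismatch list stays empty across real traffic for long enough to trust it.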

The legacy cron jobs were migrated to NestJS. This one is worth spending a moment on because the results were the clearest we've seen. The legacy crons were running on the PHP backend's scheduler — a homebrew system with reliability characteristics that were, charitably, unpredictable. After migration to NestJS with a proper scheduling library and observability tooling, we saw a 25× performance improvement and the visibility into cron execution that we simply didn't have before. We could see what was running, when, how long it took, and what failed. That sounds basic. On the legacy system, it wasn't.

The legacy PHP APIs are being migrated in phases via strangler-fig: new NestJS implementations go live for new consumers, while old consumers continue hitting the PHP endpoints until we've verified the new implementations and can cut over. This is slow and requires maintaining both implementations simultaneously. It's also the only approach that doesn't require a high-risk big-bang cutover on a production API surface.
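The routing decision at the heart of a strangler-fig setup can be sketched as a lookup: a request goes to the new service only if its path has been migrated and its consumer has been cut over. The TypeScript below is a minimal illustration under those assumptions. MigrationRule, chooseBackend, the paths, and the consumer names are all hypothetical, and in practice this logic usually lives in a gateway or proxy config.

```typescript
type Backend = "legacy-php" | "nestjs";

// Hypothetical routing rule: a path prefix plus the consumers already cut over to the new service.
interface MigrationRule {
  pathPrefix: string;
  migratedConsumers: string[];
}

// Route to NestJS only when the path is covered by a rule and the consumer has been cut over.
function chooseBackend(rules: MigrationRule[], path: string, consumer: string): Backend {
  for (const rule of rules) {
    if (path.startsWith(rule.pathPrefix) && rule.migratedConsumers.includes(consumer)) {
      return "nestjs";
    }
  }
  return "legacy-php"; // the default stays on the old implementation
}

const migrationRules: MigrationRule[] = [
  { pathPrefix: "/api/enrollments", migratedConsumers: ["mobile-app"] },
];

console.log(chooseBackend(migrationRules, "/api/enrollments/42", "mobile-app"));     // nestjs
console.log(chooseBackend(migrationRules, "/api/enrollments/42", "partner-portal")); // legacy-php
console.log(chooseBackend(migrationRules, "/api/invoices/7", "mobile-app"));         // legacy-php
```

Defaulting to the legacy backend is the important design choice: a missing or wrong rule degrades to the old behavior instead of sending unverified traffic to the new service.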

WordPress pages are being moved into the Next.js stack. This is the most visible migration to external users — the pages look different, perform differently, and require coordinating with content and SEO teams who have strong opinions about what changes. We've learned to loop those teams in early rather than presenting finished implementations.

Real Outcomes So Far

I'm going to be specific here because general claims about architecture migrations are easy to make and hard to trust.

Uptime: 95% → 99%. The pre-migration baseline was 95% uptime on the affected systems. Post-migration, we're at 99%. This is a direct result of more reliable infrastructure and better observability, not a magic property of the new tech stack.
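For a sense of scale, the percentages translate into hours of downtime as follows. This is a quick arithmetic sketch that assumes the figures are measured over a full 365-day year; the measurement window is an assumption, and downtimeHoursPerYear is an illustrative helper.

```typescript
const HOURS_PER_YEAR = 365 * 24; // 8760

// Convert an uptime percentage into the hours of downtime it allows per year.
function downtimeHoursPerYear(uptimePercent: number): number {
  return ((100 - uptimePercent) * HOURS_PER_YEAR) / 100;
}

console.log(downtimeHoursPerYear(95)); // 438
console.log(downtimeHoursPerYear(99)); // 87.6
```

On that basis, four percentage points is the difference between roughly eighteen days and under four days of downtime per year.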

25× cron performance improvement. The starkest single improvement we've seen. The legacy cron system was opaque and slow. The NestJS migration gave us speed, visibility, and reliability in one move.

12-minute Docker build → 3 minutes. Turborepo's prune command plus Docker layer caching on the dependency install step. For customer-facing apps where hotfixes need to ship fast, this compounds across every deploy.

Faster developer feedback loops. The consistent feedback from every team that has moved to either monorepo is that the development experience is meaningfully better. Turborepo's caching means incremental builds instead of full rebuilds. CI is faster. The feedback loop is tighter.

What Transfers Between the Two Monorepos

The patterns that worked in one transferred cleanly to the other: branch-namespaced remote cache, explicit env declarations in every task, Docker pruning for builds, --no-cache on tests.

The things that don't transfer: the right --concurrency setting (different runner specs), the right flakiness threshold (different team cultures), and the right package ownership granularity (different team structures). These are team-specific decisions that have to be made locally, not standardized.

What's Still Honest-to-God Hard

I don't want to write a migration success story that makes this seem cleaner than it is.

Running dual systems is expensive. Every system in parallel migration requires maintaining two implementations, two test suites, and the validation layer that verifies parity between them. The teams doing this work are carrying more cognitive load than they'd have on a pure greenfield project. We try to minimize the overlap period, but we can't rush validation on systems that handle payments or core platform APIs.

Scope creep during migration. When you're already touching a system to migrate it, it's tempting to also fix the things that were always wrong with it. Sometimes that's the right call; sometimes it's scope creep that delays the migration without proportional benefit. We've gotten better at drawing the line, but we still have the argument regularly.

Team coordination across two monorepos. The organizational benefit of two separate monorepos comes with the cost of explicit coordination when something spans them. A customer-facing feature that requires a new internal API endpoint means coordinating across two teams working in two repos with potentially different sprint rhythms. We're still developing the process for this.

Legacy consumers that are hard to migrate. Not every PHP backend consumer is easy to move to NestJS. Some have implicit dependencies on the legacy system's behavior that aren't documented anywhere. Migrating them requires discovery work that's slow and unpredictable.

The Verdict

The Tech Lead → Junior Associate Architect transition in April 2026 formalized something that was already true in practice: the scope of the work had expanded beyond shipping features on a single team and into shaping the architecture that multiple teams build on.

Turborepo is worth it for both use cases. The local DX improvement is real and sustained. Docker pruning is a significant CI win. The coordination reduction on internal systems is real — one PR for atomic changes versus the multi-repo dance we used to do.

The hidden cost is operational knowledge. Cache poisoning isn't theoretical — non-deterministic outputs, undeclared environment variables, and flaky tests will bite you if you don't design for them upfront. The Turborepo standard, the two-monorepo pattern, the NestJS decision for internal services — these are decisions that will constrain and enable engineering choices across the org for years. Getting them right matters more than any individual feature.

Both monorepos are now the default direction for new app development at Simplilearn. Not mandated from the top — adopted because the teams that tried them didn't want to go back.

Go in with explicit env declarations in every task, branch-namespaced remote cache, --no-cache on tests, and CODEOWNERS at the package level. Build from that foundation and the rest is manageable.

The migration is not finished. The decisions are still being validated by production behavior. The outcomes are real but partial. What I can say with confidence: we know where we're going, we have a coherent architecture for getting there, and the early results support the direction we chose.

The rest is execution.

Shubham Gupta is a Tech Lead (→ Junior Associate Architect) at Simplilearn, an edtech platform serving 1M+ monthly organic visitors. He leads frontend architecture migration and agentic tooling adoption for the engineering org.
