Engineering

Scaling Multi-Tenant LMS Architecture: Lessons from 50K Monthly Active Learners

A single-tenant LMS that fell over at 10K test users, rebuilt to serve 50K monthly learners across 23 institutions at 89ms average. Here's what actually moved the needle.

NevkaSystems Team  ·  Engineering

June 18, 2026  ·  10 min read

TL;DR

We turned a single-institution LMS into a multi-tenant platform serving 50K+ learners across 23 orgs at 99.95% uptime by picking schema-per-tenant isolation, sharding on tenant ID, and caching in four layers.

Key takeaways

1. Schema-per-tenant beat both shared-schema and database-per-tenant for us: real isolation without 23 databases to babysit.

2. Add PostgreSQL row-level security on day one. Retrofitting it onto live tables cost us weeks.

3. Caching the boring stuff (course content, tenant config) dropped average response time from 450ms to 89ms and DB CPU from 85% to 25%.

4. Shard on tenant ID, give your whales dedicated clusters, and let small tenants share. Don't pre-shard everything.

5. Per-tenant dashboards aren't a nice-to-have. Without them you're debugging blind across 20+ orgs.

What we walked into

The client had a working LMS, built for exactly one institution. Then they won a string of enterprise contracts and needed to host 20+ separate organizations on the same platform. The existing system was never going to get there.

Everything lived in one schema. No tenant separation, so one institution's slow query dragged the whole platform down. Scaling meant a bigger, pricier single server and nothing else. Every update was a coordinated downtime window. Tenant-specific settings were hard-coded and scattered across the codebase wherever someone happened to need them.

The contract terms weren't flexible either: complete data isolation between tenants, 50K+ concurrent learners at peak, sub-200ms response on anything a user touches, and a 99.9% uptime SLA. That last one allows about 8.76 hours of downtime per year (0.1% of 8,760 hours). Total.
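For anyone checking the SLA arithmetic, here is a minimal sketch of the downtime budget calculation. The function name is ours for illustration, and it assumes a 365-day year with no excluded maintenance windows, which a real contract may define differently.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, assuming a non-leap year


def downtime_budget_hours(uptime_percent: float) -> float:
    """Return the hours of downtime per year that an uptime percentage permits."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return HOURS_PER_YEAR * (1 - uptime_percent / 100)


if __name__ == "__main__":
    # 99.9% is the contractual target; 99.95% is the figure quoted for production.
    for target in (99.9, 99.95):
        print(f"{target}% uptime -> {downtime_budget_hours(target):.2f} hours/year")
```

Under those assumptions, 99.9% leaves roughly 8.76 hours a year and 99.95% roughly 4.38.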

We load-tested the old system at 10,000 concurrent users to see how far it was from those numbers. Average response time came in at 2.3 seconds against a 200ms target. Error rate hit 12% from timeouts and database locks. Database CPU pinned at 100% and stayed there, with several full outages during the run. This wasn't a tuning problem. It needed a different architecture.

Picking a tenancy model

The first real decision was how to separate tenants, and all three options cost us something.

· Shared database, shared schema. One set of tables, a tenant_id column on everything. Cheap and simple to deploy, but a single bad WHERE clause leaks data across tenants, and noisy neighbors are baked in. Given our isolation requirement was contractual, we passed.

· Separate databases per tenant. Total isolation, independent scaling, and an operational bill we didn't want. Managing 20+ databases, with 20+ backup schedules and migration runs, is a full-time job nobody asked for. Passed.

· Shared database, separate schemas. Each tenant gets its own PostgreSQL schema. Strong isolation, per-tenant backups, and migrations that are more involved but manageable. This is what we shipped.

Schema-per-tenant gave us the isolation the contract demanded without turning operations into a zoo. Then we added PostgreSQL row-level security underneath it as a second wall. Even with separate schemas, RLS means a bug in our application code can't return another tenant's rows, provided the application connects as a role the policies actually apply to: table owners bypass RLS unless it is forced on the table, and superusers always bypass it. The database enforces it, not our middleware. If we'd known then what we know now, we'd have built RLS in from the first commit instead of retrofitting it onto populated tables, which ate weeks.
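The two walls can be sketched as a small provisioning helper that emits the SQL for one tenant. This is an illustration under assumptions, not the production code: the tenant_ schema prefix, the course_progress table and its columns, the tenant_isolation policy name, and the app.tenant_id session setting are all hypothetical. The PostgreSQL statements themselves (CREATE SCHEMA, ENABLE and FORCE ROW LEVEL SECURITY, CREATE POLICY, set_config, current_setting) are standard.

```python
import re

# Names are interpolated into DDL, where bind parameters are not available,
# so only plain lowercase identifiers are accepted.
_IDENT = re.compile(r"[a-z][a-z0-9_]{0,40}")


def tenant_schema(tenant_slug: str) -> str:
    """Map a tenant slug to its schema name, rejecting unsafe input."""
    if not _IDENT.fullmatch(tenant_slug):
        raise ValueError(f"unsafe tenant slug: {tenant_slug!r}")
    return f"tenant_{tenant_slug}"


def isolation_statements(tenant_slug: str) -> list[str]:
    """One-time provisioning SQL: a schema per tenant, with RLS as the second wall."""
    schema = tenant_schema(tenant_slug)
    table = f"{schema}.course_progress"  # hypothetical table
    return [
        f"CREATE SCHEMA IF NOT EXISTS {schema}",
        f"CREATE TABLE IF NOT EXISTS {table} ("
        "tenant_id text NOT NULL, learner_id bigint NOT NULL, "
        "percent_complete smallint NOT NULL)",
        f"ALTER TABLE {table} ENABLE ROW LEVEL SECURITY",
        # FORCE makes the policy apply to the table owner as well.
        f"ALTER TABLE {table} FORCE ROW LEVEL SECURITY",
        f"CREATE POLICY tenant_isolation ON {table} "
        "USING (tenant_id = current_setting('app.tenant_id'))",
    ]


def session_setup(tenant_slug: str) -> list[tuple[str, tuple]]:
    """Per-connection setup as (sql, params) pairs, run before any tenant query."""
    schema = tenant_schema(tenant_slug)
    return [
        (f"SET search_path TO {schema}", ()),
        ("SELECT set_config('app.tenant_id', %s, false)", (tenant_slug,)),
    ]


if __name__ == "__main__":
    for statement in isolation_statements("acme"):
        print(statement + ";")
```

One property worth noting: if a connection never sets app.tenant_id, current_setting raises an error rather than returning rows, so a forgotten setup step fails closed.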

Sharding the data

50K learners generate a lot of rows: course progress, quiz attempts, activity logs, all of it growing daily. We shard on tenant ID. Large tenants, the ones over 5,000 learners, get a dedicated cluster. Everyone smaller shares. We don't pre-shard the small ones into oblivion; we move a tenant to its own cluster when its size justifies it, not before.
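A minimal sketch of that routing rule, assuming a simple in-process lookup. The ShardRouter class and the cluster names are hypothetical; only the 5,000-learner threshold comes from the text above. Actually moving a tenant's data to a new cluster is a separate migration step that this does not cover.

```python
from dataclasses import dataclass, field

DEDICATED_THRESHOLD = 5_000  # learners; tenants above this get their own cluster


@dataclass
class ShardRouter:
    """Maps tenant IDs to database clusters; small tenants share one cluster."""

    shared_cluster: str = "shared-1"  # hypothetical cluster name
    dedicated: dict[str, str] = field(default_factory=dict)

    def cluster_for(self, tenant_id: str) -> str:
        """Return the cluster that currently holds this tenant's data."""
        return self.dedicated.get(tenant_id, self.shared_cluster)

    def review(self, tenant_id: str, learner_count: int) -> str:
        """Assign a dedicated cluster once a tenant crosses the threshold, not before."""
        if learner_count > DEDICATED_THRESHOLD and tenant_id not in self.dedicated:
            self.dedicated[tenant_id] = f"dedicated-{tenant_id}"
        return self.cluster_for(tenant_id)


if __name__ == "__main__":
    router = ShardRouter()
    print(router.review("small-college", 800))
    print(router.review("big-university", 12_000))
```

Keeping the mapping explicit, rather than hashing tenant IDs onto a fixed number of shards, is what lets a tenant be moved individually when its size justifies it.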

Analytics was the other pressure point. Completion reports and engagement metrics are heavy, and we refused to let a report query slow down someone trying to take a quiz. Those run against read replicas; the primary stays free for user-facing traffic. Each app server keeps a connection pool per cluster: 10 connections minimum, 50 max, 30-second idle timeout, automatic retry and failover on a dropped connection.
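The read/write split and the retry behaviour can be sketched with the standard library alone. The names here (PoolSettings, endpoints_for, run_with_retry, the "analytics" workload label) are illustrative assumptions; the pool numbers are the ones quoted above, and a real pool implementation would consume them rather than this dataclass.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class PoolSettings:
    """Per-cluster pool limits, using the figures quoted in the text."""

    min_size: int = 10
    max_size: int = 50
    idle_timeout_s: int = 30


def endpoints_for(workload: str, primary: str, replicas: Sequence[str]) -> list[str]:
    """Heavy reports go to replicas only; user-facing traffic goes to the primary."""
    if workload == "analytics":
        return list(replicas)
    return [primary]


def run_with_retry(
    endpoints: Sequence[str],
    operation: Callable[[str], T],
    attempts_per_endpoint: int = 2,
) -> T:
    """Try each endpoint in order, retrying dropped connections before moving on."""
    last_error: Optional[Exception] = None
    for endpoint in endpoints:
        for _ in range(attempts_per_endpoint):
            try:
                return operation(endpoint)
            except ConnectionError as exc:
                last_error = exc
    raise ConnectionError("all endpoints failed") from last_error
```

A caller would pass a function that takes an endpoint, borrows a connection from that endpoint's pool, and runs the query, so the routing decision stays in one place.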

Four layers of cache

Schema and sharding choices got us isolation and headroom. They did not get us to 200ms on their own. Caching did, and we layered it deliberately rather than throwing Redis at everything.

· CDN for static assets. CloudFront fronts videos, images, JS, and CSS. 94% cache hit rate, so most of that never touches our servers.

· Redis for hot application data. The frequently-read stuff that changes occasionally.

· In-memory for tenant config. Held in application memory with a 5-minute TTL, because config changes almost never and a process-local read beats a network hop every time.

· Query result cache for the expensive aggregations. Leaderboards and progress summaries that are costly to compute and fine to serve slightly stale.
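Of the four layers, the in-memory tenant config cache is the simplest to show. The sketch below is a process-local cache with the 5-minute TTL mentioned above; the class name, the loader callback, and the injectable clock are our assumptions for illustration, and the code is not thread-safe as written.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TenantConfigCache:
    """Process-local cache that reloads a tenant's config after a TTL expires."""

    def __init__(
        self,
        loader: Callable[[str], Any],
        ttl_seconds: float = 300.0,  # the 5-minute TTL from the text
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}  # tenant -> (expiry, value)

    def get(self, tenant_id: str) -> Any:
        """Return the cached config, calling the loader only on a miss or expiry."""
        now = self._clock()
        entry = self._entries.get(tenant_id)
        if entry is not None and now < entry[0]:
            return entry[1]
        value = self._loader(tenant_id)
        self._entries[tenant_id] = (now + self._ttl, value)
        return value

    def invalidate(self, tenant_id: str) -> None:
        """Drop one tenant's entry, for example right after an admin edits settings."""
        self._entries.pop(tenant_id, None)
```

The loader would typically read from Redis or the database. The trade-off is that a config change can take up to one TTL to reach every app server unless something calls invalidate.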

Average response time went from 450ms to 89ms, and P95 from 1.2 seconds to 180ms. Database CPU dropped from 85% to 25%, and the overall cache hit rate settled around 78%. The biggest wins weren't clever. They were boring: caching course content (which rarely changes) and tenant configuration (which changes almost never). The boring caches are the ones that paid.

Where it landed

Six months in production, every SLA target met or beaten. 50,000+ monthly active learners across 23 tenant organizations, 2.1 million course completions tracked, 400+ concurrent users at peak. Uptime came in at 99.95% against the 99.9% target. Average response time 89ms, P95 180ms, both under the 200ms line. Zero data isolation incidents.

It runs lean: 3 application servers that auto-scale to 6 during peaks, 2 database clusters each with a primary and replica, and a 3-node Redis cluster. Monthly infrastructure cost is about $4,200. The original architecture was projected at $12,000 for the same load, so the rebuild saves roughly $7,800 a month, a 65% reduction.

Three things we'd change next time. Build RLS from day one instead of bolting it on later. Stand up per-tenant monitoring early; we burned weeks on bugs that a per-tenant dashboard would have made obvious in minutes. And put PgBouncer in front of PostgreSQL from the start instead of pooling at the application layer, which would have simplified a chunk of our own code.
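To make the per-tenant monitoring point concrete, here is a minimal sketch of tagging latency samples by tenant and ranking tenants by P95. The TenantLatencyTracker class is hypothetical and keeps everything in memory; a real setup would more likely attach a tenant label to metrics in whatever monitoring system is already in place.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple


class TenantLatencyTracker:
    """Collects request latencies per tenant and reports nearest-rank P95."""

    def __init__(self) -> None:
        self._samples: DefaultDict[str, List[float]] = defaultdict(list)

    def record(self, tenant_id: str, latency_ms: float) -> None:
        """Store one request's latency under the tenant that made it."""
        self._samples[tenant_id].append(latency_ms)

    def p95(self, tenant_id: str) -> float:
        """Nearest-rank 95th percentile of the tenant's recorded latencies."""
        samples = sorted(self._samples.get(tenant_id, []))
        if not samples:
            raise ValueError(f"no samples for tenant {tenant_id!r}")
        rank = (95 * len(samples) + 99) // 100  # integer ceil(0.95 * n)
        return samples[rank - 1]

    def slowest(self, n: int = 3) -> List[Tuple[str, float]]:
        """Return the n tenants with the highest P95, worst first."""
        ranked = sorted(((self.p95(t), t) for t in self._samples), reverse=True)
        return [(tenant, value) for value, tenant in ranked[:n]]
```

A platform-wide average can look healthy while one tenant is struggling. Breaking the same samples out by tenant is what surfaces that case.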

The architecture has held. We've onboarded 8 more tenants since launch with no changes to the core, just adding database capacity as the load called for it. That's the real test of a multi-tenant design: growth becomes a capacity question, not a rewrite.
