Scaling Multi-Tenant LMS Architecture: Lessons from 50K Monthly Active Learners
How we designed a multi-tenant learning management system that scales horizontally while maintaining data isolation, low latency, and high availability for enterprise clients.
We built a multi-tenant LMS that scales horizontally using database sharding, caching strategies, and isolated tenant architectures. The platform now serves 50K+ monthly active learners across 20+ institutions with 99.9% uptime and sub-200ms response times.
The Challenge
When we started this project, our client had a working LMS—but it was built for a single institution. They'd won several enterprise contracts and suddenly needed to support 20+ independent organizations on the same platform. The existing architecture couldn't handle it.The original system's limitations:- Monolithic database — All data in one schema, no tenant separation- Vertical scaling only — Running on one increasingly expensive server- No isolation — A slow query from one tenant affected everyone- Manual deployments — Updates required coordinated downtime- Hard-coded configurations — Tenant-specific settings were scattered through the codeKey requirements for the new system:- Support 20+ independent institutions (tenants)- Complete data isolation between tenants (contractual requirement)- Handle 50K+ concurrent learners during peak periods- Sub-200ms response times for all user-facing operations- 99.9% uptime SLA (only 8.7 hours downtime per year allowed)What failure looked like:During a load test simulating 10,000 concurrent users, the original system:- Average response time: 2.3 seconds (target was 200ms)- Error rate: 12% (timeouts and database locks)- Database CPU: 100% sustained- Several complete outages during the testWe needed a fundamentally different architecture.
Architecture Decisions
The first major decision was choosing a multi-tenancy strategy. Each approach has tradeoffs:Option 1: Shared Database, Shared Schema- All tenants in the same tables with a `tenant_id` column- Pros: Simple deployment, efficient resource use- Cons: Risk of data leakage, noisy neighbor problems, complex queries- *Rejected:* Too risky for our data isolation requirementsOption 2: Shared Database, Separate Schemas- Each tenant gets their own PostgreSQL schema- Pros: Strong isolation, per-tenant backups, moderate resource efficiency- Cons: More complex migrations, connection management- *Selected:* Best balance of isolation and operational efficiencyOption 3: Separate Databases Per Tenant- Complete database isolation per tenant- Pros: Maximum isolation, independent scaling- Cons: Expensive, complex operations, high overhead- *Rejected:* Too expensive to manage at 20+ tenantsOur implementation:
1 // Tenant resolution middleware
2 const tenantMiddleware = async (req: Request, res: Response, next: NextFunction) => {
3 // Extract tenant from subdomain: acme.learn.platform.com
4 const subdomain = req.hostname.split('.')[0];
5
6 // Lookup tenant configuration
7 const tenant = await tenantCache.get(subdomain) ||
8 await db.tenant.findUnique({ where: { subdomain } });
9
10 if (!tenant) {
11 return res.status(404).json({ error: 'Tenant not found' });
12 }
13
14 // Set schema for this request
15 req.tenantId = tenant.id;
16 req.tenantSchema = `tenant_${tenant.id}`;
17
18 // Configure database connection for this tenant's schema
19 await setSearchPath(req.tenantSchema);
20
21 next();
22 };Row-Level Security as a safety net:Even with separate schemas, we implemented PostgreSQL row-level security as defense in depth:
1 -- Enable RLS on sensitive tables
2 ALTER TABLE courses ENABLE ROW LEVEL SECURITY;
3
4 -- Policy ensures tenant isolation even if schema switching fails
5 CREATE POLICY tenant_isolation ON courses
6 USING(tenant_id = current_setting('app.current_tenant')::uuid);This means even a bug in our application code can't leak data between tenants—the database itself enforces isolation.
Database Sharding Strategy
With 50K+ learners generating course progress, quiz attempts, and activity logs, our database grew quickly. We implemented a sharding strategy to handle the load.Sharding approach:We use tenant ID as the partition key, distributing tenants across database clusters. Large tenants (>5000 learners) get dedicated clusters; smaller tenants share clusters.
1 // Connection routing based on tenant size and load
2 class TenantConnectionPool {
3 private pools: Map<string, Pool> = new Map();
4
5 async getConnection(tenantId: string): Promise<PoolClient> {
6 const routing = await this.getRouting(tenantId);
7
8 // Check if tenant has dedicated pool
9 if (routing.dedicated) {
10 return this.getDedicatedPool(routing.clusterId).connect();
11 }
12
13 // Use shared pool for smaller tenants
14 return this.getSharedPool(routing.clusterId).connect();
15 }
16
17 private async getRouting(tenantId: string): Promise<TenantRouting> {
18 // Check cache first
19 const cached = await redis.get(`routing:${tenantId}`);
20 if (cached) return JSON.parse(cached);
21
22 // Calculate routing based on tenant metrics
23 const tenant = await this.getTenantMetrics(tenantId);
24 const routing = {
25 dedicated: tenant.learnerCount > 5000,
26 clusterId: this.selectCluster(tenant),
27 };
28
29 await redis.setex(`routing:${tenantId}`, 3600, JSON.stringify(routing));
30 return routing;
31 }
32 }Read replicas for analytics:Heavy analytics queries (completion reports, engagement metrics) run against read replicas, keeping the primary database responsive for user interactions:
1 // Route queries based on type
2 const getConnection = async (queryType: 'write' | 'read' | 'analytics') => {
3 switch (queryType) {
4 case 'write':
5 return primaryPool.connect();
6 case 'read':
7 return readReplicaPool.connect();
8 case 'analytics':
9 return analyticsReplicaPool.connect(); // Higher latency tolerance
10 }
11 };Connection pooling:Each application server maintains connection pools per cluster, with automatic failover:- 10 connections per pool (minimum)- 50 connections per pool (maximum)- 30 second idle timeout- Automatic retry on connection failure
Caching Layers
Database optimization only gets you so far. We implemented caching at multiple layers to achieve sub-200ms response times.Layer 1: CDN (Static Assets)CloudFront serves all static assets—videos, images, JavaScript, CSS. Cache hit rate: 94%.Layer 2: Redis (Application Cache)Frequently accessed data cached in Redis:
1 // Course content rarely changes - cache aggressively
2 const getCourseContent = async (courseId: string): Promise<Course> => {
3 const cacheKey = `course:${courseId}:content`;
4
5 // Try cache first
6 const cached = await redis.get(cacheKey);
7 if (cached) return JSON.parse(cached);
8
9 // Fetch from database
10 const course = await db.course.findUnique({
11 where: { id: courseId },
12 include: { modules: true, lessons: true }
13 });
14
15 // Cache for 1 hour(content changes are rare)
16 await redis.setex(cacheKey, 3600, JSON.stringify(course));
17
18 return course;
19 };
20
21 // Invalidate on content update
22 const updateCourseContent = async (courseId: string, data: CourseUpdate) => {
23 await db.course.update({ where: { id: courseId }, data });
24 await redis.del(`course:${courseId}:content`);
25 };Layer 3: In-Memory (Configuration)Tenant configuration cached in application memory with 5-minute TTL:
1 const tenantConfigCache = new NodeCache({ stdTTL: 300 });
2
3 const getTenantConfig = async (tenantId: string) => {
4 const cached = tenantConfigCache.get(tenantId);
5 if (cached) return cached;
6
7 const config = await db.tenantConfig.findUnique({ where: { tenantId } });
8 tenantConfigCache.set(tenantId, config);
9 return config;
10 };Layer 4: Query Result CachePostgreSQL query cache for complex aggregations (leaderboards, progress summaries).Performance impact:| Metric | Before Caching | After Caching ||--------|----------------|---------------|| Avg Response Time | 450ms | 89ms || P95 Response Time | 1.2s | 180ms || Database Load | 85% CPU | 25% CPU || Cache Hit Rate | N/A | 78% |The biggest wins came from caching course content (which changes rarely) and tenant configuration (which changes almost never).
Results & Metrics
After 6 months in production, the platform consistently meets all SLA requirements:Scale achieved:- 50,000+ monthly active learners- 23 active tenant organizations- 2.1 million course completions tracked- 400+ concurrent users during peak hoursPerformance metrics:- 99.95% uptime (exceeded 99.9% target)- 89ms average response time (target was 200ms)- 180ms P95 response time- Zero data isolation incidentsInfrastructure efficiency:- 3 application servers (auto-scaling to 6 during peaks)- 2 database clusters (primary + replica each)- Redis cluster with 3 nodes- Monthly infrastructure cost: ~$4,200 (vs. projected $12,000 for original architecture)What we'd do differently:1. Start with RLS from day one — Adding row-level security to existing tables was more complex than implementing it upfront2. Invest in tenant-level monitoring earlier — We spent weeks debugging issues that would have been obvious with per-tenant dashboards3. Use connection pooling at the edge — PgBouncer in front of PostgreSQL would have simplified our application-level poolingThe architecture continues to scale. We've since onboarded 8 more tenants with no changes to the core infrastructure—just adding database capacity as needed.
We help teams design and ship production-grade software in eLearning, fintech, and AI. Let's talk about your project.
Book a call