Designing a High-Scale Adaptive Learning Platform: A Deeper Dive
Let’s be real: building an AI feature for a hackathon is easy. You validate an idea, wrap a prompt in a Python script, hit the OpenAI API, and you’re done.
Building an AI-native platform that serves 50 million daily active users? That is an entirely different beast.
In the ed-tech space—think Duolingo, Khan Academy, or wildly popular language learning apps—the core value proposition is personalization. The system needs to look at a user’s specific history, their weak spots, and their forgetting curve, and then curate the perfect next lesson. It needs to be challenging enough to teach, but easy enough to prevent churn.
If you try to do this by generating every lesson in real-time using an LLM, your architecture will fail.
In Chapter 4 of my book, System Design for the LLM Era, I break down the architecture of a massive-scale Adaptive Learning Platform. We aren’t looking at toy examples here. We are designing for ~340k requests per second (RPS) at peak load, with a strict latency requirement of under 500ms.
Here is the architectural blueprint for how to solve it.
Check out my new book on System Design for the LLM Era - it is a blueprint for production systems integrating AI
The Trap of Real-Time Generation
The most common mistake engineers make when integrating GenAI is coupling the user request directly to the LLM inference.
In a standard web app, a request-response cycle might look like Client -> Server -> DB -> Client. It’s fast and deterministic. In a naive AI app, it looks like Client -> Server -> LLM (Wait 3-10s) -> Client.
For a learning platform, if a user finishes a lesson and hits Next, they expect the next screen instantly. If they stare at a spinner for 6 seconds while GPT-4 generates a custom exercise, they close the app. Furthermore, at 340k RPS, hitting an external LLM provider for every interaction is financially ruinous and architecturally fragile.
To solve this, we need to decouple Content Generation (slow, expensive, creative) from Content Serving (fast, cheap, deterministic).
We split our architecture into two distinct systems:
System I: The Offline Content Pipeline (The Factory)
System II: The Online Serving Path (The Tutor)
System I: The Offline Content Pipeline
The first goal is safety and quality. We cannot allow an LLM to hallucinate false grammar rules or inappropriate vocabulary in real-time. We need a “human-in-the-loop” workflow that scales.
We treat content generation as an asynchronous supply chain.
1. The Async Trigger
Admins or curriculum designers initiate generation jobs (e.g., “Generate 100 B1-level Spanish questions about ‘Travel’”). We don’t block on this. The API Gateway pushes a message to a generation_jobs_queue and returns a 202 Accepted.
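To make this concrete, here is a minimal sketch of that trigger endpoint, assuming SQS as the queue and FastAPI at the gateway. The route, queue URL, and job fields are illustrative assumptions, not the book’s exact implementation.

```python
# Hedged sketch: accept a generation job, enqueue it, return 202 immediately.
import boto3
from fastapi import FastAPI, status
from pydantic import BaseModel

app = FastAPI()
sqs = boto3.client("sqs")
# Hypothetical queue URL for generation_jobs_queue.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/generation_jobs_queue"

class GenerationJob(BaseModel):
    language: str     # e.g. "es"
    cefr_level: str   # e.g. "B1"
    topic: str        # e.g. "Travel"
    count: int        # e.g. 100

@app.post("/v1/generation-jobs", status_code=status.HTTP_202_ACCEPTED)
def enqueue_generation_job(job: GenerationJob):
    # Don't block on the LLM: push the job to the queue and acknowledge receipt.
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=job.model_dump_json())
    return {"status": "queued"}
```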
2. The Orchestrator Workers
Workers consume these jobs. This is where the “LLM Orchestrator” lives. It doesn’t just “call the API.” It handles:
Prompt Engineering: Wrapping the request with strict JSON schema constraints and educational guidelines.
Model Routing: Deciding which model to use. Simple grammar variation? Use a small, cheap model (like Haiku or a fine-tuned Llama). Complex scenario generation? Route to GPT-4.
Resilience: Handling the inevitable 429 Rate Limits and 5xx errors from providers using exponential backoff and circuit breakers.
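A rough sketch of what that routing and retry logic might look like. The model names, the ProviderError class, and the route_model heuristic are illustrative stand-ins; the real provider SDK call is abstracted behind call_llm.

```python
# Hedged sketch of model routing plus exponential backoff with jitter.
import random
import time

RETRYABLE_STATUSES = {429, 500, 502, 503}

class ProviderError(Exception):
    """Illustrative wrapper around a provider's HTTP error."""
    def __init__(self, status_code: int, message: str = ""):
        super().__init__(message)
        self.status_code = status_code

def call_llm(model: str, job: dict) -> dict:
    """Placeholder for the real provider SDK call (OpenAI, Anthropic, etc.)."""
    raise NotImplementedError

def route_model(job: dict) -> str:
    # Simple grammar variation -> small, cheap model; open-ended generation -> frontier model.
    if job.get("task") == "grammar_variation":
        return "small-cheap-model"
    return "large-reasoning-model"

def generate_with_backoff(job: dict, max_attempts: int = 5) -> dict:
    model = route_model(job)
    for attempt in range(max_attempts):
        try:
            return call_llm(model, job)
        except ProviderError as err:
            if err.status_code not in RETRYABLE_STATUSES or attempt == max_attempts - 1:
                raise
            # Exponential backoff with jitter so retries don't stampede the provider.
            time.sleep((2 ** attempt) + random.random())
```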
3. The Staging Area & Human Review
Generated content lands in a questions_staging database. It is not live yet. Domain experts review high-risk or low-confidence batches via an internal portal. Only ACCEPTED questions are ingested into the main Question Bank via a nightly job.
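A hedged sketch of that nightly promotion step, assuming a questions_staging table with a review_status column and a unique question_id in the Question Bank; all table and column names here are illustrative.

```python
# Hedged sketch: promote ACCEPTED staging rows into the live Question Bank.
import psycopg2

PROMOTE_ACCEPTED = """
    INSERT INTO question_bank (question_id, language, cefr_level, topic, payload)
    SELECT question_id, language, cefr_level, topic, payload
    FROM questions_staging
    WHERE review_status = 'ACCEPTED'
    ON CONFLICT (question_id) DO NOTHING;
"""

def run_nightly_ingest(dsn: str) -> None:
    with psycopg2.connect(dsn) as conn:   # commits on clean exit
        with conn.cursor() as cur:
            cur.execute(PROMOTE_ACCEPTED)
            # Mark promoted rows so the next run skips them.
            cur.execute(
                "UPDATE questions_staging SET review_status = 'PUBLISHED' "
                "WHERE review_status = 'ACCEPTED';"
            )
```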
This pipeline ensures that our “Online System” always has a deep pool of high-quality, pre-validated building blocks.
System II: The Online Serving Path
This is where the engineering magic happens. How do we serve a personalized lesson to 50M users in <500ms?
We use a pattern I call Proactive Curation.
Instead of asking “What should the user see now?” at the moment they click the button, we ask “What will the user likely need next?” while they are busy doing something else.
The Warm Path: Pre-Computation
For an active user, the next lesson should already be waiting in memory.
The Trigger: When a user completes Lesson #4, the application fires an event. This event drops a message into a low-priority refill_check_queue.
The Personalization Job: A background worker picks this up. It grabs the user’s rich context: their “Skill Strength” score from DynamoDB and their recent mistake history.
The Curator: The worker selects a set of candidate questions from the Question Bank that target the user’s weak spots. It sequences them into a coherent lesson plan (e.g., 20 specific question IDs).
The Cache (Redis List): This is the critical optimization. We push this sequence of Question IDs into a Redis List (precomputed_lesson_playlists) keyed by the UserID.
Now, when the user actually clicks “Start Lesson #5”, the Lesson Curator Service simply executes an atomic LPOP operation on Redis.
Latency: Sub-millisecond.
LLM Dependencies: Zero.
Cost: Minimal.
The user gets a highly personalized experience instantly. The heavy lifting happened 3 minutes ago in a background worker.
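Here is a minimal sketch of both halves of the warm path using redis-py. The key naming, the 20-question playlist size, and the 7-day expiry are illustrative assumptions; lesson curation itself is abstracted away.

```python
# Hedged sketch of the warm path: background refill (RPUSH) and online serving (LPOP).
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def playlist_key(user_id: str) -> str:
    # Illustrative key scheme for precomputed_lesson_playlists.
    return f"precomputed_lesson_playlists:{user_id}"

# --- Background worker (runs after the refill_check_queue event) ---
def refill_playlist(user_id: str, question_ids: list[str]) -> None:
    key = playlist_key(user_id)
    # RPUSH preserves lesson order; the TTL keeps stale plans from lingering.
    r.rpush(key, *question_ids)
    r.expire(key, 7 * 24 * 3600)

# --- Online serving path (user clicks "Start Lesson") ---
def pop_next_lesson(user_id: str, lesson_size: int = 20) -> list[str] | None:
    # Atomic LPOP with a count (Redis >= 6.2): sub-millisecond, no LLM in the request path.
    return r.lpop(playlist_key(user_id), lesson_size)
```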
The Cold Path: Handling Drift
But what if the user returns after a month? Their cache has expired (TTL). We cannot make them wait for the background job. We trigger the Cold Path.
Cache Miss: The LPOP returns nil.
Synchronous Fallback: We bypass the complex AI curation. We execute a fast, rule-based SQL query against the Question Bank (e.g., “Get 10 random questions for Level A2 that this user hasn’t seen”).
Serve & Refill: We serve this “good enough” lesson immediately. Simultaneously, we fire the async refill trigger to wake up the Personalization Job.
By the time the user finishes this first generic lesson, their custom “Warm Path” queue is full again.
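A rough sketch of the serving handler with this cold-path fallback, reusing the playlist key scheme from the sketch above. The user_seen_questions table, the refill queue URL, and the SQL itself are hypothetical placeholders.

```python
# Hedged sketch: warm-path LPOP first, rule-based SQL fallback on a cache miss,
# then an async refill trigger so the next lesson is personalized again.
import json
import boto3
import psycopg2
import redis

r = redis.Redis(decode_responses=True)
sqs = boto3.client("sqs")
REFILL_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/refill_check_queue"  # hypothetical

def start_lesson(user_id: str, level: str, conn) -> list[str]:
    key = f"precomputed_lesson_playlists:{user_id}"
    question_ids = r.lpop(key, 20)               # warm path: precomputed playlist
    if question_ids:
        return question_ids

    # Cold path: fast, rule-based query instead of AI curation.
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT q.question_id
            FROM question_bank q
            WHERE q.cefr_level = %s
              AND q.question_id NOT IN (
                  SELECT question_id FROM user_seen_questions WHERE user_id = %s)
            ORDER BY random()
            LIMIT 10;
            """,
            (level, user_id),
        )
        question_ids = [row[0] for row in cur.fetchall()]

    # Serve the "good enough" lesson now; refill the warm path asynchronously.
    sqs.send_message(QueueUrl=REFILL_QUEUE_URL, MessageBody=json.dumps({"user_id": user_id}))
    return question_ids
```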
The Resilience Layer: Designing for Failure
When you build on top of LLMs, you are building on dependencies that are non-deterministic and prone to latency spikes. Your system will fail if you don’t architect for resilience.
In the book, I detail the Model Router and Circuit Breaker patterns used in the Orchestrator.
If your primary model provider (e.g., OpenAI) starts hanging, your Circuit Breaker must detect the timeout (say, >4 seconds). It shouldn’t just fail; it should automatically reroute the request to a backup provider (e.g., Anthropic or a hosted model).
We also implement a Tiered Fallback Strategy:
Tier 1: High-Cost/High-Reasoning Model (e.g., GPT-4) for complex personalization.
Tier 2: Faster/Cheaper Model (e.g., Claude Haiku) if Tier 1 is slow.
Tier 3: Rule-based heuristics if AI is completely unavailable.
This ensures that even in a total outage of your AI provider, your platform stays up.
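A minimal sketch of that tiered fallback, assuming each tier is simply a callable that raises (or hangs) on failure. The 4-second budget, tier ordering, and the rule-based Tier 3 stub are illustrative; a production circuit breaker would also track consecutive failures and stop calling a provider that keeps timing out.

```python
# Hedged sketch: walk the tiers under a timeout budget, ending at rule-based heuristics.
from concurrent.futures import ThreadPoolExecutor

TIER_TIMEOUT_SECONDS = 4.0
_executor = ThreadPoolExecutor(max_workers=8)

def rule_based_lesson(request: dict) -> dict:
    """Tier 3: deterministic heuristics with no external dependency."""
    return {"source": "rules", "question_ids": []}

def generate_with_fallback(request: dict, tiers: list) -> dict:
    # tiers: e.g. [call_gpt4, call_haiku], ordered from most to least capable.
    for call_tier in tiers:
        future = _executor.submit(call_tier, request)
        try:
            return future.result(timeout=TIER_TIMEOUT_SECONDS)
        except Exception:
            continue  # slow or failing tier: fall through to the next one
    return rule_based_lesson(request)
```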
The Data Tier: Matching Database to Workload
Scale is often a data problem. At 340k RPS, a single monolithic database will melt. We need to align our storage engines with their access patterns.
User Progress (DynamoDB): Writes are massive and bursty. Every time a user answers a question, we write to the log. We use DynamoDB because it handles massive write throughput horizontally. We key by UserID (partition key) and SessionID (sort key) for fast lookups.
Question Bank (PostgreSQL + pgvector): This is read-heavy but requires complex queries. We need structured data (difficulty level, language) and semantic search (finding similar questions using vector embeddings). PostgreSQL with the pgvector extension gives us the best of both worlds without needing a separate niche vector database.
Curated Cache (Redis): This requires the lowest possible latency for the “Warm Path.” Redis Lists are the perfect data structure for our FIFO lesson buffers.
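As an illustration of why pgvector earns its place, here is a hedged sketch of a Question Bank lookup that mixes structured filters with a vector similarity search. The table, columns, embedding source, and 20-row limit are assumptions for the example.

```python
# Hedged sketch: structured filters + pgvector L2-distance ordering in one query.
import psycopg2

FIND_SIMILAR = """
    SELECT question_id
    FROM question_bank
    WHERE language = %s
      AND cefr_level = %s
    ORDER BY embedding <-> %s::vector   -- pgvector L2 distance to the target skill embedding
    LIMIT 20;
"""

def candidate_questions(conn, language: str, level: str, skill_embedding: list[float]) -> list[str]:
    vector_literal = "[" + ",".join(str(x) for x in skill_embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(FIND_SIMILAR, (language, level, vector_literal))
        return [row[0] for row in cur.fetchall()]
```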
Why This Matters
This architecture allows us to treat AI not as a bottleneck, but as an offline intelligence engine.
By decoupling the generation (System I) from the serving (System II), we achieve the “impossible” triad:
High Personalization: Every user gets a unique curriculum.
Low Latency: Lessons load instantly via Redis.
Cost Efficiency: We don’t burn tokens on real-time traffic; we batch-process in the background.
This is just a glimpse into Chapter 4. The full book, System Design for the LLM Era, goes much deeper. It covers:
Chapter 3 (AI-Native IDEs): How to handle privacy and Merkle-tree syncing for codebases.
Chapter 5 (E-Commerce Search): How to build hybrid RAG systems for semantic product discovery.
Chapter 6 (Customer Support): Implementing GraphRAG to solve complex reasoning tickets.
We are moving past the era of Hackathon AI engineering and into the era of Production Systems AI Engineering. If you want the blueprint for building production-grade AI systems, this is the book for you. Preorder on Amazon today.

