How to design a personalized content recommendation engine
A recommendation engine is a system that analyzes user behavior, preferences, and content attributes to surface personalized suggestions—predicting what a user wants to see before they search for it. This is one of the most business-critical system design problems: Netflix's recommendations drive 80% of viewing hours, Amazon's engine generates 35% of revenue, and YouTube's algorithm produces 70% of total watch time. In system design interviews, the recommendation engine tests everything: data pipelines, ML infrastructure, caching, real-time serving at scale, and the ability to balance competing objectives like relevance, diversity, and freshness. Every FAANG company asks some variant of this question because it maps directly to their core product.
Key Takeaways
- The modern recommendation architecture is a multi-stage funnel: candidate generation (retrieve thousands of items cheaply) → ranking (score hundreds with heavier models) → re-ranking (apply business rules for diversity, freshness, and content policy). This three-stage pipeline is the industry standard at Netflix, YouTube, and Spotify.
- Two filtering approaches drive recommendations: collaborative filtering (users who liked X also liked Y) and content-based filtering (this item has similar attributes to items you liked). Production systems use hybrid approaches combining both.
- The cold start problem—how to recommend content to new users or surface new items without interaction history—is the most common follow-up question. Solve it with popularity-based defaults, demographic profiles, or onboarding preference surveys.
- Latency requirements are strict: recommendations must be served within 100–200ms. This demands pre-computation of candidate sets, aggressive caching, and lightweight ranking models for real-time serving.
- In interviews, discuss the entire pipeline—not just the ML model. Data collection, feature engineering, model training, serving infrastructure, and monitoring are equally important. Interviewers evaluate end-to-end systems thinking, not algorithmic knowledge alone.
Step 1: Requirements and Scope
Functional requirements:
- Personalized recommendations: Suggest content tailored to each user's preferences and behavior.
- Real-time updates: Adjust recommendations as the user interacts with the platform (watches a video, clicks an article, purchases a product).
- Multiple recommendation surfaces: Home feed, "More Like This," "Trending," "Because You Watched X."
- New content surfacing: Ensure recently added content reaches relevant users even without interaction history.
Non-functional requirements:
- Latency: Serve personalized recommendations within 200ms (p99).
- Scalability: Support 100M+ daily active users with a catalog of 10M+ items.
- Availability: 99.99% uptime; if the recommendation service fails, users see no content suggestions.
- Freshness: Incorporate user interactions from the last few minutes into recommendations.
- Relevance: Measurable improvement in engagement metrics (click-through rate, watch time, conversion rate).
Interview tip: Ask the interviewer: "What type of content are we recommending—videos, products, articles?" and "Should recommendations be personalized per user or globally ranked?" These scoping questions determine whether you need real-time user signals, collaborative filtering, or simpler popularity-based ranking.
Step 2: Back-of-Envelope Estimation
Users: 100M DAU, each requesting recommendations 10 times per day = 1B recommendation requests/day = ~11,600 QPS average, ~35,000 QPS peak.
Catalog: 10M items. Each item has a feature vector of approximately 1 KB (embeddings, metadata, interaction counts). Total catalog feature storage: ~10 GB—fits entirely in memory.
User profiles: 100M users × 5 KB per profile (interaction history, embeddings, preferences) = 500 GB. Too large for a single machine; requires sharding or a distributed feature store.
Model inference: Ranking 500 candidates per request with a lightweight model must complete within 50ms to stay within the 200ms total latency budget (leaving 150ms for candidate retrieval, feature lookup, and network overhead).
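A quick sanity check of these estimates in Python; the 3x peak-to-average ratio and the per-record sizes are the assumptions stated above:

```python
# Back-of-envelope check for the numbers above.
# Assumptions: 100M DAU, 10 requests/user/day, ~3x peak-to-average ratio.
DAU = 100_000_000
REQUESTS_PER_USER_PER_DAY = 10
SECONDS_PER_DAY = 86_400

requests_per_day = DAU * REQUESTS_PER_USER_PER_DAY            # 1B/day
avg_qps = requests_per_day / SECONDS_PER_DAY                  # ~11,600
peak_qps = avg_qps * 3                                        # ~35,000

catalog_storage_gb = 10_000_000 * 1_024 / 1e9                 # 10M items x ~1 KB ≈ 10 GB
profile_storage_gb = 100_000_000 * 5 * 1_024 / 1e9            # 100M users x ~5 KB ≈ 500 GB

print(f"avg QPS:  {avg_qps:,.0f}")
print(f"peak QPS: {peak_qps:,.0f}")
print(f"catalog feature storage: ~{catalog_storage_gb:.0f} GB")
print(f"user profile storage:    ~{profile_storage_gb:.0f} GB")
```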
Step 3: The Three-Stage Recommendation Pipeline
This funnel architecture is the industry standard used by Netflix, YouTube, Spotify, and Amazon.
Stage 1: Candidate Generation
Purpose: From a catalog of 10M items, retrieve 500–1,000 candidates that could be relevant to the user. This stage must be fast (under 50ms) and recall-oriented—it is better to include too many candidates than to miss a relevant one.
Approaches:
Collaborative filtering: Find items liked by users similar to the current user. Matrix factorization (ALS, SVD) and nearest-neighbor search in embedding space are common techniques. "Users who watched Breaking Bad also watched Better Call Saul."
Content-based filtering: Match item attributes (genre, tags, topics, embeddings) to the user's preference profile. "You watched three sci-fi thrillers this week, here are more sci-fi thrillers."
Two-tower model (industry standard in 2026): Separate neural networks encode user features and item features into the same embedding space. At serving time, the user embedding is computed once, and an approximate nearest neighbor (ANN) search finds the closest item embeddings. This is the architecture Netflix, YouTube, and Meta use for candidate generation at scale.
Multiple generators run in parallel: A production system uses 5–20 candidate generators simultaneously—one for collaborative filtering, one for content similarity, one for trending content, one for items from subscribed channels, one for geographic popularity. Results are merged and deduplicated before passing to the ranking stage.
Interview application: "I would use a two-tower model for the primary candidate generator. The user tower encodes interaction history, demographics, and context into a 128-dimensional embedding. The item tower encodes content features, popularity signals, and freshness into the same space. At serving time, I perform an ANN search using FAISS to retrieve the 500 nearest items in under 10ms. I would also run a trending generator and a subscription-based generator in parallel for diversity."
Stage 2: Ranking
Purpose: From the 500 candidates, score each one with a more computationally expensive model to predict the probability that the user will engage (click, watch, purchase). Output: a ranked list of 50–100 items.
Model architecture: A deep learning model (typically a multi-layer neural network or gradient-boosted decision tree) takes rich features as input: user features (demographics, past interactions, session context), item features (content type, recency, popularity, quality score), and cross features (user-item interaction history, time since last interaction with similar content).
Feature store: A critical infrastructure component that serves precomputed features to the ranking model at inference time with single-digit millisecond latency. A feature store (Feast, Tecton) maintains two tables: an offline store (for model training with historical features) and an online store (for real-time serving with current features in Redis or DynamoDB).
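A minimal sketch of the online read path, assuming features are kept as Redis hashes keyed by user and item IDs with JSON-encoded values; the key names and host are hypothetical:

```python
import json
import redis  # pip install redis

r = redis.Redis(host="feature-store.internal", port=6379, decode_responses=True)

def get_user_features(user_id: str) -> dict:
    # Single hash read; typically a single-digit-millisecond lookup.
    raw = r.hgetall(f"user_features:{user_id}")
    return {name: json.loads(value) for name, value in raw.items()}

def get_item_features(item_ids: list[str]) -> list[dict]:
    # Pipeline the per-item reads so 500 candidates cost one network round trip.
    pipe = r.pipeline()
    for item_id in item_ids:
        pipe.hgetall(f"item_features:{item_id}")
    return [{name: json.loads(value) for name, value in raw.items()}
            for raw in pipe.execute()]
```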
Interview application: "The ranking model is a two-layer neural network with 256 hidden units, trained on click and watch-time labels. It receives 150 features per user-item pair from the feature store: 50 user features, 50 item features, and 50 cross features. The model predicts the probability of a 30-second watch. Inference on 500 candidates takes approximately 30ms on a GPU-backed serving instance."
Stage 3: Re-Ranking
Purpose: Apply business rules, diversity constraints, and content policy filters to the ranked list before presenting it to the user. Output: the final 10–20 recommendations displayed on the screen.
What re-ranking handles:
Diversity: Prevent the list from being dominated by one genre or content type. If 8 of the top 10 ranked items are action movies, inject 2–3 items from other genres to broaden the user's experience.
Freshness: Boost recently published content to ensure new items get initial exposure and feedback, even if the model has not yet learned they are relevant.
Business rules: Promote sponsored content or premium titles in specific positions. Suppress content the user has already seen. Filter out items violating content policy.
Exploration vs exploitation: Reserve 10–20% of recommendation slots for items from categories the user has not yet engaged with, exploring new interests rather than exploiting known preferences. Multi-armed bandit algorithms (Thompson sampling, epsilon-greedy) balance this trade-off.
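A minimal re-ranking sketch combining a per-genre cap, seen-item suppression, and reserved exploration slots; the 20% explore share and the cap of 3 items per genre are assumed policy values, not fixed rules:

```python
import random
from collections import Counter

def rerank(ranked_items, seen_ids, explore_pool, slots=10,
           max_per_genre=3, explore_share=0.2):
    """ranked_items: list of dicts like {"id": ..., "genre": ...}, best first."""
    explore_slots = int(slots * explore_share)
    exploit_slots = slots - explore_slots

    final, genre_counts = [], Counter()
    for item in ranked_items:
        if len(final) == exploit_slots:
            break
        if item["id"] in seen_ids:                        # suppress already-seen content
            continue
        if genre_counts[item["genre"]] >= max_per_genre:  # diversity constraint
            continue
        final.append(item)
        genre_counts[item["genre"]] += 1

    # Fill the remaining slots with exploration candidates (fresh items or
    # genres the user has not engaged with), epsilon-greedy style.
    chosen_ids = {i["id"] for i in final}
    explore_candidates = [i for i in explore_pool
                          if i["id"] not in seen_ids and i["id"] not in chosen_ids]
    final += random.sample(explore_candidates,
                           min(explore_slots, len(explore_candidates)))
    return final
```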
Step 4: Handling the Cold Start Problem
The cold start problem is the most frequently asked follow-up question in recommendation system interviews.
| Scenario | Problem | Solution |
|---|---|---|
| New user | No interaction history to personalize from | Popularity-based defaults, demographic-based preferences, onboarding survey ("Select genres you enjoy") |
| New item | No interaction data to rank with | Content-based features (genre, tags, description embeddings), editorial boosting, explore-slot allocation |
| New user + new item | No data on either side | Global popularity ranking, random exploration in designated slots |
Interview application: "For new users, I would show globally popular content for the first session while collecting implicit signals (what they click, how long they watch). After 5–10 interactions, the collaborative filtering model has enough signal to begin personalization. For new items, I would use content-based features (genre, tags, embedding similarity to popular items) to seed the candidate generator, and allocate 10% of recommendation slots as 'explore' positions where new items receive guaranteed exposure."
Step 5: Data Pipeline and Model Training
Data collection: User interactions (clicks, views, watch time, purchases, skips, ratings) are logged as events and published to Kafka. A stream processor (Flink, Spark Streaming) enriches events with user and item metadata and writes them to the feature store and training data store.
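A minimal sketch of the event logging step with kafka-python; the topic name and event schema are illustrative:

```python
import json
import time
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="kafka.internal:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def log_interaction(user_id, item_id, event_type, watch_seconds=None):
    event = {
        "user_id": user_id,
        "item_id": item_id,
        "event_type": event_type,          # click, view, watch, purchase, skip
        "watch_seconds": watch_seconds,
        "timestamp_ms": int(time.time() * 1000),
    }
    producer.send("user-interactions", value=event)

log_interaction("u_123", "item_456", "watch", watch_seconds=185)
producer.flush()
```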
Offline training: The ranking model is retrained daily on the latest interaction data using Spark or distributed TensorFlow. Training data includes positive samples (items the user engaged with) and negative samples (items shown but not engaged with). The trained model is deployed to the serving infrastructure.
Online serving: At request time, the recommendation service calls the candidate generators, retrieves features from the online feature store, runs the ranking model, applies re-ranking rules, and returns the final list—all within 200ms.
| Component | Technology | Latency Budget |
|---|---|---|
| Candidate generation (ANN search) | FAISS, Milvus, or Pinecone | 10–20ms |
| Feature retrieval | Redis, DynamoDB (feature store) | 5–10ms |
| Ranking model inference | TensorFlow Serving, TorchServe (GPU) | 20–30ms |
| Re-ranking and business rules | Application logic | 5–10ms |
| Network overhead + serialization | gRPC between services | 10–20ms |
| Total | — | 50–90ms (within 200ms budget) |
For structured practice on recommendation system design and 17 other real-world case studies, Grokking the System Design Interview covers the complete design process.
For advanced ML system design patterns including two-tower architectures, feature stores, and production-scale model serving, Grokking the Advanced System Design Interview builds the depth required for L6+ interviews.
The system design interview guide provides the broader framework for approaching any system design problem.
Step 6: Monitoring and Evaluation
- Online metrics: Click-through rate (CTR), watch time, conversion rate, session duration. A/B test every model change against the current production model before full rollout.
- Offline metrics: Precision@K, Recall@K, NDCG (Normalized Discounted Cumulative Gain). These measure recommendation quality on held-out test data during model development (a short computation sketch follows this list).
- Guardrail metrics: Diversity score (how varied are recommendations?), freshness score (are new items being surfaced?), coverage (what percentage of the catalog appears in recommendations?). These prevent the model from optimizing engagement at the expense of content diversity or new creator visibility.
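A minimal sketch of the offline metrics above (Precision@K, Recall@K, and NDCG@K with binary relevance) computed for a single user against a held-out set of relevant items:

```python
import math

def precision_at_k(recommended, relevant, k):
    return sum(1 for item in recommended[:k] if item in relevant) / k

def recall_at_k(recommended, relevant, k):
    return sum(1 for item in recommended[:k] if item in relevant) / max(len(relevant), 1)

def ndcg_at_k(recommended, relevant, k):
    # Rank position i contributes 1/log2(i+2) if the item is relevant.
    dcg = sum(1 / math.log2(i + 2)
              for i, item in enumerate(recommended[:k]) if item in relevant)
    ideal = sum(1 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0

recommended = ["a", "b", "c", "d", "e"]
relevant = {"b", "e", "x"}
print(precision_at_k(recommended, relevant, 5))  # 0.4
print(recall_at_k(recommended, relevant, 5))     # ~0.67
print(ndcg_at_k(recommended, relevant, 5))       # ~0.48
```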
Common Interview Follow-Up Questions
- "How do you avoid filter bubbles?"
Allocate 10–20% of recommendation slots to exploration. Use diversity constraints in re-ranking. Monitor coverage metrics to ensure the model surfaces a broad range of content. - "How do you handle real-time context?"
Session-based signals (items viewed in the current session) are incorporated as features in the ranking model. A lightweight online model adjusts scores based on the last 5–10 interactions without retraining the full model. - "How would you scale to 1B users?"
Shard user profiles across multiple feature store nodes by user_id. Pre-compute candidate sets for users with stable preferences and cache them. Use ANN indices partitioned by content category to parallelize candidate retrieval. - "What happens if the recommendation service goes down?"
Graceful degradation: serve cached recommendations from the user's last session, or fall back to globally popular content. The user experience degrades but does not disappear.
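A minimal sketch of the user_id sharding mentioned above; the shard count and hash choice are illustrative, the key property is that a user always maps to the same shard:

```python
import hashlib

NUM_SHARDS = 64  # illustrative shard count

def shard_for(user_id: str) -> int:
    # Stable hash of the user_id so reads and writes always hit the same node.
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

print(shard_for("u_123"))  # deterministic shard index in [0, 64)
```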
Frequently Asked Questions
What is the three-stage recommendation pipeline?
Candidate generation retrieves 500–1,000 items cheaply from a catalog of millions. Ranking scores these candidates with a heavier model to predict engagement. Re-ranking applies business rules (diversity, freshness, content policy) to produce the final 10–20 items shown to the user. This is the standard architecture at Netflix, YouTube, and Spotify.
What is the difference between collaborative and content-based filtering?
Collaborative filtering recommends items based on similar users' behavior ("users who liked X also liked Y"). Content-based filtering recommends items based on attribute similarity to items the user already liked. Production systems use hybrid approaches combining both for better coverage and accuracy.
How do you solve the cold start problem?
For new users: show popular content and collect implicit signals (clicks, watch time) until personalization is possible (typically 5–10 interactions). For new items: use content-based features and allocate exploration slots for guaranteed exposure. For both simultaneously: fall back to global popularity.
What is a feature store?
Infrastructure that serves precomputed features to ML models at inference time with single-digit millisecond latency. It maintains an offline store (historical features for training) and an online store (current features for serving). Feast and Tecton are common implementations. Redis or DynamoDB typically back the online store.
How fast should recommendations be served?
Within 100–200ms (p99). The latency budget is split across candidate generation (10–20ms), feature retrieval (5–10ms), model inference (20–30ms), re-ranking (5–10ms), and network overhead (10–20ms). Pre-computation and caching are essential to meeting this budget at 100M+ DAU.
What is a two-tower model?
Separate neural networks encode user features and item features into the same embedding space. At serving time, the user embedding is computed once and an ANN search finds the nearest item embeddings. This architecture enables efficient retrieval from millions of items in milliseconds.
How do you measure recommendation quality?
Online: CTR, watch time, conversion rate, session duration via A/B testing. Offline: Precision@K, Recall@K, NDCG on held-out data. Guardrails: diversity score, freshness score, catalog coverage to prevent over-optimization on engagement alone.
How do you prevent filter bubbles?
Allocate 10–20% of recommendation slots to exploration (items outside the user's established preferences). Apply diversity constraints in re-ranking to limit genre/category concentration. Monitor catalog coverage to ensure broad content surfacing. Use multi-armed bandits to balance exploration vs exploitation.
What database should I use for the recommendation system?
Kafka for event streaming. A feature store (Redis/DynamoDB online, S3/BigQuery offline) for feature serving. A vector database (FAISS, Milvus, Pinecone) for ANN search in candidate generation. PostgreSQL or DynamoDB for item metadata. The choice depends on access pattern—each component has a different storage need.
Why do interviewers ask about recommendation systems?
Because they test the full spectrum of system design skills: data pipelines (Kafka, Spark), ML infrastructure (model training, serving), caching (Redis), scaling (sharding, ANN indices), real-time processing, and trade-off reasoning (relevance vs diversity, latency vs accuracy). It is the most comprehensive single question in system design.
TL;DR
A recommendation engine follows a three-stage funnel: candidate generation (retrieve 500–1,000 items from 10M+ catalog using two-tower models and ANN search in 10–20ms), ranking (score candidates with a neural network using 150 features from a feature store in 20–30ms), and re-ranking (apply diversity, freshness, and business rules in 5–10ms). Total latency stays under 200ms. Collaborative filtering powers "users who liked X also liked Y." Content-based filtering matches item attributes to user preferences. Hybrid approaches combine both. Solve the cold start problem with popularity defaults for new users and content-based features plus exploration slots for new items. Netflix drives 80% of viewing hours through recommendations. YouTube generates 70% of watch time. Amazon produces 35% of revenue. In interviews, discuss the full pipeline—data collection, feature engineering, model training, serving, and monitoring—not just the ML model.