How to design a personalized content recommendation engine
A recommendation engine is a system that analyzes user behavior, preferences, and content attributes to surface personalized suggestions—predicting what a user wants to see before they search for it. This is one of the most business-critical system design problems: Netflix's recommendations drive 80% of viewing hours, Amazon's engine generates 35% of revenue, and YouTube's algorithm produces 70% of total watch time. In system design interviews, the recommendation engine tests everything: data pipelines, ML infrastructure, caching, real-time serving at scale, and the ability to balance competing objectives like relevance, diversity, and freshness. Every FAANG company asks some variant of this question because it maps directly to their core product.
Key Takeaways
- The modern recommendation architecture is a multi-stage funnel: candidate generation (retrieve thousands of items cheaply) → ranking (score hundreds with heavier models) → re-ranking (apply business rules for diversity, freshness, and content policy). This three-stage pipeline is the industry standard at Netflix, YouTube, and Spotify.
- Two filtering approaches drive recommendations: collaborative filtering (users who liked X also liked Y) and content-based filtering (this item has similar attributes to items you liked). Production systems use hybrid approaches combining both.
- The cold start problem—how to recommend content to new users or surface new items without interaction history—is the most common follow-up question. Solve it with popularity-based defaults, demographic profiles, or onboarding preference surveys.
- Latency requirements are strict: recommendations must be served within 100–200ms. This demands pre-computation of candidate sets, aggressive caching, and lightweight ranking models for real-time serving.
- In interviews, discuss the entire pipeline—not just the ML model. Data collection, feature engineering, model training, serving infrastructure, and monitoring are equally important. Interviewers evaluate end-to-end systems thinking, not algorithmic knowledge alone.
Step 1: Requirements and Scope
Functional requirements:
- Personalized recommendations: Suggest content tailored to each user's preferences and behavior.
- Real-time updates: Adjust recommendations as the user interacts with the platform (watches a video, clicks an article, purchases a product).
- Multiple recommendation surfaces: Home feed, "More Like This," "Trending," "Because You Watched X."
- New content surfacing: Ensure recently added content reaches relevant users even without interaction history.
Non-functional requirements:
- Latency: Serve personalized recommendations within 200ms (p99).
- Scalability: Support 100M+ daily active users with a catalog of 10M+ items.
- Availability: 99.99% uptime; if the recommendation service fails, users see no content suggestions.
- Freshness: Incorporate user interactions from the last few minutes into recommendations.
- Relevance: Measurable improvement in engagement metrics (click-through rate, watch time, conversion rate).
Interview tip: Ask the interviewer: "What type of content are we recommending—videos, products, articles?" and "Should recommendations be personalized per user or globally ranked?" These scoping questions determine whether you need real-time user signals, collaborative filtering, or simpler popularity-based ranking.
Step 2: Back-of-Envelope Estimation
Users: 100M DAU, each requesting recommendations 10 times per day = 1B recommendation requests/day = ~11,600 QPS average, ~35,000 QPS peak.
Catalog: 10M items. Each item has a feature vector of approximately 1 KB (embeddings, metadata, interaction counts). Total catalog feature storage: ~10 GB—fits entirely in memory.
User profiles: 100M users × 5 KB per profile (interaction history, embeddings, preferences) = 500 GB. Too large for a single machine; requires sharding or a distributed feature store.
Model inference: Ranking 500 candidates per request with a lightweight model must complete within 50ms to stay within the 200ms total latency budget (leaving 150ms for candidate retrieval, feature lookup, and network overhead).
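A quick sanity check of these estimates in Python; the 3x peak-to-average ratio and the per-record sizes are the assumptions stated above:

```python
# Back-of-envelope check for the numbers above.
# Assumptions: 100M DAU, 10 requests/user/day, ~3x peak-to-average ratio.
DAU = 100_000_000
REQUESTS_PER_USER_PER_DAY = 10
SECONDS_PER_DAY = 86_400

requests_per_day = DAU * REQUESTS_PER_USER_PER_DAY            # 1B/day
avg_qps = requests_per_day / SECONDS_PER_DAY                  # ~11,600
peak_qps = avg_qps * 3                                        # ~35,000

catalog_storage_gb = 10_000_000 * 1_024 / 1e9                 # 10M items x ~1 KB ≈ 10 GB
profile_storage_gb = 100_000_000 * 5 * 1_024 / 1e9            # 100M users x ~5 KB ≈ 500 GB

print(f"avg QPS:  {avg_qps:,.0f}")
print(f"peak QPS: {peak_qps:,.0f}")
print(f"catalog feature storage: ~{catalog_storage_gb:.0f} GB")
print(f"user profile storage:    ~{profile_storage_gb:.0f} GB")
```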
Step 3: The Three-Stage Recommendation Pipeline
This funnel architecture is the industry standard used by Netflix, YouTube, Spotify, and Amazon.
Stage 1: Candidate Generation
Purpose: From a catalog of 10M items, retrieve 500–1,000 candidates that could be relevant to the user. This stage must be fast (under 50ms) and recall-oriented—it is better to include too many candidates than to miss a relevant one.
Approaches:
Collaborative filtering: Find items liked by users similar to the current user. Matrix factorization (ALS, SVD) and nearest-neighbor search in embedding space are common techniques. "Users who watched Breaking Bad also watched Better Call Saul."
Content-based filtering: Match item attributes (genre, tags, topics, embeddings) to the user's preference profile. "You watched three sci-fi thrillers this week, here are more sci-fi thrillers."
Two-tower model (industry standard in 2026): Separate neural networks encode user features and item features into the same embedding space. At serving time, the user embedding is computed once, and an approximate nearest neighbor (ANN) search finds the closest item embeddings. This is the architecture Netflix, YouTube, and Meta use for candidate generation at scale.
Multiple generators run in parallel: A production system uses 5–20 candidate generators simultaneously—one for collaborative filtering, one for content similarity, one for trending content, one for items from subscribed channels, one for geographic popularity. Results are merged and deduplicated before passing to the ranking stage.
Interview application: "I would use a two-tower model for the primary candidate generator. The user tower encodes interaction history, demographics, and context into a 128-dimensional embedding. The item tower encodes content features, popularity signals, and freshness into the same space. At serving time, I perform an ANN search using FAISS to retrieve the 500 nearest items in under 10ms. I would also run a trending generator and a subscription-based generator in parallel for diversity."
Stage 2: Ranking
Purpose: From the 500 candidates, score each one with a more computationally expensive model to predict the probability that the user will engage (click, watch, purchase). Output: a ranked list of 50–100 items.
Model architecture: A deep learning model (typically a multi-layer neural network or gradient-boosted decision tree) takes rich features as input: user features (demographics, past interactions, session context), item features (content type, recency, popularity, quality score), and cross features (user-item interaction history, time since last interaction with similar content).
Feature store: A critical infrastructure component that serves precomputed features to the ranking model at inference time with single-digit millisecond latency. A feature store (Feast, Tecton) maintains two tables: an offline store (for model training with historical features) and an online store (for real-time serving with current features in Redis or DynamoDB).
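A minimal sketch of the online read path, assuming features are kept as Redis hashes keyed by user and item IDs with JSON-encoded values; the key names and host are hypothetical:

```python
import json
import redis  # pip install redis

r = redis.Redis(host="feature-store.internal", port=6379, decode_responses=True)

def get_user_features(user_id: str) -> dict:
    # Single hash read; typically a single-digit-millisecond lookup.
    raw = r.hgetall(f"user_features:{user_id}")
    return {name: json.loads(value) for name, value in raw.items()}

def get_item_features(item_ids: list[str]) -> list[dict]:
    # Pipeline the per-item reads so 500 candidates cost one network round trip.
    pipe = r.pipeline()
    for item_id in item_ids:
        pipe.hgetall(f"item_features:{item_id}")
    return [{name: json.loads(value) for name, value in raw.items()}
            for raw in pipe.execute()]
```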
Interview application: "The ranking model is a two-layer neural network with 256 hidden units, trained on click and watch-time labels. It receives 150 features per user-item pair from the feature store: 50 user features, 50 item features, and 50 cross features. The model predicts the probability of a 30-second watch. Inference on 500 candidates takes approximately 30ms on a GPU-backed serving instance."
Stage 3: Re-Ranking
Purpose: Apply business rules, diversity constraints, and content policy filters to the ranked list before presenting it to the user. Output: the final 10–20 recommendations displayed on the screen.
What re-ranking handles:
Diversity: Prevent the list from being dominated by one genre or content type. If 8 of the top 10 ranked items are action movies, inject 2–3 items from other genres to broaden the user's experience.
Freshness: Boost recently published content to ensure new items get initial exposure and feedback, even if the model has not yet learned they are relevant.
Business rules: Promote sponsored content or premium titles in specific positions. Suppress content the user has already seen. Filter out items violating content policy.
Exploration vs exploitation: Reserve 10–20% of recommendation slots for items from categories the user has not yet engaged with, exploring new interests rather than exploiting known preferences. Multi-armed bandit algorithms (Thompson sampling, epsilon-greedy) balance this trade-off.
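A minimal re-ranking sketch combining a per-genre cap, seen-item suppression, and reserved exploration slots; the 20% explore share and the cap of 3 items per genre are assumed policy values, not fixed rules:

```python
import random
from collections import Counter

def rerank(ranked_items, seen_ids, explore_pool, slots=10,
           max_per_genre=3, explore_share=0.2):
    """ranked_items: list of dicts like {"id": ..., "genre": ...}, best first."""
    explore_slots = int(slots * explore_share)
    exploit_slots = slots - explore_slots

    final, genre_counts = [], Counter()
    for item in ranked_items:
        if len(final) == exploit_slots:
            break
        if item["id"] in seen_ids:                        # suppress already-seen content
            continue
        if genre_counts[item["genre"]] >= max_per_genre:  # diversity constraint
            continue
        final.append(item)
        genre_counts[item["genre"]] += 1

    # Fill the remaining slots with exploration candidates (fresh items or
    # genres the user has not engaged with), epsilon-greedy style.
    chosen_ids = {i["id"] for i in final}
    explore_candidates = [i for i in explore_pool
                          if i["id"] not in seen_ids and i["id"] not in chosen_ids]
    final += random.sample(explore_candidates,
                           min(explore_slots, len(explore_candidates)))
    return final
```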
Step 4: Handling the Cold Start Problem
The cold start problem is the most frequently asked follow-up question in recommendation system interviews.
| Scenario | Problem | Solution |
|---|---|---|
| New user | No interaction history to personalize from | Popularity-based defaults, demographic-based preferences, onboarding survey ("Select genres you enjoy") |
| New item | No interaction data to rank with | Content-based features (genre, tags, description embeddings), editorial boosting, explore-slot allocation |
| New user + new item | No data on either side | Global popularity ranking, random exploration in designated slots |
Interview application: "For new users, I would show globally popular content for the first session while collecting implicit signals (what they click, how long they watch). After 5–10 interactions, the collaborative filtering model has enough signal to begin personalization. For new items, I would use content-based features (genre, tags, embedding similarity to popular items) to seed the candidate generator, and allocate 10% of recommendation slots as 'explore' positions where new items receive guaranteed exposure."
Step 5: Data Pipeline and Model Training
Data collection: User interactions (clicks, views, watch time, purchases, skips, ratings) are logged as events and published to Kafka. A stream processor (Flink, Spark Streaming) enriches events with user and item metadata and writes them to the feature store and training data store.
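A minimal sketch of the event logging step with kafka-python; the topic name and event schema are illustrative:

```python
import json
import time
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="kafka.internal:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def log_interaction(user_id, item_id, event_type, watch_seconds=None):
    event = {
        "user_id": user_id,
        "item_id": item_id,
        "event_type": event_type,          # click, view, watch, purchase, skip
        "watch_seconds": watch_seconds,
        "timestamp_ms": int(time.time() * 1000),
    }
    producer.send("user-interactions", value=event)

log_interaction("u_123", "item_456", "watch", watch_seconds=185)
producer.flush()
```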
Offline training: The ranking model is retrained daily on the latest interaction data using Spark or distributed TensorFlow. Training data includes positive samples (items the user engaged with) and negative samples (items shown but not engaged with). The trained model is deployed to the serving infrastructure.
Online serving: At request time, the recommendation service calls the candidate generators, retrieves features from the online feature store, runs the ranking model, applies re-ranking rules, and returns the final list—all within 200ms.
| Component | Technology | Latency Budget |
|---|---|---|
| Candidate generation (ANN search) | FAISS, Milvus, or Pinecone | 10–20ms |
| Feature retrieval | Redis, DynamoDB (feature store) | 5–10ms |
| Ranking model inference | TensorFlow Serving, TorchServe (GPU) | 20–30ms |
| Re-ranking and business rules | Application logic | 5–10ms |
| Network overhead + serialization | gRPC between services | 10–20ms |
| Total | — | 50–90ms (within 200ms budget) |
For structured practice on recommendation system design and 17 other real-world case studies, Grokking the System Design Interview covers the complete design process.
For advanced ML system design patterns including two-tower architectures, feature stores, and production-scale model serving, Grokking the Advanced System Design Interview builds the depth required for L6+ interviews.
The system design interview guide provides the broader framework for approaching any system design problem.
Step 6: Monitoring and Evaluation
- Online metrics: Click-through rate (CTR), watch time, conversion rate, session duration. A/B test every model change against the current production model before full rollout.
- Offline metrics: Precision@K, Recall@K, NDCG (Normalized Discounted Cumulative Gain). These measure recommendation quality on held-out test data during model development (a short computation sketch follows this list).
- Guardrail metrics: Diversity score (how varied are recommendations?), freshness score (are new items being surfaced?), coverage (what percentage of the catalog appears in recommendations?). These prevent the model from optimizing engagement at the expense of content diversity or new creator visibility.
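A minimal sketch of the offline metrics above (Precision@K, Recall@K, and NDCG@K with binary relevance) computed for a single user against a held-out set of relevant items:

```python
import math

def precision_at_k(recommended, relevant, k):
    return sum(1 for item in recommended[:k] if item in relevant) / k

def recall_at_k(recommended, relevant, k):
    return sum(1 for item in recommended[:k] if item in relevant) / max(len(relevant), 1)

def ndcg_at_k(recommended, relevant, k):
    # Rank position i contributes 1/log2(i+2) if the item is relevant.
    dcg = sum(1 / math.log2(i + 2)
              for i, item in enumerate(recommended[:k]) if item in relevant)
    ideal = sum(1 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0

recommended = ["a", "b", "c", "d", "e"]
relevant = {"b", "e", "x"}
print(precision_at_k(recommended, relevant, 5))  # 0.4
print(recall_at_k(recommended, relevant, 5))     # ~0.67
print(ndcg_at_k(recommended, relevant, 5))       # ~0.48
```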
Common Interview Follow-Up Questions
- "How do you avoid filter bubbles?"
Allocate 10–20% of recommendation slots to exploration. Use diversity constraints in re-ranking. Monitor coverage metrics to ensure the model surfaces a broad range of content. - "How do you handle real-time context?"
Session-based signals (items viewed in the current session) are incorporated as features in the ranking model. A lightweight online model adjusts scores based on the last 5–10 interactions without retraining the full model. - "How would you scale to 1B users?"
Shard user profiles across multiple feature store nodes by user_id. Pre-compute candidate sets for users with stable preferences and cache them. Use ANN indices partitioned by content category to parallelize candidate retrieval. - "What happens if the recommendation service goes down?"
Graceful degradation: serve cached recommendations from the user's last session, or fall back to globally popular content. The user experience degrades but does not disappear.
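A minimal sketch of the user_id sharding mentioned above; the shard count and hash choice are illustrative, the key property is that a user always maps to the same shard:

```python
import hashlib

NUM_SHARDS = 64  # illustrative shard count

def shard_for(user_id: str) -> int:
    # Stable hash of the user_id so reads and writes always hit the same node.
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

print(shard_for("u_123"))  # deterministic shard index in [0, 64)
```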
Frequently Asked Questions
What is the three-stage recommendation pipeline?
Candidate generation retrieves 500–1,000 items cheaply from a catalog of millions. Ranking scores these candidates with a heavier model to predict engagement. Re-ranking applies business rules (diversity, freshness, content policy) to produce the final 10–20 items shown to the user. This is the standard architecture at Netflix, YouTube, and Spotify.
What is the difference between collaborative and content-based filtering?
Collaborative filtering recommends items based on similar users' behavior ("users who liked X also liked Y"). Content-based filtering recommends items based on attribute similarity to items the user already liked. Production systems use hybrid approaches combining both for better coverage and accuracy.
How do you solve the cold start problem?
For new users: show popular content and collect implicit signals (clicks, watch time) until personalization is possible (typically 5–10 interactions). For new items: use content-based features and allocate exploration slots for guaranteed exposure. For both simultaneously: fall back to global popularity.
What is a feature store?
Infrastructure that serves precomputed features to ML models at inference time with single-digit millisecond latency. It maintains an offline store (historical features for training) and an online store (current features for serving). Feast and Tecton are common implementations. Redis or DynamoDB typically back the online store.
How fast should recommendations be served?
Within 100–200ms (p99). The latency budget is split across candidate generation (10–20ms), feature retrieval (5–10ms), model inference (20–30ms), re-ranking (5–10ms), and network overhead (10–20ms). Pre-computation and caching are essential to meeting this budget at 100M+ DAU.
What is a two-tower model?
Separate neural networks encode user features and item features into the same embedding space. At serving time, the user embedding is computed once and an ANN search finds the nearest item embeddings. This architecture enables efficient retrieval from millions of items in milliseconds.
How do you measure recommendation quality?
Online: CTR, watch time, conversion rate, session duration via A/B testing. Offline: Precision@K, Recall@K, NDCG on held-out data. Guardrails: diversity score, freshness score, catalog coverage to prevent over-optimization on engagement alone.
How do you prevent filter bubbles?
Allocate 10–20% of recommendation slots to exploration (items outside the user's established preferences). Apply diversity constraints in re-ranking to limit genre/category concentration. Monitor catalog coverage to ensure broad content surfacing. Use multi-armed bandits to balance exploration vs exploitation.
What database should I use for the recommendation system?
Kafka for event streaming. A feature store (Redis/DynamoDB online, S3/BigQuery offline) for feature serving. A vector database (FAISS, Milvus, Pinecone) for ANN search in candidate generation. PostgreSQL or DynamoDB for item metadata. The choice depends on access pattern—each component has a different storage need.
Why do interviewers ask about recommendation systems?
Because they test the full spectrum of system design skills: data pipelines (Kafka, Spark), ML infrastructure (model training, serving), caching (Redis), scaling (sharding, ANN indices), real-time processing, and trade-off reasoning (relevance vs diversity, latency vs accuracy). It is the most comprehensive single question in system design.
TL;DR
A recommendation engine follows a three-stage funnel: candidate generation (retrieve 500–1,000 items from 10M+ catalog using two-tower models and ANN search in 10–20ms), ranking (score candidates with a neural network using 150 features from a feature store in 20–30ms), and re-ranking (apply diversity, freshness, and business rules in 5–10ms). Total latency stays under 200ms. Collaborative filtering powers "users who liked X also liked Y." Content-based filtering matches item attributes to user preferences. Hybrid approaches combine both. Solve the cold start problem with popularity defaults for new users and content-based features plus exploration slots for new items. Netflix drives 80% of viewing hours through recommendations. YouTube generates 70% of watch time. Amazon produces 35% of revenue. In interviews, discuss the full pipeline—data collection, feature engineering, model training, serving, and monitoring—not just the ML model.