Framework for approaching an ambiguous system design problem
A system design problem framework is a repeatable, step-by-step method for breaking down any open-ended architecture question into manageable pieces: scoping the problem, estimating scale, designing components, making trade-off decisions, and addressing failure modes.
The best frameworks turn ambiguity — which is deliberately baked into every system design interview — from a threat into an opportunity to demonstrate engineering judgment.
Key Takeaways
- Ambiguity in system design questions is intentional. The interviewer wants to see how you handle undefined requirements, not whether you've memorized a specific architecture.
- A strong framework has five phases: Scope, Estimate, Design, Deep Dive, and Evaluate. Each phase has a clear goal and exit condition.
- The framework is not a script. It's a thinking structure that keeps you organized while leaving room for the conversation to go wherever the interviewer steers it.
- The biggest differentiator between average and excellent candidates is trade-off articulation — the ability to name two valid options, compare them on concrete axes, and commit to one with reasoning.
- Most senior engineers who interview well use some version of this framework, whether they learned it explicitly or developed it through years of practice.
Why You Need a Framework for Ambiguous System Design Problems
When an interviewer says "Design Twitter," they haven't told you:
- Which features of Twitter? The timeline? Direct messages? Search? Trending topics? All of them?
- What scale? 10,000 users or 500 million?
- What matters more — consistency or availability?
- Is this a greenfield build or an extension of an existing system?
- What's the latency budget? The cost budget?
This ambiguity is the test.
The interviewer didn't forget to specify these details. They left them out to see whether you'll ask smart questions, state reasonable assumptions, and make decisions under uncertainty — the same skills you use daily as a working engineer.
Without a framework, most candidates do one of two things: freeze up and ask 15 minutes of clarifying questions, or rush to draw boxes without understanding what they're building. Both are failing patterns.
A good system design problem framework gives you a third option: move through the ambiguity systematically, making it smaller at each step until you're designing a well-defined system with clear constraints.
If you're building this skill from scratch, the Grokking System Design Fundamentals course teaches these foundational thinking patterns with structured, progressive exercises.
The Five-Phase Framework
This framework works for any system design problem — from "Design a URL shortener" to "Design YouTube's recommendation engine." The phases are sequential, but you'll often loop back as you learn more about the problem.
Phase 1: Scope and Requirements (5–7 Minutes)
Goal: Turn a vague prompt into a well-defined problem with clear boundaries.
This phase has two jobs. First, identify which features you're designing. Second, define the non-functional requirements that constrain your architecture.
Functional scoping means asking: what does this system actually do? For "Design Twitter," you might narrow to: (1) users post tweets, (2) users follow other users, (3) users see a home timeline of tweets from people they follow, and (4) users can search tweets. That's four features. You've explicitly excluded DMs, trending topics, ads, notifications, and media upload — and you've told the interviewer why.
The scoping question that separates strong candidates from average ones is: "What are the two or three most important use cases for this system?" This forces prioritization. You can't design everything in 45 minutes, and demonstrating that you know this is a signal of engineering maturity.
Non-functional requirements define the system's quality attributes. The critical ones for almost every system design problem are:
| Requirement | What to Ask | Why It Matters |
|---|---|---|
| Scale | How many users? How many requests per second? How much data? | Determines whether you need sharding, caching, CDNs |
| Latency | What's the acceptable response time? p50 vs p99? | Drives caching strategy, database choice, async vs sync |
| Availability | What's the uptime target? 99.9% vs 99.99%? | Determines replication strategy, failover design |
| Consistency | Is eventual consistency acceptable? | Drives database choice, cache invalidation, conflict resolution |
| Durability | Can we afford to lose data? | Determines write-ahead logging, replication factor |
How to exit this phase: You should have 3–5 functional requirements written down, 3–4 non-functional constraints quantified (even roughly), and the interviewer should have nodded or corrected your assumptions. If you haven't written anything down, you haven't finished scoping.
Example output for "Design a News Feed":
Functional: (1) Users create posts (text + images). (2) Users follow other users. (3) Users see a chronological feed of posts from followed users. (4) Feed supports pagination.
Non-functional: 500M users, 10M daily active. 100K new posts/day. Feed load latency < 200ms at p99. High availability (99.99%). Eventual consistency acceptable for feed (a few seconds of delay is fine).
That paragraph takes about 90 seconds to say out loud, and now you have a design target.
Phase 2: Capacity Estimation (3–5 Minutes)
Goal: Put numbers on your constraints so your design decisions have quantitative backing.
Back-of-envelope math is not about getting exact numbers. It's about getting the right order of magnitude so you can make informed choices about storage, bandwidth, and whether you need horizontal scaling.
The three numbers that matter for most problems:
- Queries per second (QPS). For a news feed with 10M daily active users, each loading their feed 10 times a day: 10M × 10 / 86,400 ≈ 1,150 QPS for feed reads. Manageable for a single database. But if each feed load requires fetching from 200 followed users? Now you need to think about pre-computation or caching.
- Storage. 100K new posts per day × 1KB average post size = 100MB per day, or about 36GB per year. Small. But if each post has a 2MB image? Now it's 200GB per day, or 73TB per year. That changes your storage architecture entirely.
- Bandwidth. If each feed response is 50KB (20 posts × 2.5KB each) and you serve 1,150 feeds per second, that's about 57MB/s outbound. Comfortable for a single server, but you'll want a CDN for images.
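These three estimates are just a few multiplications, which makes them easy to sanity-check in a script. A quick sketch using the illustrative news-feed numbers from above:

```python
# Back-of-envelope estimates for the news feed example above.
# All inputs are rough order-of-magnitude figures, not measurements.

SECONDS_PER_DAY = 86_400

daily_active_users = 10_000_000
feed_loads_per_user_per_day = 10
posts_per_day = 100_000
avg_post_bytes = 1_000          # ~1 KB of text per post
feed_response_bytes = 50_000    # 20 posts x 2.5 KB each

read_qps = daily_active_users * feed_loads_per_user_per_day / SECONDS_PER_DAY
storage_per_day_mb = posts_per_day * avg_post_bytes / 1e6
bandwidth_mb_per_s = read_qps * feed_response_bytes / 1e6

print(f"Feed read QPS:      {read_qps:,.0f}")            # ~1,157
print(f"Post storage/day:   {storage_per_day_mb:,.0f} MB")  # ~100 MB
print(f"Outbound bandwidth: {bandwidth_mb_per_s:,.0f} MB/s") # ~58 MB/s
```

The point is not the script itself but the habit: any estimate you state in the interview should be reproducible as two or three lines of arithmetic like this.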
Common estimation shortcuts:
| Fact | Value | Useful For |
|---|---|---|
| Seconds in a day | ~86,400 (~100K for rough math) | Converting daily users to QPS |
| 1 million seconds | ~11.5 days | Relating time to scale |
| 1 billion seconds | ~31 years | Estimating long-term storage |
| 1 character | 1 byte (ASCII); 1–4 bytes (UTF-8) | Text storage |
| SSD random read | ~100 microseconds | Latency budgets |
| Network round trip (same data center) | ~0.5ms | Multi-service latency |
| Network round trip (cross-continent) | ~150ms | Multi-region design |
How to exit this phase: You should know whether your system is read-heavy or write-heavy, whether a single machine can handle the load or you need horizontal scaling, and roughly how much storage you need. These numbers directly feed into your design decisions in Phase 3.
Phase 3: High-Level Design (10–12 Minutes)
Goal: Produce a coherent architecture diagram where every component earns its place.
Start from the user and trace the request path. Draw the simplest possible system that satisfies your requirements, then add complexity only where the numbers from Phase 2 demand it.
The universal starting skeleton:
Client → Load Balancer → API Gateway → Application Service(s) → Database
Every system starts here. Then you ask: what does the load from Phase 2 tell me I need to add?
- Read-heavy system (100:1 read-to-write ratio)? Add a cache layer (Redis, Memcached) between the application and database. Add a CDN for static content.
- Write-heavy system? Add a message queue (Kafka, SQS) to absorb write spikes and process them asynchronously.
- Large-scale data? Partition the database. Choose a partitioning strategy (hash-based, range-based) based on access patterns.
- Low-latency requirement? Pre-compute results and store them. For a news feed, this means fanout-on-write: when a user posts, push the post to all followers' pre-built feeds.
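For the read-heavy case, the cache layer usually follows the cache-aside pattern: check the cache, fall back to the database on a miss, then populate the cache for the next reader. A minimal in-process sketch, where plain dicts stand in for Redis and the database:

```python
# Cache-aside read path. The dicts are stand-ins for Redis/Memcached
# and the primary database; names here are illustrative only.
cache = {}
db = {"user:42": {"name": "alice"}}

def get_user(key):
    value = cache.get(key)
    if value is not None:
        return value               # cache hit: no database round trip
    value = db.get(key)            # cache miss: read from the database
    if value is not None:
        cache[key] = value         # populate so subsequent reads are hits
    return value

get_user("user:42")   # miss: reads the DB and fills the cache
get_user("user:42")   # hit: served from the cache
```

In a real system you would also set a TTL and decide on an invalidation strategy when the underlying row changes, which is exactly the kind of trade-off the deep-dive phase explores.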
The most common mistake in this phase is adding components without justification. If you draw a Kafka box, you need to explain what problem it solves. "I'm adding Kafka because write spikes from viral posts could exceed our database's write throughput. The queue buffers those writes and lets us process them at a sustainable rate." That's a justified component. "I'm adding Kafka because distributed systems use message queues" is a memorized pattern, not an engineering decision.
API design tip: Before or while drawing your architecture, sketch 2–3 key API endpoints. This forces clarity about what the system actually does.
POST /v1/posts → Create a new post
GET /v1/feed?user_id=X&cursor=Y → Get paginated feed
POST /v1/follow → Follow a user
Defining these endpoints also signals to the interviewer that you think about systems from the user's perspective, not just from the infrastructure perspective.
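To make the contract concrete, the three endpoints above can be sketched as hypothetical in-memory handlers. A real service would sit behind a web framework; this sketch (using fanout-on-read for simplicity, with made-up function names) only pins down the request and response shapes:

```python
import itertools

# In-memory stand-ins for the posts and follows stores.
posts = {}          # post_id -> post record
follows = set()     # (follower_id, followee_id) pairs
_ids = itertools.count(1)

def create_post(author_id, content):            # POST /v1/posts
    post_id = next(_ids)
    posts[post_id] = {"post_id": post_id, "author_id": author_id,
                      "content": content}
    return {"post_id": post_id}

def follow(follower_id, followee_id):           # POST /v1/follow
    follows.add((follower_id, followee_id))
    return {"ok": True}

def get_feed(user_id, cursor=0, limit=20):      # GET /v1/feed
    followees = {b for a, b in follows if a == user_id}
    feed = [p for p in posts.values() if p["author_id"] in followees]
    feed.sort(key=lambda p: p["post_id"], reverse=True)  # newest first
    page = feed[cursor:cursor + limit]
    return {"posts": page, "next_cursor": cursor + len(page)}
```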
Phase 4: Deep Dive (12–18 Minutes)
Goal: Demonstrate technical depth on 1–2 critical components by exploring data models, algorithms, trade-offs, and failure modes.
This phase is where you win or lose the interview. The high-level design shows you can think architecturally. The deep dive shows you can build.
You won't get to choose which area the interviewer probes — but you can influence it. During Phase 3, flag the interesting trade-offs: "There's a key decision in how we handle feed generation — push vs pull — that I'd like to explore." Most interviewers will follow your lead if the topic is genuinely meaty.
How to structure a deep dive:
Step 1: State the options. "For feed generation, we have two approaches. Fanout-on-write: when a user posts, we immediately push the post into every follower's feed cache. Fanout-on-read: when a user opens their feed, we pull the latest posts from all followed users in real time."
Step 2: Compare on concrete axes. Don't say "it depends." Compare on specific, measurable dimensions:
| Dimension | Fanout-on-Write (Push) | Fanout-on-Read (Pull) |
|---|---|---|
| Feed read latency | Very low (pre-computed) | Higher (computed on request) |
| Write cost | High for users with millions of followers | Low (no write amplification) |
| Storage | More (every follower stores a copy) | Less (no pre-computed feeds) |
| Staleness | Can lag briefly while fan-out jobs run | Always fresh (computed at read time) |
| Complexity | Higher (need to manage fan-out jobs) | Simpler write path |
Step 3: Make a decision and justify it. "For most users, fanout-on-write gives us the sub-200ms feed latency we need. But for celebrity users with 10M+ followers, a single post would trigger 10 million write operations — that's unsustainable. So I'd use a hybrid: fanout-on-write for normal users, fanout-on-read for users above a follower threshold, say 100,000. Twitter actually uses this hybrid model in production."
That answer demonstrates three things interviewers value: you know both approaches, you compared them on real axes, and you made a practical decision that handles the hard edge case. The specific reference to Twitter's production architecture adds credibility.
Step 4: Address failure modes. "What happens if the fanout job fails halfway through? We'd process fans in batches with checkpointing. If a batch fails, we retry from the last checkpoint. Some followers might see the post a few seconds late, which is within our eventual consistency requirement."
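The hybrid push/pull decision plus checkpointed batching can be sketched in a few lines. The follower threshold, batch size, and function names are illustrative, and a real system would persist the checkpoint durably:

```python
# Sketch of hybrid fan-out with checkpointed batches.
CELEBRITY_THRESHOLD = 100_000   # illustrative cutoff, not a tuned value
BATCH_SIZE = 1_000

feed_cache = {}     # user_id -> list of post_ids (pre-computed feeds)
checkpoints = {}    # post_id -> index of the next unprocessed follower

def fan_out(post_id, author_followers):
    if len(author_followers) > CELEBRITY_THRESHOLD:
        return "pull"   # celebrity: followers fetch this post at read time

    # Resume from the last checkpoint if a previous attempt died midway;
    # followers in already-completed batches are not written twice.
    start = checkpoints.get(post_id, 0)
    for i in range(start, len(author_followers), BATCH_SIZE):
        for follower in author_followers[i:i + BATCH_SIZE]:
            feed_cache.setdefault(follower, []).insert(0, post_id)
        checkpoints[post_id] = i + BATCH_SIZE  # durable store in production
    return "push"
```

A retry after a crash simply calls `fan_out` again with the same arguments; the checkpoint makes the job resumable, and any followers served late stay within the eventual-consistency budget.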
Data model design is another common deep-dive target. For a news feed:
users: { user_id (PK), username, created_at }
posts: { post_id (PK), author_id (FK), content, media_url, created_at }
follows: { follower_id, followee_id, created_at } — composite PK
feed: { user_id, post_id, created_at } — pre-computed feed entries
State why you chose each structure. The feed table exists because you chose fanout-on-write. The follows table uses a composite primary key because the primary query is "does user A follow user B?" — a point lookup, not a scan.
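The composite-key reasoning is easy to demonstrate with an in-memory stand-in for the follows table: keyed by the (follower, followee) pair, the dominant query becomes a constant-time membership check rather than a scan.

```python
# In-memory stand-in for the follows table, keyed by the composite
# (follower_id, followee_id) pair. Rows here are illustrative.
follows = {(1, 2), (1, 3), (4, 2)}

def is_following(follower_id, followee_id):
    # O(1) point lookup on the composite key -- no table scan.
    return (follower_id, followee_id) in follows

is_following(1, 2)   # True
is_following(2, 1)   # False: following is directional
```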
Phase 5: Evaluate and Extend (3–5 Minutes)
Goal: Show engineering maturity by identifying bottlenecks, monitoring gaps, and future extensions.
This phase is your closing argument. Address three things:
Bottlenecks: "The fan-out service is our biggest bottleneck. A celebrity posting could generate millions of write operations. I'd monitor the fan-out queue depth and auto-scale workers based on queue lag."
Monitoring: "I'd track feed latency at p50/p99, fan-out job completion time, cache hit rate, and database replication lag. If cache hit rate drops below 95%, that's an early warning that our pre-computation strategy isn't covering enough requests."
Extensions the interviewer might ask about:
- "How would you add real-time notifications?" → WebSocket connections with a pub/sub layer.
- "How would you handle multi-region?" → Leader-follower replication with regional read replicas; writes route to the nearest leader.
- "How would you add content ranking?" → Replace chronological feed with a scoring model; pre-compute scores during fan-out.
You won't fully design these extensions. The interviewer wants to see that you can identify the next set of challenges and have a directional answer for each.
The Framework in Action: Complete Walkthrough
Here's the framework applied to "Design a Rate Limiter" in compressed form:
Phase 1 — Scope: Rate limiter for an API gateway. Limit by user ID. Return HTTP 429 when exceeded. Rules: 100 requests per minute per user. Must work across distributed servers (not per-server limiting). Must add less than 5ms of latency to each request.
Phase 2 — Estimate: 10M active users, average 20 requests/minute each = ~3.3M requests/second total. Counter storage: 10M users × 16 bytes per counter = ~160MB, which fits comfortably in memory on a single Redis instance — though at ~3.3M checks per second you'd shard the counters across a Redis cluster for throughput.
Phase 3 — Design: API Gateway → Rate Limiter Service → Redis (counter store) → Backend Services. Each request checks Redis before forwarding. If the counter exceeds 100, return 429. Use a sliding window algorithm for smooth limiting.
Phase 4 — Deep Dive: Compare rate limiting algorithms:
| Algorithm | Accuracy | Memory | Complexity | Used By |
|---|---|---|---|---|
| Fixed Window Counter | Low (burst at window edges) | Very Low | Simple | Basic APIs |
| Sliding Window Log | High | High (stores every timestamp) | Medium | Stripe |
| Sliding Window Counter | High (approximated) | Low | Medium | Cloudflare |
| Token Bucket | High | Low | Medium | AWS API Gateway, Stripe |
| Leaky Bucket | Smoothest output | Low | Medium | Network traffic shaping |
Decision: Token bucket. It handles burst traffic naturally (bucket can accumulate tokens), uses minimal memory (two numbers per user: token count and last refill timestamp), and is the industry standard — AWS API Gateway and Stripe both use it.
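A token bucket really does need only those two numbers per user. A minimal sketch, using the 100-requests-per-minute rule from the scope phase (constants and names are illustrative):

```python
import time

# Minimal token bucket: 100 requests/minute, bursts up to CAPACITY.
# Exactly two numbers stored per user: token count and last refill time.
CAPACITY = 100            # maximum burst size
REFILL_RATE = 100 / 60.0  # tokens added per second (100 per minute)

buckets = {}  # user_id -> (tokens, last_refill_timestamp)

def allow(user_id, now=None):
    now = time.monotonic() if now is None else now
    tokens, last = buckets.get(user_id, (CAPACITY, now))
    # Refill proportionally to elapsed time, capped at the bucket capacity.
    tokens = min(CAPACITY, tokens + (now - last) * REFILL_RATE)
    if tokens >= 1:
        buckets[user_id] = (tokens - 1, now)
        return True           # request admitted
    buckets[user_id] = (tokens, now)
    return False              # over the limit: respond with HTTP 429
```

In the distributed design, this logic would live in a Lua script or atomic operation against the shared counter store so that concurrent gateway nodes can't double-spend tokens.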
Phase 5 — Evaluate: Single point of failure is Redis. Solution: Redis Cluster with automatic failover. Race condition on counter updates: use Redis MULTI/EXEC or Lua scripts for atomic operations. Extension: add per-endpoint rate limiting, not just per-user.
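The race condition exists because "read, check, increment" is three separate steps: two concurrent requests can both read count = 99 and both pass a non-atomic check. In Redis you'd close that window with MULTI/EXEC or a Lua script; the same atomicity requirement can be illustrated in-process with a lock (a sketch, not the Redis mechanism itself):

```python
import threading

# Check-and-increment must be a single atomic step. A lock provides
# that atomicity in-process; a Lua script provides it in Redis.
LIMIT = 100
counts = {}
_lock = threading.Lock()

def try_request(user_id):
    with _lock:                      # one check-and-increment at a time
        count = counts.get(user_id, 0)
        if count >= LIMIT:
            return False             # limit reached: respond with 429
        counts[user_id] = count + 1
        return True
```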
How the Framework Changes by Seniority Level
The same five phases apply at every level. What changes is the depth and breadth of what's expected in each phase.
| Phase | Junior (L3–L4) | Senior (L5) | Staff+ (L6+) |
|---|---|---|---|
| Scope | Ask basic functional questions | Identify the hardest constraints proactively | Challenge assumptions, reframe the problem |
| Estimate | Know the formulas, get the magnitude right | Derive estimates from stated requirements without prompting | Use estimates to pre-emptively eliminate naive approaches |
| Design | Produce a reasonable architecture with basic justification | Justify every component with quantitative reasoning | Design for organizational concerns (team ownership, deploy boundaries) |
| Deep Dive | Explain how components work | Derive solutions from first principles, analyze failure modes | Propose novel approaches, reason about cross-system impacts |
| Evaluate | Mention that monitoring is important | Specify concrete metrics and alerting thresholds | Identify operational risks, rollout strategies, migration paths |
At the senior level and above, you should be driving the conversation with minimal prompting. The interviewer shouldn't have to say "what about failure modes?" — you should address them before being asked. For structured practice at this depth, the Grokking the Advanced System Design Interview course covers complex scenarios where the framework must flex to handle multi-system dependencies and organizational constraints.
Trade-Off Analysis: The Core Skill Inside the Framework
If there's one skill that the framework is designed to exercise, it's trade-off analysis. Every phase requires decisions, and every decision involves trade-offs.
A weak trade-off analysis sounds like: "We could use SQL or NoSQL. I'll use NoSQL because it scales better."
A strong trade-off analysis sounds like: "Our access pattern is key-value lookups with no joins — that favors a NoSQL store like DynamoDB. But we also need to query posts by timestamp within a user's feed, which is a range query. DynamoDB supports this with sort keys on the partition. If we needed complex relational queries — say, for an analytics dashboard — I'd add a separate OLAP store rather than forcing the read path into a relational model."
The difference: the strong version names the access pattern, explains why it maps to a specific technology, acknowledges the limitation, and proposes how to handle the edge case. This is what interviewers mean when they say they're evaluating "trade-off thinking."
Framework for any trade-off decision:
- Name the decision point explicitly ("I need to choose between X and Y").
- State the axes that matter for this specific problem (latency, cost, complexity, consistency).
- Compare the options on those axes with specifics.
- Make a choice and state the one-line reason.
- Acknowledge what you're giving up.
If you apply this five-step pattern to every decision in your design, your interview answer will be dramatically stronger than most candidates — because most candidates just state their choice without the comparison.
Handling Curveballs Within the Framework
The framework handles curveballs by design. When an interviewer throws a wrench — "Now assume the system needs to work across three continents" or "What if your primary database goes down?" — you don't need a new framework. You apply the same phases:
- Re-scope: "Multi-region changes our consistency model. I'll assume eventual consistency across regions is acceptable, with strong consistency within a region."
- Re-estimate: "Cross-continent latency is ~150ms. If a user in Europe hits a US database, we can't meet our 200ms latency budget. We need regional read replicas."
- Re-design: Adjust the architecture to include regional deployments.
- Deep dive into the new constraint: conflict resolution, replication lag, failover routing.
- Evaluate the new failure modes introduced by the multi-region setup.
The framework doesn't prevent curveballs. It gives you a systematic way to absorb and respond to them without losing your composure or your structure. The Ultimate System Design Interview Guide (2026) includes walkthroughs of exactly these kinds of mid-interview pivots across 20+ practice problems.
Anti-Patterns: What the Framework Protects You From
The Kitchen Sink. Drawing every component you know — Kafka, Redis, Elasticsearch, a GraphQL gateway, a service mesh — without justifying any of them. The framework forces justification in Phase 3 because every component must trace back to a requirement from Phase 1 or a number from Phase 2.
The Monologue. Talking for 20 minutes straight without checking in. The framework's phase transitions are natural checkpoints to ask the interviewer: "Does this scope make sense?" (end of Phase 1), "Should I go deeper on the data model or the caching layer?" (start of Phase 4).
The Perfect Design. Trying to solve every edge case instead of making forward progress. The framework pushes you to state assumptions, move on, and revisit only if the interviewer asks.
The Blank Stare. Not knowing where to start when the question is vague. The framework gives you a first move that always works: "Let me start by understanding the requirements." That sentence has never been wrong.
The Memorized Solution. Reciting a design you've seen before without adapting it to the current constraints. The framework forces adaptation because Phase 1 and Phase 2 produce different numbers and priorities for every problem, even problems with the same title.
FAQ: System Design Problem Framework
What is the best framework for system design interviews?
The most widely recommended framework follows five phases: scope the problem by clarifying requirements, estimate capacity to quantify constraints, design a high-level architecture with justified components, deep dive into 1–2 critical areas with trade-off analysis, and evaluate by identifying bottlenecks and extensions. This structure works for any problem because it separates the thinking process from the specific domain.
How do you approach an ambiguous system design question?
Start by acknowledging the ambiguity explicitly, then reduce it systematically. Ask 3–5 targeted questions to define functional scope (what does the system do?) and non-functional constraints (at what scale, latency, and availability?). State your assumptions out loud and write them down. This transforms an ambiguous question into a well-defined design target within the first five minutes.
How long should I spend on requirements in a system design interview?
Spend 5–7 minutes in a 45-minute session and 7–10 minutes in a 60-minute session. The goal is to define enough scope to start designing confidently, not to eliminate all ambiguity. If you're still asking questions at minute 10, you're likely overthinking — state your assumptions and move to the high-level design.
What are the most common system design trade-offs interviewers test?
The six trade-offs that appear in nearly every system design interview are: consistency vs. availability, latency vs. throughput, read optimization vs. write optimization, SQL vs. NoSQL, push vs. pull architectures, and monolith vs. microservices. For each, you should be able to name when each side is preferable and give a real-world example.
How do you handle it when the interviewer redirects you mid-design?
Treat redirects as new constraints, not corrections. Re-scope the specific area they're pointing to, update your estimates if the numbers change, and adjust your design. The five-phase framework is recursive — you can apply it at any granularity. A redirect to "go deeper on the caching layer" means: scope the cache requirements, estimate cache hit rates and memory needs, design the cache topology, and analyze trade-offs.
Should I always do back-of-envelope calculations in system design interviews?
Yes, but keep them quick — 2–3 minutes maximum. You don't need precise numbers. You need the right order of magnitude to justify decisions. "We need roughly 10,000 QPS, which a single Postgres instance can handle" is useful. "We need exactly 11,574 QPS" is wasted precision. The goal is to make data-driven design choices, not to demonstrate arithmetic.
What if I don't know a technology the interviewer asks about?
Be honest: "I haven't worked with that specific technology, but based on what I know about similar systems, here's how I'd reason about it." Then apply first principles. If the interviewer asks about Cassandra and you've only used DynamoDB, explain the properties you'd need (wide-column store, tunable consistency, partition-based distribution) and reason from there. Honesty plus first-principles reasoning always beats a bluff.
How do I practice the framework effectively?
Pick a system design problem, set a 45-minute timer, and work through all five phases — speaking out loud as if you're in an interview. Record yourself. Afterward, compare your design to published solutions and note which phases felt weakest. Practice that specific phase on the next problem. After 10–15 problems, the framework should feel automatic.
Does this framework work for object-oriented design interviews too?
The scoping and trade-off phases transfer directly. The design phase shifts from architecture diagrams to class diagrams and API contracts, and the deep dive focuses on design patterns and SOLID principles rather than distributed systems concepts. The core thinking structure — reduce ambiguity, make justified decisions, evaluate trade-offs — is universal.
How does the framework differ for senior vs. junior candidates?
Junior candidates use the framework as a guide: the phases tell them what to do next. Senior candidates use it as a scaffold: the phases ensure they don't miss anything, but they spend less time on basics (scoping, estimation) and more time on depth (novel trade-offs, failure analysis, organizational concerns). Staff-level candidates often reshape the framework itself — reframing the problem before designing, or proposing multiple architectures and comparing them.
TL;DR
The system design problem framework has five phases: (1) Scope — clarify functional and non-functional requirements in 5–7 minutes; (2) Estimate — do back-of-envelope math on QPS, storage, and bandwidth; (3) Design — draw a high-level architecture where every component traces back to a requirement; (4) Deep Dive — go deep on 1–2 components with structured trade-off analysis (state options, compare on concrete axes, decide, acknowledge what you're giving up); (5) Evaluate — identify bottlenecks, monitoring needs, and extensions. Spend 35–40% of your time on the deep dive. The framework's core purpose is turning ambiguity into structured decisions — the exact skill system design interviews test.
Further Reading
- Martin Kleppmann, Designing Data-Intensive Applications — Chapter 1 covers the foundational trade-offs (reliability, scalability, maintainability) that this framework operationalizes.
- Google SRE Book, Chapter 3: Embracing Risk — how Google quantifies availability targets, directly applicable to the Scope phase.
- AWS Well-Architected Framework — Amazon's production framework for evaluating architectures, structured around pillars that map to Phase 5 (Evaluate).