Framework for approaching an ambiguous system design problem
A system design problem framework is a repeatable, step-by-step method for breaking down any open-ended architecture question into manageable pieces: scoping the problem, estimating scale, designing components, making trade-off decisions, and addressing failure modes.
The best frameworks turn ambiguity — which is deliberately baked into every system design interview — from a threat into an opportunity to demonstrate engineering judgment.
Key Takeaways
- Ambiguity in system design questions is intentional. The interviewer wants to see how you handle undefined requirements, not whether you've memorized a specific architecture.
- A strong framework has five phases: Scope, Estimate, Design, Deep Dive, and Evaluate. Each phase has a clear goal and exit condition.
- The framework is not a script. It's a thinking structure that keeps you organized while leaving room for the conversation to go wherever the interviewer steers it.
- The biggest differentiator between average and excellent candidates is trade-off articulation — the ability to name two valid options, compare them on concrete axes, and commit to one with reasoning.
- Most senior engineers who interview well use some version of this framework, whether they learned it explicitly or developed it through years of practice.
Why You Need a Framework for Ambiguous System Design Problems
When an interviewer says "Design Twitter," they haven't told you:
- Which features of Twitter? The timeline? Direct messages? Search? Trending topics? All of them?
- What scale? 10,000 users or 500 million?
- What matters more — consistency or availability?
- Is this a greenfield build or an extension of an existing system?
- What's the latency budget? The cost budget?
This ambiguity is the test.
The interviewer didn't forget to specify these details. They left them out to see whether you'll ask smart questions, state reasonable assumptions, and make decisions under uncertainty — the same skills you use daily as a working engineer.
Without a framework, most candidates do one of two things: freeze up and ask 15 minutes of clarifying questions, or rush to draw boxes without understanding what they're building. Both are failing patterns.
A good system design problem framework gives you a third option: move through the ambiguity systematically, making it smaller at each step until you're designing a well-defined system with clear constraints.
If you're building this skill from scratch, the Grokking System Design Fundamentals course teaches these foundational thinking patterns with structured, progressive exercises.
The Five-Phase Framework
This framework works for any system design problem — from "Design a URL shortener" to "Design YouTube's recommendation engine." The phases are sequential, but you'll often loop back as you learn more about the problem.
Phase 1: Scope and Requirements (5–7 Minutes)
Goal: Turn a vague prompt into a well-defined problem with clear boundaries.
This phase has two jobs. First, identify which features you're designing. Second, define the non-functional requirements that constrain your architecture.
Functional scoping means asking: what does this system actually do? For "Design Twitter," you might narrow to: (1) users post tweets, (2) users follow other users, (3) users see a home timeline of tweets from people they follow, and (4) users can search tweets. That's four features. You've explicitly excluded DMs, trending topics, ads, notifications, and media upload — and you've told the interviewer why.
The scoping question that separates strong candidates from average ones is: "What are the two or three most important use cases for this system?" This forces prioritization. You can't design everything in 45 minutes, and demonstrating that you know this is a signal of engineering maturity.
Non-functional requirements define the system's quality attributes. The critical ones for almost every system design problem are:
| Requirement | What to Ask | Why It Matters |
|---|---|---|
| Scale | How many users? How many requests per second? How much data? | Determines whether you need sharding, caching, CDNs |
| Latency | What's the acceptable response time? p50 vs p99? | Drives caching strategy, database choice, async vs sync |
| Availability | What's the uptime target? 99.9% vs 99.99%? | Determines replication strategy, failover design |
| Consistency | Is eventual consistency acceptable? | Drives database choice, cache invalidation, conflict resolution |
| Durability | Can we afford to lose data? | Determines write-ahead logging, replication factor |
How to exit this phase: You should have 3–5 functional requirements written down, 3–4 non-functional constraints quantified (even roughly), and the interviewer should have nodded or corrected your assumptions. If you haven't written anything down, you haven't finished scoping.
Example output for "Design a News Feed":
Functional: (1) Users create posts (text + images). (2) Users follow other users. (3) Users see a chronological feed of posts from followed users. (4) Feed supports pagination.
Non-functional: 500M users, 10M daily active. 100K new posts/day. Feed load latency < 200ms at p99. High availability (99.99%). Eventual consistency acceptable for feed (a few seconds of delay is fine).
That paragraph takes about 90 seconds to say out loud, and now you have a design target.
Phase 2: Capacity Estimation (3–5 Minutes)
Goal: Put numbers on your constraints so your design decisions have quantitative backing.
Back-of-envelope math is not about getting exact numbers. It's about getting the right order of magnitude so you can make informed choices about storage, bandwidth, and whether you need horizontal scaling.
The three numbers that matter for most problems:
- Queries per second (QPS). For a news feed with 10M daily active users, each loading their feed 10 times a day: 10M × 10 / 86,400 ≈ 1,150 QPS for feed reads. Manageable for a single database. But if each feed load requires fetching from 200 followed users? Now you need to think about pre-computation or caching.
- Storage. 100K new posts per day × 1KB average post size = 100MB per day, or about 36GB per year. Small. But if each post has a 2MB image? Now it's 200GB per day, or 73TB per year. That changes your storage architecture entirely.
- Bandwidth. If each feed response is 50KB (20 posts × 2.5KB each) and you serve 1,150 feeds per second, that's about 57MB/s outbound. Comfortable for a single server, but you'll want a CDN for images.
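These three estimates are just a few multiplications, which makes them easy to sanity-check in a script. A quick sketch using the illustrative news-feed numbers from above:

```python
# Back-of-envelope estimates for the news feed example above.
# All inputs are rough order-of-magnitude figures, not measurements.

SECONDS_PER_DAY = 86_400

daily_active_users = 10_000_000
feed_loads_per_user_per_day = 10
posts_per_day = 100_000
avg_post_bytes = 1_000          # ~1 KB of text per post
feed_response_bytes = 50_000    # 20 posts x 2.5 KB each

read_qps = daily_active_users * feed_loads_per_user_per_day / SECONDS_PER_DAY
storage_per_day_mb = posts_per_day * avg_post_bytes / 1e6
bandwidth_mb_per_s = read_qps * feed_response_bytes / 1e6

print(f"Feed read QPS:      {read_qps:,.0f}")            # ~1,157
print(f"Post storage/day:   {storage_per_day_mb:,.0f} MB")  # ~100 MB
print(f"Outbound bandwidth: {bandwidth_mb_per_s:,.0f} MB/s") # ~58 MB/s
```

The point is not the script itself but the habit: any estimate you state in the interview should be reproducible as two or three lines of arithmetic like this.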
Common estimation shortcuts:
| Fact | Value | Useful For |
|---|---|---|
| Seconds in a day | ~86,400 (~100K for rough math) | Converting daily users to QPS |
| 1 million seconds | ~11.5 days | Relating time to scale |
| 1 billion seconds | ~31 years | Estimating long-term storage |
| 1 character | 1 byte (ASCII); 1–4 bytes (UTF-8) | Text storage |
| SSD random read | ~100 microseconds | Latency budgets |
| Network round trip (same data center) | ~0.5ms | Multi-service latency |
| Network round trip (cross-continent) | ~150ms | Multi-region design |
How to exit this phase: You should know whether your system is read-heavy or write-heavy, whether a single machine can handle the load or you need horizontal scaling, and roughly how much storage you need. These numbers directly feed into your design decisions in Phase 3.
Phase 3: High-Level Design (10–12 Minutes)
Goal: Produce a coherent architecture diagram where every component earns its place.
Start from the user and trace the request path. Draw the simplest possible system that satisfies your requirements, then add complexity only where the numbers from Phase 2 demand it.
The universal starting skeleton:
Client → Load Balancer → API Gateway → Application Service(s) → Database
Every system starts here. Then you ask: what does the load from Phase 2 tell me I need to add?
- Read-heavy system (100:1 read-to-write ratio)? Add a cache layer (Redis, Memcached) between the application and database. Add a CDN for static content.
- Write-heavy system? Add a message queue (Kafka, SQS) to absorb write spikes and process them asynchronously.
- Large-scale data? Partition the database. Choose a partitioning strategy (hash-based, range-based) based on access patterns.
- Low-latency requirement? Pre-compute results and store them. For a news feed, this means fanout-on-write: when a user posts, push the post to all followers' pre-built feeds.
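For the read-heavy case, the cache layer usually follows the cache-aside pattern: check the cache, fall back to the database on a miss, then populate the cache for the next reader. A minimal in-process sketch, where plain dicts stand in for Redis and the database:

```python
# Cache-aside read path. The dicts are stand-ins for Redis/Memcached
# and the primary database; names here are illustrative only.
cache = {}
db = {"user:42": {"name": "alice"}}

def get_user(key):
    value = cache.get(key)
    if value is not None:
        return value               # cache hit: no database round trip
    value = db.get(key)            # cache miss: read from the database
    if value is not None:
        cache[key] = value         # populate so subsequent reads are hits
    return value

get_user("user:42")   # miss: reads the DB and fills the cache
get_user("user:42")   # hit: served from the cache
```

In a real system you would also set a TTL and decide on an invalidation strategy when the underlying row changes, which is exactly the kind of trade-off the deep-dive phase explores.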
The most common mistake in this phase is adding components without justification. If you draw a Kafka box, you need to explain what problem it solves. "I'm adding Kafka because write spikes from viral posts could exceed our database's write throughput. The queue buffers those writes and lets us process them at a sustainable rate." That's a justified component. "I'm adding Kafka because distributed systems use message queues" is a memorized pattern, not an engineering decision.
API design tip: Before or while drawing your architecture, sketch 2–3 key API endpoints. This forces clarity about what the system actually does.
POST /v1/posts → Create a new post
GET /v1/feed?user_id=X&cursor=Y → Get paginated feed
POST /v1/follow → Follow a user
Defining these endpoints also signals to the interviewer that you think about systems from the user's perspective, not just from the infrastructure perspective.
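To make the contract concrete, the three endpoints above can be sketched as hypothetical in-memory handlers. A real service would sit behind a web framework; this sketch (using fanout-on-read for simplicity, with made-up function names) only pins down the request and response shapes:

```python
import itertools

# In-memory stand-ins for the posts and follows stores.
posts = {}          # post_id -> post record
follows = set()     # (follower_id, followee_id) pairs
_ids = itertools.count(1)

def create_post(author_id, content):            # POST /v1/posts
    post_id = next(_ids)
    posts[post_id] = {"post_id": post_id, "author_id": author_id,
                      "content": content}
    return {"post_id": post_id}

def follow(follower_id, followee_id):           # POST /v1/follow
    follows.add((follower_id, followee_id))
    return {"ok": True}

def get_feed(user_id, cursor=0, limit=20):      # GET /v1/feed
    followees = {b for a, b in follows if a == user_id}
    feed = [p for p in posts.values() if p["author_id"] in followees]
    feed.sort(key=lambda p: p["post_id"], reverse=True)  # newest first
    page = feed[cursor:cursor + limit]
    return {"posts": page, "next_cursor": cursor + len(page)}
```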
Phase 4: Deep Dive (12–18 Minutes)
Goal: Demonstrate technical depth on 1–2 critical components by exploring data models, algorithms, trade-offs, and failure modes.
This phase is where you win or lose the interview. The high-level design shows you can think architecturally. The deep dive shows you can build.
You won't get to choose which area the interviewer probes — but you can influence it. During Phase 3, flag the interesting trade-offs: "There's a key decision in how we handle feed generation — push vs pull — that I'd like to explore." Most interviewers will follow your lead if the topic is genuinely meaty.
How to structure a deep dive:
Step 1: State the options. "For feed generation, we have two approaches. Fanout-on-write: when a user posts, we immediately push the post into every follower's feed cache. Fanout-on-read: when a user opens their feed, we pull the latest posts from all followed users in real time."
Step 2: Compare on concrete axes. Don't say "it depends." Compare on specific, measurable dimensions:
| Dimension | Fanout-on-Write (Push) | Fanout-on-Read (Pull) |
|---|---|---|
| Feed read latency | Very low (pre-computed) | Higher (computed on request) |
| Write cost | High for users with millions of followers | Low (no write amplification) |
| Storage | More (every follower stores a copy) | Less (no pre-computed feeds) |
| Staleness | Can lag briefly while fan-out jobs run | Always fresh (computed at read time) |
| Complexity | Higher (need to manage fan-out jobs) | Simpler write path |
Step 3: Make a decision and justify it. "For most users, fanout-on-write gives us the sub-200ms feed latency we need. But for celebrity users with 10M+ followers, a single post would trigger 10 million write operations — that's unsustainable. So I'd use a hybrid: fanout-on-write for normal users, fanout-on-read for users above a follower threshold, say 100,000. Twitter actually uses this hybrid model in production."
That answer demonstrates three things interviewers value: you know both approaches, you compared them on real axes, and you made a practical decision that handles the hard edge case. The specific reference to Twitter's production architecture adds credibility.
Step 4: Address failure modes. "What happens if the fanout job fails halfway through? We'd process fans in batches with checkpointing. If a batch fails, we retry from the last checkpoint. Some followers might see the post a few seconds late, which is within our eventual consistency requirement."
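The hybrid push/pull decision plus checkpointed batching can be sketched in a few lines. The follower threshold, batch size, and function names are illustrative, and a real system would persist the checkpoint durably:

```python
# Sketch of hybrid fan-out with checkpointed batches.
CELEBRITY_THRESHOLD = 100_000   # illustrative cutoff, not a tuned value
BATCH_SIZE = 1_000

feed_cache = {}     # user_id -> list of post_ids (pre-computed feeds)
checkpoints = {}    # post_id -> index of the next unprocessed follower

def fan_out(post_id, author_followers):
    if len(author_followers) > CELEBRITY_THRESHOLD:
        return "pull"   # celebrity: followers fetch this post at read time

    # Resume from the last checkpoint if a previous attempt died midway;
    # followers in already-completed batches are not written twice.
    start = checkpoints.get(post_id, 0)
    for i in range(start, len(author_followers), BATCH_SIZE):
        for follower in author_followers[i:i + BATCH_SIZE]:
            feed_cache.setdefault(follower, []).insert(0, post_id)
        checkpoints[post_id] = i + BATCH_SIZE  # durable store in production
    return "push"
```

A retry after a crash simply calls `fan_out` again with the same arguments; the checkpoint makes the job resumable, and any followers served late stay within the eventual-consistency budget.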
Data model design is another common deep-dive target. For a news feed:
users: { user_id (PK), username, created_at }
posts: { post_id (PK), author_id (FK), content, media_url, created_at }
follows: { follower_id, followee_id, created_at } — composite PK
feed: { user_id, post_id, created_at } — pre-computed feed entries
State why you chose each structure. The feed table exists because you chose fanout-on-write. The follows table uses a composite primary key because the primary query is "does user A follow user B?" — a point lookup, not a scan.
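The composite-key reasoning is easy to demonstrate with an in-memory stand-in for the follows table: keyed by the (follower, followee) pair, the dominant query becomes a constant-time membership check rather than a scan.

```python
# In-memory stand-in for the follows table, keyed by the composite
# (follower_id, followee_id) pair. Rows here are illustrative.
follows = {(1, 2), (1, 3), (4, 2)}

def is_following(follower_id, followee_id):
    # O(1) point lookup on the composite key -- no table scan.
    return (follower_id, followee_id) in follows

is_following(1, 2)   # True
is_following(2, 1)   # False: following is directional
```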
Phase 5: Evaluate and Extend (3–5 Minutes)
Goal: Show engineering maturity by identifying bottlenecks, monitoring gaps, and future extensions.
This phase is your closing argument. Address three things:
Bottlenecks: "The fan-out service is our biggest bottleneck. A celebrity posting could generate millions of write operations. I'd monitor the fan-out queue depth and auto-scale workers based on queue lag."
Monitoring: "I'd track feed latency at p50/p99, fan-out job completion time, cache hit rate, and database replication lag. If cache hit rate drops below 95%, that's an early warning that our pre-computation strategy isn't covering enough requests."
Extensions the interviewer might ask about:
- "How would you add real-time notifications?" → WebSocket connections with a pub/sub layer.
- "How would you handle multi-region?" → Leader-follower replication with regional read replicas; writes route to the nearest leader.
- "How would you add content ranking?" → Replace chronological feed with a scoring model; pre-compute scores during fan-out.
You won't fully design these extensions. The interviewer wants to see that you can identify the next set of challenges and have a directional answer for each.
The Framework in Action: Complete Walkthrough
Here's the framework applied to "Design a Rate Limiter" in compressed form:
Phase 1 — Scope: Rate limiter for an API gateway. Limit by user ID. Return HTTP 429 when exceeded. Rules: 100 requests per minute per user. Must work across distributed servers (not per-server limiting). Must add less than 5ms of latency to each request.
Phase 2 — Estimate: 10M active users, average 20 requests/minute each = ~3.3M requests/second total. Counter storage: 10M users × 16 bytes per counter = ~160MB, which fits comfortably in memory on a single Redis instance — though at ~3.3M checks per second you'd shard the counters across a Redis cluster for throughput.
Phase 3 — Design: API Gateway → Rate Limiter Service → Redis (counter store) → Backend Services. Each request checks Redis before forwarding. If the counter exceeds 100, return 429. Use a sliding window algorithm for smooth limiting.
Phase 4 — Deep Dive: Compare rate limiting algorithms:
| Algorithm | Accuracy | Memory | Complexity | Used By |
|---|---|---|---|---|
| Fixed Window Counter | Low (burst at window edges) | Very Low | Simple | Basic APIs |
| Sliding Window Log | High | High (stores every timestamp) | Medium | Stripe |
| Sliding Window Counter | High (approximated) | Low | Medium | Cloudflare |
| Token Bucket | High | Low | Medium | AWS API Gateway, Stripe |
| Leaky Bucket | Smoothest output | Low | Medium | Network traffic shaping |
Decision: Token bucket. It handles burst traffic naturally (bucket can accumulate tokens), uses minimal memory (two numbers per user: token count and last refill timestamp), and is the industry standard — AWS API Gateway and Stripe both use it.
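A token bucket really does need only those two numbers per user. A minimal sketch, using the 100-requests-per-minute rule from the scope phase (constants and names are illustrative):

```python
import time

# Minimal token bucket: 100 requests/minute, bursts up to CAPACITY.
# Exactly two numbers stored per user: token count and last refill time.
CAPACITY = 100            # maximum burst size
REFILL_RATE = 100 / 60.0  # tokens added per second (100 per minute)

buckets = {}  # user_id -> (tokens, last_refill_timestamp)

def allow(user_id, now=None):
    now = time.monotonic() if now is None else now
    tokens, last = buckets.get(user_id, (CAPACITY, now))
    # Refill proportionally to elapsed time, capped at the bucket capacity.
    tokens = min(CAPACITY, tokens + (now - last) * REFILL_RATE)
    if tokens >= 1:
        buckets[user_id] = (tokens - 1, now)
        return True           # request admitted
    buckets[user_id] = (tokens, now)
    return False              # over the limit: respond with HTTP 429
```

In the distributed design, this logic would live in a Lua script or atomic operation against the shared counter store so that concurrent gateway nodes can't double-spend tokens.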
Phase 5 — Evaluate: Single point of failure is Redis. Solution: Redis Cluster with automatic failover. Race condition on counter updates: use Redis MULTI/EXEC or Lua scripts for atomic operations. Extension: add per-endpoint rate limiting, not just per-user.
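The race condition exists because "read, check, increment" is three separate steps: two concurrent requests can both read count = 99 and both pass a non-atomic check. In Redis you'd close that window with MULTI/EXEC or a Lua script; the same atomicity requirement can be illustrated in-process with a lock (a sketch, not the Redis mechanism itself):

```python
import threading

# Check-and-increment must be a single atomic step. A lock provides
# that atomicity in-process; a Lua script provides it in Redis.
LIMIT = 100
counts = {}
_lock = threading.Lock()

def try_request(user_id):
    with _lock:                      # one check-and-increment at a time
        count = counts.get(user_id, 0)
        if count >= LIMIT:
            return False             # limit reached: respond with 429
        counts[user_id] = count + 1
        return True
```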
How the Framework Changes by Seniority Level
The same five phases apply at every level. What changes is the depth and breadth of what's expected in each phase.
| Phase | Junior (L3–L4) | Senior (L5) | Staff+ (L6+) |
|---|---|---|---|
| Scope | Ask basic functional questions | Identify the hardest constraints proactively | Challenge assumptions, reframe the problem |
| Estimate | Know the formulas, get the magnitude right | Derive estimates from stated requirements without prompting | Use estimates to pre-emptively eliminate naive approaches |
| Design | Produce a reasonable architecture with basic justification | Justify every component with quantitative reasoning | Design for organizational concerns (team ownership, deploy boundaries) |
| Deep Dive | Explain how components work | Derive solutions from first principles, analyze failure modes | Propose novel approaches, reason about cross-system impacts |
| Evaluate | Mention that monitoring is important | Specify concrete metrics and alerting thresholds | Identify operational risks, rollout strategies, migration paths |
At the senior level and above, you should be driving the conversation with minimal prompting. The interviewer shouldn't have to say "what about failure modes?" — you should address them before being asked. For structured practice at this depth, the Grokking the Advanced System Design Interview course covers complex scenarios where the framework must flex to handle multi-system dependencies and organizational constraints.
Trade-Off Analysis: The Core Skill Inside the Framework
If there's one skill that the framework is designed to exercise, it's trade-off analysis. Every phase requires decisions, and every decision involves trade-offs.
A weak trade-off analysis sounds like: "We could use SQL or NoSQL. I'll use NoSQL because it scales better."
A strong trade-off analysis sounds like: "Our access pattern is key-value lookups with no joins — that favors a NoSQL store like DynamoDB. But we also need to query posts by timestamp within a user's feed, which is a range query. DynamoDB supports this with sort keys on the partition. If we needed complex relational queries — say, for an analytics dashboard — I'd add a separate OLAP store rather than forcing the read path into a relational model."
The difference: the strong version names the access pattern, explains why it maps to a specific technology, acknowledges the limitation, and proposes how to handle the edge case. This is what interviewers mean when they say they're evaluating "trade-off thinking."
Framework for any trade-off decision:
- Name the decision point explicitly ("I need to choose between X and Y").
- State the axes that matter for this specific problem (latency, cost, complexity, consistency).
- Compare the options on those axes with specifics.
- Make a choice and state the one-line reason.
- Acknowledge what you're giving up.
If you apply this five-step pattern to every decision in your design, your interview answer will be dramatically stronger than most candidates — because most candidates just state their choice without the comparison.
Handling Curveballs Within the Framework
The framework handles curveballs by design. When an interviewer throws a wrench — "Now assume the system needs to work across three continents" or "What if your primary database goes down?" — you don't need a new framework. You apply the same phases:
- Re-scope: "Multi-region changes our consistency model. I'll assume eventual consistency across regions is acceptable, with strong consistency within a region."
- Re-estimate: "Cross-continent latency is ~150ms. If a user in Europe hits a US database, we can't meet our 200ms latency budget. We need regional read replicas."
- Re-design: Adjust the architecture to include regional deployments.
- Deep dive into the new constraint: conflict resolution, replication lag, failover routing.
- Evaluate the new failure modes introduced by the multi-region setup.
The framework doesn't prevent curveballs. It gives you a systematic way to absorb and respond to them without losing your composure or your structure. The Ultimate System Design Interview Guide (2026) includes walkthroughs of exactly these kinds of mid-interview pivots across 20+ practice problems.
Anti-Patterns: What the Framework Protects You From
The Kitchen Sink. Drawing every component you know — Kafka, Redis, Elasticsearch, a GraphQL gateway, a service mesh — without justifying any of them. The framework forces justification in Phase 3 because every component must trace back to a requirement from Phase 1 or a number from Phase 2.
The Monologue. Talking for 20 minutes straight without checking in. The framework's phase transitions are natural checkpoints to ask the interviewer: "Does this scope make sense?" (end of Phase 1), "Should I go deeper on the data model or the caching layer?" (start of Phase 4).
The Perfect Design. Trying to solve every edge case instead of making forward progress. The framework pushes you to state assumptions, move on, and revisit only if the interviewer asks.
The Blank Stare. Not knowing where to start when the question is vague. The framework gives you a first move that always works: "Let me start by understanding the requirements." That sentence has never been wrong.
The Memorized Solution. Reciting a design you've seen before without adapting it to the current constraints. The framework forces adaptation because Phase 1 and Phase 2 produce different numbers and priorities for every problem, even problems with the same title.
FAQ: System Design Problem Framework
What is the best framework for system design interviews?
The most widely recommended framework follows five phases: scope the problem by clarifying requirements, estimate capacity to quantify constraints, design a high-level architecture with justified components, deep dive into 1–2 critical areas with trade-off analysis, and evaluate by identifying bottlenecks and extensions. This structure works for any problem because it separates the thinking process from the specific domain.
How do you approach an ambiguous system design question?
Start by acknowledging the ambiguity explicitly, then reduce it systematically. Ask 3–5 targeted questions to define functional scope (what does the system do?) and non-functional constraints (at what scale, latency, and availability?). State your assumptions out loud and write them down. This transforms an ambiguous question into a well-defined design target within the first five minutes.
How long should I spend on requirements in a system design interview?
Spend 5–7 minutes in a 45-minute session and 7–10 minutes in a 60-minute session. The goal is to define enough scope to start designing confidently, not to eliminate all ambiguity. If you're still asking questions at minute 10, you're likely overthinking — state your assumptions and move to the high-level design.
What are the most common system design trade-offs interviewers test?
The six trade-offs that appear in nearly every system design interview are: consistency vs. availability, latency vs. throughput, read optimization vs. write optimization, SQL vs. NoSQL, push vs. pull architectures, and monolith vs. microservices. For each, you should be able to name when each side is preferable and give a real-world example.
How do you handle it when the interviewer redirects you mid-design?
Treat redirects as new constraints, not corrections. Re-scope the specific area they're pointing to, update your estimates if the numbers change, and adjust your design. The five-phase framework is recursive — you can apply it at any granularity. A redirect to "go deeper on the caching layer" means: scope the cache requirements, estimate cache hit rates and memory needs, design the cache topology, and analyze trade-offs.
Should I always do back-of-envelope calculations in system design interviews?
Yes, but keep them quick — 2–3 minutes maximum. You don't need precise numbers. You need the right order of magnitude to justify decisions. "We need roughly 10,000 QPS, which a single Postgres instance can handle" is useful. "We need exactly 11,574 QPS" is wasted precision. The goal is to make data-driven design choices, not to demonstrate arithmetic.
What if I don't know a technology the interviewer asks about?
Be honest: "I haven't worked with that specific technology, but based on what I know about similar systems, here's how I'd reason about it." Then apply first principles. If the interviewer asks about Cassandra and you've only used DynamoDB, explain the properties you'd need (wide-column store, tunable consistency, partition-based distribution) and reason from there. Honesty plus first-principles reasoning always beats a bluff.
How do I practice the framework effectively?
Pick a system design problem, set a 45-minute timer, and work through all five phases — speaking out loud as if you're in an interview. Record yourself. Afterward, compare your design to published solutions and note which phases felt weakest. Practice that specific phase on the next problem. After 10–15 problems, the framework should feel automatic.
Does this framework work for object-oriented design interviews too?
The scoping and trade-off phases transfer directly. The design phase shifts from architecture diagrams to class diagrams and API contracts, and the deep dive focuses on design patterns and SOLID principles rather than distributed systems concepts. The core thinking structure — reduce ambiguity, make justified decisions, evaluate trade-offs — is universal.
How does the framework differ for senior vs. junior candidates?
Junior candidates use the framework as a guide: the phases tell them what to do next. Senior candidates use it as a scaffold: the phases ensure they don't miss anything, but they spend less time on basics (scoping, estimation) and more time on depth (novel trade-offs, failure analysis, organizational concerns). Staff-level candidates often reshape the framework itself — reframing the problem before designing, or proposing multiple architectures and comparing them.
TL;DR
The system design problem framework has five phases: (1) Scope — clarify functional and non-functional requirements in 5–7 minutes; (2) Estimate — do back-of-envelope math on QPS, storage, and bandwidth; (3) Design — draw a high-level architecture where every component traces back to a requirement; (4) Deep Dive — go deep on 1–2 components with structured trade-off analysis (state options, compare on concrete axes, decide, acknowledge what you're giving up); (5) Evaluate — identify bottlenecks, monitoring needs, and extensions. Spend 35–40% of your time on the deep dive. The framework's core purpose is turning ambiguity into structured decisions — the exact skill system design interviews test.
Further Reading
- Martin Kleppmann, Designing Data-Intensive Applications — Chapter 1 covers the foundational trade-offs (reliability, scalability, maintainability) that this framework operationalizes.
- Google SRE Book, Chapter 3: Embracing Risk — how Google quantifies availability targets, directly applicable to the Scope phase.
- AWS Well-Architected Framework — Amazon's production framework for evaluating architectures, structured around pillars that map to Phase 5 (Evaluate).