How long is a system design interview?

Most system design interviews run 45 to 60 minutes. The first 5 to 10 minutes are clarifying requirements, the next 25 to 35 minutes are the actual design and deep-dives, and the last 5 minutes are wrap-up and candidate questions.

How do I prepare for a system design interview in 2026?

Follow a structured approach: learn the fundamentals (caching, sharding, load balancing, replication), master a 4-step framework (clarify, data/API, high-level design, deep dive), practice 8-10 classic problems across four question categories, and do mock interviews with real engineers. Eight weeks of consistent prep is sufficient for most engineers. The 2026 rubric now also requires cost reasoning, operational maturity, and AI-aware design knowledge.

What are the most common system design interview questions in 2026?

Common questions fall into four categories: Classic product designs (Design Twitter, Design YouTube, Design Uber, Design WhatsApp), Infrastructure designs (Rate Limiter, Key-Value Store, URL Shortener, Notification System), AI-adjacent designs (RAG Service, Vector Search, LLM-Powered Recommendation Feed), and Correctness/Operational designs (Payment Processing, Multi-Tenant Analytics, Distributed Job Scheduler).

What do interviewers look for in a system design interview?

Interviewers grade four things: Judgment (clarifying questions, tradeoff defense, committing to choices), Depth (going 3 layers deep on 2-3 components), Operational Maturity (observability, deployment strategy, cost reasoning), and Communication (handling pushback, checking in, treating the interview as a conversation). The weight of each shifts by level: senior loops weight judgment and depth highest; staff and principal shift toward operational maturity and communication.

Do I need distributed-systems experience to pass a system design interview?

No. What you need is conceptual fluency: understanding the patterns, knowing the tradeoffs, and being able to reason through scale and failure modes. Many engineers who have distributed-systems experience still fail because they haven't practiced the interview format specifically. The interview tests your ability to think systematically under time pressure, not your production experience.

What is the best system design interview course or resource?

Start with a free guide like this one for the framework and concepts. For structured practice with worked solutions, Grokking the System Design Interview by DesignGurus.io is a widely recommended course that covers 25+ real interview problems. For deeper reading, Designing Data-Intensive Applications by Martin Kleppmann and System Design Interview by Alex Xu are excellent supplements. The highest-leverage activity, regardless of resources, is mock interviews with engineers who currently interview at your target level.

How long does it take to prepare for a system design interview?

Eight weeks is a reasonable timeline for engineers without prior system design interview experience, dedicating about 1-2 hours per day. Two weeks is enough for engineers who have done system design interviews before and just need a refresher. If you only have a weekend, focus on the framework, the 2026 rubric changes, and one or two practice problems. The single highest-leverage activity in any timeframe is mock interviews.

Are AI and ML system design questions on the interview rubric in 2026?

Yes, increasingly. AI-adjacent questions (vector search, RAG services, LLM-serving infrastructure, recommendation systems with generative components) have moved from niche to mainstream. Even when the question is not AI-specific, surfacing awareness of the AI layer (embeddings, vector databases, semantic caching) is now treated as a seniority signal at many FAANG-tier companies.

Rate Limiting for System Design Interviews: Algorithms, Placement, and Distributed Counting

01 Why Rate Limiting Is More Than an Algorithm Choice

Most candidates arrive at rate-limiting interviews with one piece of vocabulary: "I'd use token bucket." Then the interviewer asks where they'd enforce the limit, how they'd do it across a fleet of servers, and how they'd tell a legitimate spike from abuse. The token bucket answer covers maybe 10% of the topic. The other 90% is what separates senior candidates from mid-level ones.

The depth here lives in three areas. First, placement: where in the stack you enforce the limit, which determines what kinds of attacks and what kinds of users you can distinguish. Second, distributed counting: how multiple API servers share a counter without that counter becoming a bottleneck. Third, differentiation: telling a real user with high engagement from a bot, when they look identical at the rate-limiting layer.

This page covers all three. The algorithms are here too, but they're a setup for the placement and distribution discussion that follows. The senior framing is that algorithm choice is downstream of where you're enforcing and what you're trying to distinguish.

The Senior Move

The senior signal in rate limiting interviews isn't naming token bucket. It's recognizing that placement is the more consequential decision: edge vs API gateway vs service vs per-user. The algorithm follows from where you're enforcing. Naming the placement layers explicitly is what separates senior candidates from "I know token bucket" candidates.

02 What Rate Limiting Actually Does

Rate limiting caps the number of requests a particular client (or category of clients) can make in a given window. The limit might be 100 requests per minute per user, 1000 per IP per hour, 10K per API key per day, or any combination. When a client exceeds the limit, the system rejects further requests until the window resets or refills.

Rate limiting solves four problems at once. The right configuration depends on which one you're optimizing for:

Capacity protection. Without limits, a single client (or a small number of bad ones) can consume disproportionate resources, degrading service for everyone else. Limits keep one user's traffic from breaking the system for others.
Cost control. Every request costs something: CPU, database load, downstream API calls. Limits cap the maximum cost a user can impose, which matters even more when downstream services charge per call (LLM APIs, payment processors, etc.).
Abuse prevention. Brute-force login attempts, scraping, denial-of-service. Limits make these attacks more expensive for the attacker by capping their request rate.
Tier enforcement. Different customers get different limits. Free tier gets 100 requests per hour; paid tier gets 10,000; enterprise gets custom limits. Rate limiting is how the business model becomes operational.

What rate limiting does not do, despite what some prep material implies:

Stop sophisticated attackers. A determined attacker spreads requests across thousands of IPs, rotates user agents, and stays under per-IP limits. Rate limiting is a speed bump, not a wall.
Replace authentication. Limits applied to anonymous traffic are cruder than limits applied per authenticated user. The two are complementary, not substitutable.
Solve all capacity problems. If your service is overloaded by legitimate users at fair rates, rate limiting is the wrong fix. You need more capacity or better caching, not stricter limits.

03 The Four Algorithms

Four algorithms cover almost every production rate limiter. They differ in how they account for time and how forgiving they are of bursts. The summary below contrasts their behavior; the cards that follow describe when to reach for each.

Four Rate-Limiting Algorithms

Token bucket allows bursts up to bucket size; leaky bucket smooths to a fixed rate. Fixed window is cheap but allows 2x bursts at boundaries; sliding window is smooth but more expensive to implement.

Token Bucket

Default for most APIs

The bucket holds up to N tokens. Tokens refill at a steady rate. Each request consumes one token; if no tokens, reject. Tokens carry over up to the bucket capacity, allowing bursts after quiet periods.

The most common production algorithm. AWS API Gateway and Stripe both describe their limiters as token buckets, and many rate-limiter-as-a-service offerings use the same model. Easy to reason about: "100 requests per minute, can burst to 200."

What it trades: Allows bursts up to bucket size, which is usually desirable but can be a problem if downstream systems can't absorb a burst. Tunable via the bucket size parameter.
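The refill-and-spend rule described above can be sketched in a few lines. This is a minimal single-process illustration, not a production limiter: the class name, its fields, and the choice to pass `now` in explicitly (so the behavior is easy to trace) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    capacity: float            # maximum tokens the bucket can hold (burst size)
    refill_per_sec: float      # steady refill rate
    tokens: float = 0.0        # current token count
    last_refill: float = 0.0   # timestamp of the last refill, in seconds

    def allow(self, now: float, cost: float = 1.0) -> bool:
        """Refill for the elapsed time, then try to spend `cost` tokens."""
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# A bucket of 5 tokens refilling at 1 token/second, starting full.
bucket = TokenBucket(capacity=5, refill_per_sec=1.0, tokens=5, last_refill=0.0)
burst = [bucket.allow(now=0.0) for _ in range(6)]  # five allowed, sixth rejected
after_wait = bucket.allow(now=2.0)                 # two tokens refilled by t=2
```

The burst of five succeeds because the bucket starts full; the sixth request is rejected until time passes and tokens refill.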

Leaky Bucket

When downstream is rate-sensitive

Requests enter a queue (the bucket). The queue drains at a fixed rate, regardless of how fast requests arrive. If the queue is full, reject. Output is smooth: a steady drip of requests at the configured rate.

Used in network traffic shaping and anywhere downstream systems can't handle bursts. Less common in user-facing APIs because it can introduce latency (requests wait in the queue rather than being rejected immediately).

What it trades: Smooth output at the cost of queue latency. Doesn't allow bursts at all, which is sometimes the wrong tradeoff for user experience.
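A queue-based leaky bucket might look like the sketch below. It assumes a single process and an explicit `now` argument; the class and method names are illustrative, and a real traffic shaper would drain on a timer rather than on demand.

```python
from collections import deque


class LeakyBucket:
    def __init__(self, capacity: int, drain_per_sec: float):
        self.capacity = capacity            # maximum queued requests
        self.drain_per_sec = drain_per_sec  # fixed output rate
        self.queue = deque()
        self.last_drain = 0.0

    def offer(self, request) -> bool:
        """Queue the request if there is room; otherwise reject it."""
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(request)
        return True

    def drain(self, now: float) -> list:
        """Release the requests that are due by `now`, at the fixed rate."""
        due = int((now - self.last_drain) * self.drain_per_sec)
        released = [self.queue.popleft() for _ in range(min(due, len(self.queue)))]
        if due > 0:
            # Advance only by the time the whole drain slots account for.
            self.last_drain += due / self.drain_per_sec
        return released


bucket = LeakyBucket(capacity=3, drain_per_sec=2)
accepted = [bucket.offer(i) for i in range(5)]  # queue fills after three
released = bucket.drain(now=1.0)                # two requests leave in one second
```

Arrivals beyond the queue capacity are rejected immediately, while accepted requests leave at exactly the configured rate regardless of how fast they arrived.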

Fixed Window

Cheap and simple

Count requests per fixed time window (per minute, per hour). When the count exceeds the limit, reject. The counter resets at the start of the next window.

Easy to implement. Cheap to operate (just a counter and a window timestamp). Used heavily where simplicity matters more than precision: simple per-IP limits, basic abuse prevention.

What it trades: The boundary problem: a client can send the full limit at the end of one window plus the full limit at the start of the next, achieving 2x the rate briefly. Acceptable for most uses, fatal for some.
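The boundary problem is easiest to see by running it. The sketch below is an in-memory fixed-window counter with assumed names; it never expires old windows, which a real implementation would need to do.

```python
from collections import defaultdict


class FixedWindowLimiter:
    def __init__(self, limit: int, window_sec: int):
        self.limit = limit
        self.window_sec = window_sec
        self.counts = defaultdict(int)  # (client, window index) -> request count

    def allow(self, client: str, now: float) -> bool:
        window = int(now // self.window_sec)
        key = (client, window)
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True


limiter = FixedWindowLimiter(limit=100, window_sec=60)
# 100 requests just before the boundary, 100 just after it.
late = sum(limiter.allow("client-a", now=59.5) for _ in range(100))
early = sum(limiter.allow("client-a", now=60.5) for _ in range(100))
```

Both batches are admitted in full, so the client gets 200 requests through in about one second against a nominal limit of 100 per minute.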

Sliding Window

When precision matters

Track requests over a rolling time window (the last N seconds, continuously). When the count in the window exceeds the limit, reject. No fixed boundaries; the window moves forward continuously.

Most accurate of the four. Eliminates the boundary problem of fixed windows. Two implementations are common in production: a weighted hybrid of two adjacent fixed windows (an approximation, and Cloudflare's approach), or a sliding window log (exact, but more expensive).

What it trades: More expensive to implement: needs to store per-request timestamps or weighted approximations. The accuracy is usually worth the cost at scale.
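The weighted two-window approximation can be sketched as follows. It estimates the rolling count as the previous window's count, weighted by how much of that window still overlaps the rolling window, plus the current window's count. The names are illustrative, and the sketch assumes a single client and timestamps that never go backwards.

```python
class SlidingWindowCounter:
    def __init__(self, limit: int, window_sec: float):
        self.limit = limit
        self.window_sec = window_sec
        self.current_window = 0
        self.current_count = 0
        self.previous_count = 0

    def _roll(self, now: float) -> None:
        """Shift counters when `now` has moved into a later fixed window."""
        window = int(now // self.window_sec)
        if window == self.current_window + 1:
            self.previous_count = self.current_count
            self.current_count = 0
        elif window > self.current_window + 1:
            self.previous_count = 0
            self.current_count = 0
        self.current_window = max(window, self.current_window)

    def estimate(self, now: float) -> float:
        self._roll(now)
        elapsed_fraction = (now % self.window_sec) / self.window_sec
        return self.previous_count * (1 - elapsed_fraction) + self.current_count

    def allow(self, now: float) -> bool:
        if self.estimate(now) + 1 > self.limit:
            return False
        self.current_count += 1
        return True


limiter = SlidingWindowCounter(limit=100, window_sec=60)
late = sum(limiter.allow(now=59.5) for _ in range(100))   # fills the first window
early = sum(limiter.allow(now=60.5) for _ in range(100))  # just after the boundary
mid = sum(limiter.allow(now=90.0) for _ in range(100))    # halfway through window 2
```

Unlike the fixed window, the batch sent just after the boundary is rejected, because nearly all of the previous window still counts. Halfway through the next window, half of the old count has aged out and 50 requests are admitted.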

The interview move on algorithm choice

"Which algorithm would you use?" The strong response picks one and ties it to the workload. "Token bucket as the default for our public API: it allows bursts which match real user behavior, and it's well-understood operationally. Sliding window if we needed strict per-second guarantees, like for billing or DDoS protection. Fixed window only for simple per-IP throttling where the boundary spike doesn't matter." Three sentences, three workloads, three choices.

04 Where to Enforce Limits: The Placement Layers

Algorithm choice is the part most candidates focus on. Placement is the part most interviewers actually probe. The same algorithm at different layers protects against different things and lets through different things. The senior framing names the layers explicitly.

Four Layers Where Rate Limiting Lives

Rate limits live at four layers, each protecting against a different threat. Edge limits stop crude DDoS; API gateway limits enforce tier; service limits cap per-endpoint cost; per-user limits enforce business rules. Most production systems run limits at all four.

Layer 1: Edge / CDN

Cloudflare, Fastly, AWS Shield

The first defense, applied before traffic reaches your infrastructure. Limits are crude (per-IP, per-region, per-ASN) because the edge has limited information about the user. Filters out the worst floods before they cost you anything.

Protects against: Volumetric DDoS, scraping, basic abuse. Cheapest layer because rejected requests never reach your servers.

Layer 2: API Gateway

Kong, ALB, Envoy, custom gateway

The gateway sees authenticated requests and can apply richer limits: per-API-key, per-tier, per-route. This is where business-model rate limits live: free tier vs paid tier, per-product limits, route-specific quotas.

Protects against: Tier abuse, individual API-key overuse, route-specific overload. The most common single layer; many systems have only this one.

Layer 3: Service / Endpoint

Application code, sometimes a service mesh

Limits enforced inside the service itself, often per-endpoint or per-resource. The limits can be semantic: "no more than 10 expensive operations per user per minute" where "expensive" is something only the application knows.

Protects against: Specific endpoint overload, expensive operations, abuse patterns the gateway can't see. Often pairs with the gateway rather than replacing it.

Layer 4: Per-user / Per-resource

Application logic + Redis or counters

The most fine-grained: limits tied to specific business semantics. "User can call our LLM endpoint 100 times per day on the free tier." "Each document can be edited by at most 50 collaborators per minute." "User can send 5 password reset emails per hour."

Protects against: Business-rule violations, cost runaway, social abuse patterns. The hardest to implement correctly; the most valuable when needed.

The interview move on placement

"Where would you enforce the limit?" The strong response names multiple layers. "Edge for DDoS protection, API gateway for per-API-key enforcement, and per-user limits in the application for the business-specific rules. Each layer protects against different threats; we need all of them." That sentence does the work. The weak response picks one layer and stops there.

05 Distributed Rate Limiting: The Depth Probe

This is the question that catches mid-level candidates. Single-machine rate limiting is easy: keep a counter in memory, increment on each request. The interesting version: how do you rate limit across N API servers without the shared counter becoming a bottleneck?

Three approaches

Centralized counter (Redis)

Every API server reads and increments a counter in a shared Redis instance. Atomic operations (INCR) make this correct. Each request adds one round trip to Redis, which is fast (sub-millisecond) but not free at scale.

Pros: simple, accurate, well-understood, which is why it's the production default for most teams. Cons: every request takes a Redis hit; if Redis is slow or unavailable, your rate limiter degrades; the Redis instance can become a bottleneck at very high request rates.

Local counters with periodic sync

Each API server keeps its own local counter and reports to a central system periodically (every few seconds). The central system aggregates and pushes back the global state. Each server enforces locally based on the most recent global view.

Pros: no per-request network hop. Local enforcement keeps working if the central system is unavailable. Cons: the limit is approximate, not exact. Bursts that happen between syncs go undetected. Acceptable for "soft" limits where some overshoot is OK; not acceptable for billing or hard quotas.

Allocate quota slices to each server

Divide the global limit into per-server allocations. If the global limit is 1000 requests per minute and you have 10 servers, each gets 100. Each server enforces its slice independently with no coordination.

Pros: zero coordination, maximum performance. Cons: extremely lossy when traffic is uneven across servers. Server A hits its quota and rejects users while Server B has spare capacity it never uses. Works well only when you have very even load distribution and many users (so no single user concentrates).
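The cost of uneven load is easy to quantify. The function below is a hypothetical helper that computes how many requests are admitted in one window when each server enforces an equal slice. The traffic figures in the skewed case are made up for illustration.

```python
from typing import List


def admitted_with_slices(global_limit: int, per_server_traffic: List[int]) -> int:
    """Requests admitted in one window when each server enforces an equal slice."""
    slice_size = global_limit // len(per_server_traffic)
    return sum(min(traffic, slice_size) for traffic in per_server_traffic)


# Global limit 1000 across 10 servers, so each slice is 100.
even = admitted_with_slices(1000, [100] * 10)          # traffic spread evenly
skewed = admitted_with_slices(1000, [550] + [50] * 9)  # one hot server
```

Both cases offer exactly 1000 requests, which the global limit would allow. With even traffic all 1000 are admitted; with one hot server only 550 are, because the hot server rejects 450 requests while the other nine leave half their slices unused.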

The Redis approach in practice

For most production systems, the centralized-Redis approach is the right default. The cost (one Redis call per request) is small. The accuracy is exact. Redis can handle hundreds of thousands of operations per second on a single instance, and clustering scales it further.

The depth probe within this approach: Redis can fail. What happens then? The two reasonable answers:

Fail open. If Redis is unavailable, allow all requests through. Prioritize availability; accept that you can't enforce limits during the outage. Standard for user-facing API rate limits where blocking everyone is worse than briefly allowing too much.
Fail closed. If Redis is unavailable, reject all requests until it recovers. Prioritize correctness; accept the user-visible outage. Standard for billing and hard quotas where briefly allowing unlimited usage is unacceptable.

Naming this choice explicitly is the senior signal. The weak response handwaves Redis as "always available." The strong response acknowledges that "Redis can fail; we'd fail open for our public API and fail closed for billing-critical endpoints."
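One way to make the choice explicit in code is to pass the policy in as a parameter rather than burying it in an exception handler. The sketch below is store-agnostic. `CounterStoreUnavailable`, `check_limit`, and the two increment functions are hypothetical names; in production the `increment` callable would wrap an atomic Redis INCR and translate connection errors into the exception shown here.

```python
from typing import Callable


class CounterStoreUnavailable(Exception):
    """Raised by the (hypothetical) counter store when it cannot be reached."""


def check_limit(increment: Callable[[str], int], key: str, limit: int,
                fail_open: bool) -> bool:
    """Return True if the request should be allowed."""
    try:
        count = increment(key)  # atomic increment, returns the new count
    except CounterStoreUnavailable:
        # The configured policy decides what an outage means for this endpoint.
        return fail_open
    return count <= limit


counts = {}


def in_memory_increment(key: str) -> int:
    counts[key] = counts.get(key, 0) + 1
    return counts[key]


def broken_increment(key: str) -> int:
    raise CounterStoreUnavailable("store unreachable")


normal = [check_limit(in_memory_increment, "user-1", 2, fail_open=True) for _ in range(3)]
public_api = check_limit(broken_increment, "user-1", 2, fail_open=True)   # allowed
billing = check_limit(broken_increment, "user-1", 2, fail_open=False)     # rejected
```

The same outage produces opposite outcomes for the two endpoints, and the difference is visible at the call site instead of being an accident of error handling.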

The single-machine rate limiter is easy. The distributed version is where the interview goes. Naming Redis as the default and the fail-open vs fail-closed choice is the senior signal.

06 Telling Legitimate Spikes from Abuse

Here's the hard problem: a real user who's having a great session looks identical to a bot at the rate-limiting layer. Both make many requests in a short time. Rate limiting alone cannot distinguish them. The senior signal is recognizing this gap and naming what fills it.

Three techniques to differentiate

Tiered limits

Authenticated users get higher limits than anonymous traffic. Paid users get higher limits than free users. Long-tenured accounts get higher limits than new ones. The limit becomes a function of trust, not just a flat ceiling.

This works because legitimate users self-select into authenticated and paid paths. Bots and abusers usually stay anonymous because authentication adds friction. The tiers don't perfectly distinguish, but they push the load toward the tiers that have higher limits and away from the anonymous tier where attacks concentrate.

Behavioral signals

Beyond raw request count, look at request patterns: is the user clicking through the UI like a human, or making API calls in a tight loop? Are they distributed across endpoints, or hammering one? Does the request pattern correlate with browser activity, or look automated?

This is where rate limiting blends into bot detection. Tools like Cloudflare Bot Management, AWS WAF, and similar services apply behavioral models that go far beyond simple counters. Build them in-house only if you have a specific reason; for most products, the managed service is the right answer.

Graduated response: captcha, slowdown, then block

Don't just allow or reject. When a user crosses the soft limit, present a captcha. If they pass, raise the limit. If they fail or skip it, slow down their requests (rate limit + add latency). Only block as the last resort.

The graduated response separates "we think you might be a bot" from "we're sure you're a bot." Real users can prove themselves; bots can't pass captchas reliably; the friction concentrates on the questionable cases without locking out clear humans.
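The escalation ladder can be written as a small decision function. This is a sketch under assumptions: the `hard_limit` threshold, the `Action` names, and the convention that `challenge_passed` is `None` until a captcha has been shown are all introduced here for illustration.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    ALLOW = "allow"
    CHALLENGE = "challenge"  # present a captcha
    SLOW_DOWN = "slow_down"  # keep serving, but with added latency
    BLOCK = "block"


def decide(requests_in_window: int, soft_limit: int, hard_limit: int,
           challenge_passed: Optional[bool]) -> Action:
    """challenge_passed is None until the client has been shown a captcha."""
    if requests_in_window <= soft_limit:
        return Action.ALLOW
    if requests_in_window > hard_limit:
        return Action.BLOCK      # last resort
    if challenge_passed is None:
        return Action.CHALLENGE
    if challenge_passed:
        return Action.ALLOW      # passed: limit effectively raised to hard_limit
    return Action.SLOW_DOWN      # failed or skipped the captcha
```

For example, with a soft limit of 100 and a hard limit of 500, a client at 150 requests is challenged first, allowed if it passes, slowed if it does not, and blocked only once it exceeds 500.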

The Senior Move

The senior signal here is recognizing that rate limiting alone can't distinguish a real user spike from abuse. Saying "we'd add tiered limits, behavioral signals from a bot detection service, and graduated responses (captcha before block) so we don't lock out legitimate users with high engagement" is the move. The weak response sets a flat limit and assumes everyone above it is malicious.

07 What to Do When the Limit Is Hit

The other half of rate limiting: what happens to a request that exceeds the limit. The default answer is "reject with 429 Too Many Requests" but there are other options, each appropriate in different cases.

Three response strategies

Reject with 429. Return HTTP 429, ideally with a Retry-After header telling the client when they can try again. Standard for public APIs, easy for clients to handle correctly. The user-facing equivalent is showing "too many requests, please slow down."
Queue and process. Accept the request but process it later via a queue. The user gets an "accepted, processing" response and the work happens within the limit. Useful for async work where the response can be delayed without harming the user experience. The message queues deep-dive covers the queueing mechanics.
Degrade. Allow the request but return a degraded response. Lower image resolution, fewer results, no expensive features. Common in search and recommendation systems where you can serve a cheaper version under load.

The choice depends on the operation. Reject is right for everything user-facing where the alternative would be misleading. Queue is right for async work. Degrade is right when a partial response is genuinely better than nothing.

The Retry-After contract

When you reject, tell the client when to retry. The HTTP Retry-After header takes either a number of seconds or a specific date. Well-behaved clients respect this; badly-behaved clients ignore it. Either way, the contract is yours to communicate.

The interview move: name the header explicitly. "We'd return 429 with a Retry-After header indicating when the client can retry, plus a structured response body explaining which limit was hit and what they could do about it." That signals real-world API design experience.
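A rejection that honors this contract might be assembled as below. The function is framework-agnostic and returns a plain (status, headers, body) tuple. The body field names are assumptions, not a standard; only the 429 status and the Retry-After header come from HTTP itself.

```python
import json
import math
from typing import Dict, Tuple


def too_many_requests(limit_name: str, limit: int, window_sec: int,
                      seconds_until_reset: float) -> Tuple[int, Dict[str, str], str]:
    """Build (status, headers, body) for a rejected request."""
    retry_after = max(1, math.ceil(seconds_until_reset))
    headers = {
        "Retry-After": str(retry_after),  # delay-seconds form of the header
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "error": "rate_limited",
        "limit": limit_name,              # which limit was hit
        "allowed": limit,
        "window_seconds": window_sec,
        "retry_after_seconds": retry_after,
    })
    return 429, headers, body


status, headers, body = too_many_requests("per_user", 100, 60, seconds_until_reset=12.3)
```

Rounding the delay up rather than down avoids telling a client to retry a fraction of a second before the limit actually resets.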

08 Failure Modes

Failure 01

The thundering herd at retry-time

You return 429 with Retry-After: 60. Thousands of clients all retry exactly 60 seconds later, in unison. The system is overwhelmed at retry-time even though it could have handled the gradual rate.

The fix is jitter: vary the retry timing per client. Either return a randomized Retry-After (each client gets a slightly different value), or instruct clients to add their own jitter (most well-behaved client libraries do this anyway). The pattern is exponential backoff with jitter, the standard for distributed-system retries.
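On the client side, exponential backoff with jitter fits in one function. The sketch below uses the "full jitter" variant, where the delay is drawn uniformly between zero and the exponential ceiling. The default `base` and `cap` values are arbitrary choices for the example.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng: Optional[random.Random] = None) -> float:
    """Delay before retry number `attempt` (0-based), with full jitter."""
    rng = rng or random
    ceiling = min(cap, base * (2 ** attempt))  # exponential growth, capped
    return rng.uniform(0.0, ceiling)


rng = random.Random(42)  # seeded only to make the demo repeatable
delays = [backoff_delay(attempt, rng=rng) for attempt in range(8)]
```

Because every client draws its own random delay, retries spread across the whole interval instead of landing on the same instant.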

Failure 02

Limits set by surface area, not by capacity

You set the rate limit to "100 requests per minute per user" because that's a round number. Then a viral moment doubles your traffic and the system melts down. You raise the limit to 200. Things stabilize. Nobody can articulate what the limit is actually protecting against.

The fix is to derive limits from system capacity, not from convention. If the database can handle 10K writes per second and you expect 1000 concurrent users, each user can write at 10/s sustained. Set the limit there. When you scale capacity, raise the limits. When you reduce capacity, lower them. The limits track reality.
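The capacity arithmetic is simple enough to keep in code next to the limit it justifies. In the sketch below, the `headroom` parameter is an added assumption (reserving a fraction of capacity for spikes), not something the example above specifies.

```python
def per_user_limit(capacity_per_sec: float, expected_users: int,
                   headroom: float = 1.0) -> float:
    """Sustained per-user rate that keeps total load within usable capacity."""
    if expected_users <= 0:
        raise ValueError("expected_users must be positive")
    return capacity_per_sec * headroom / expected_users


limit = per_user_limit(10_000, 1_000)               # the example from the text
with_headroom = per_user_limit(10_000, 1_000, 0.8)  # assumption: keep 20% spare
```

The first call reproduces the 10 writes per second from the example; the second shows how the limit drops to about 8 per second if a fifth of capacity is held in reserve.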

Failure 03

Rate limiter outage breaks the whole API

Your rate limiter depends on Redis. Redis goes down. Every API call fails because the rate limiter can't decide whether to allow it. The whole API is offline because of a problem in the limiter, not in the actual service.

The fix is the fail-open vs fail-closed choice we covered in Section 5. For most user-facing APIs, fail open: a brief window of unlimited usage is better than a complete outage. For billing-critical endpoints, fail closed and accept the visible failure. Either way, decide explicitly so the failure mode isn't surprising.

Failure 04

Per-IP limits hit shared NAT users

You apply rate limits per IP. A bunch of users share a corporate NAT or mobile carrier IP. They all hit the same per-IP counter. Innocent users get rate-limited because of someone else on the same egress.

The fix is to apply per-IP limits only to anonymous traffic and use per-user (or per-API-key) limits for authenticated traffic. The per-IP layer becomes a first defense for anonymous abuse; the per-user layer takes over once the user authenticates. Mixing the two breaks down the moment shared IPs become common (every corporate network, every mobile carrier).

09 How Rate Limiting Interacts With Other Concepts

Rate limiting × Load balancing. The LB is a natural enforcement point because it sees every request before backends do. Rejecting at the LB saves backend capacity. Many managed LBs and edge proxies offer rate limiting, either built in (Cloudflare) or through an attached WAF (AWS WAF on an ALB). The load balancing deep-dive covers the placement implications.
Rate limiting × Caching. Cache hits should not consume rate-limit budget. For that to hold, the cache has to sit in front of the limited path, so cache hits return without ever reaching the limiter. This is one reason why the placement layer matters. Caching covers the layering.
Rate limiting × Message queues. Queues are an alternative to rejection: instead of returning 429, accept the request and process it asynchronously. Useful when the work doesn't need to be immediate. Message queues covers the async pattern.
Rate limiting × Database selection. The Redis-as-rate-limiter pattern is itself a database choice: a high-throughput, low-latency key-value store with atomic counters. This is one of the canonical Redis use cases. Database selection covers when Redis fits.
Rate limiting × Observability. Rate-limit rejection is a metric you must monitor. A spike in 429s tells you something: either real abuse, a buggy client, or limits set too tight. The dedicated observability deep-dive covers how to instrument this.

For more cross-concept interactions, see the concepts library hub.

10 Practice Scenarios

Three scenarios. Read the setup. Decide your approach before reading the walkthrough.

Scenario 01

A SaaS API needs to enforce different rate limits per tier (free, paid, enterprise) across 50 API servers. How do you architect this?

Free tier: 100 requests per minute. Paid: 1000 per minute. Enterprise: custom. Limits must be enforced exactly (no "best effort") because billing depends on it. 50 API servers, plans to scale to 200.

How to think about this

The answer has three layers: where to enforce, which algorithm, and how to coordinate across servers.

Where: API gateway. The gateway already authenticates and knows the user's tier. It's the natural place to apply tier-aware limits. Edge limits (per-IP) sit above this for DDoS, but the tier enforcement happens at the gateway.

Which algorithm: Token bucket. Allows reasonable bursts (which match real user behavior) while keeping the average rate enforced. The bucket size is a tunable: smaller = stricter, larger = more burst-tolerant.

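How to coordinate: Centralized Redis. Each gateway server updates the bucket state for the user's API key in Redis. Doing the refill and the decrement as one atomic step (for example, in a Lua script) gives exact enforcement; a plain INCR with a TTL would give a fixed-window counter rather than a token bucket. Because billing depends on it, fail-closed: if Redis is unavailable, reject requests rather than allowing free unlimited usage.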

Strong answer: "Tier-aware token bucket at the API gateway, enforced through a centralized Redis instance with atomic bucket updates. Fail closed on Redis failure because billing depends on enforcement. Different bucket sizes per tier; enterprise limits configured per customer."

Scenario 02

Your public API is hit by a sudden 10x traffic spike. Most of it appears to be from real users (a viral moment). How do you handle it without locking everyone out?

Normal traffic: 1K req/sec. Current: 10K req/sec. Backend capacity: 5K req/sec sustained. Limits configured at the API gateway, currently rejecting most users with 429.

How to think about this

The reflexive response is to raise limits. That's the wrong move because it exceeds backend capacity; the system would crash entirely instead of degrading.

The right move is graduated response combined with capacity-aware throttling:

1. Cache aggressively. Most viral traffic hits the same few endpoints (the thing that went viral). A 5-minute cache for those endpoints can absorb most of the spike without backend involvement.

2. Differentiate authenticated from anonymous traffic. Raise limits for authenticated users (they're more likely to be real). Keep limits tight for anonymous traffic.

3. Degrade non-essential features. The expensive features (search, recommendations) get tighter limits or temporarily disabled. The core read paths stay available.

4. Add capacity if sustained. If the spike is more than a brief moment, scale up. Rate limiting is a tool to survive bursts, not to substitute for capacity.

Strong answer: "Cache the viral content to absorb most of the spike at the edge. Raise limits for authenticated users. Tighten limits on expensive endpoints. Add capacity if sustained. Rate limiting is shaping the spike, not replacing capacity."

Scenario 03

A team proposes "exact" sliding-window rate limiting for every API endpoint. Should they?

The proposal: store every request's timestamp in Redis. On each new request, count timestamps within the rolling window. If count exceeds limit, reject. Estimated traffic: 50K requests per second per server, 20 servers.

How to think about this

Probably not. The proposal is technically correct (exact sliding window) but operationally expensive at this scale.

The math: 50K req/s × 20 servers = 1M req/s. Each request adds a Redis ZADD (timestamp into a sorted set) and a count of the entries inside the window (ZCOUNT, or ZREMRANGEBYSCORE to trim old entries followed by ZCARD). At 1M req/s, that's at least 2M Redis operations per second per limit being enforced, plus memory for every timestamp still inside the window. Even with Redis clustering, this is a non-trivial operational burden, and the cost grows with each limit.
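The same arithmetic as a short script, so the inputs are easy to change. The traffic figures come from the scenario; the 60-second window is an assumption (the scenario does not state one), and the memory line assumes every request is admitted and recorded.

```python
REQ_PER_SERVER = 50_000           # requests per second per server (from the scenario)
SERVERS = 20
COMMANDS_PER_REQUEST_LOG = 2      # one write plus one count, per limit
COMMANDS_PER_REQUEST_COUNTER = 1  # one atomic counter update, per limit
WINDOW_SEC = 60                   # assumption: window length is not given

total_rps = REQ_PER_SERVER * SERVERS
log_ops_per_sec = total_rps * COMMANDS_PER_REQUEST_LOG
counter_ops_per_sec = total_rps * COMMANDS_PER_REQUEST_COUNTER
timestamps_in_window = total_rps * WINDOW_SEC  # entries held across all keys

print(total_rps, log_ops_per_sec, counter_ops_per_sec, timestamps_in_window)
```

Under these assumptions the exact log needs twice the command rate of a counter-based limiter and holds tens of millions of timestamps at any moment, which is the cost the alternatives avoid.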

Better approaches in priority order:

1. Token bucket. One atomic bucket update per request, so about 1M ops/sec total: half the load of the timestamp log, and within reach of a modest Redis cluster. Allowing bursts up to the bucket size is acceptable for most use cases.

2. Hybrid sliding window. Cloudflare's approach: weighted average of two adjacent fixed windows. Approximates sliding window with two counters instead of N timestamps. Much cheaper, almost as accurate.

3. Reserve exact sliding window for cases that actually need it. Billing endpoints, hard quotas, security-critical paths. Don't use it as the default for every API.

Strong answer: "Exact sliding window is expensive at this scale; we'd reserve it for billing-critical endpoints. Token bucket as the default; hybrid sliding window where we need better accuracy without the per-request timestamp cost. Match the algorithm's cost to the requirement, not to the desire for theoretical correctness everywhere."

11 Rate Limiting FAQ

What HTTP status code should I return?

HTTP 429 Too Many Requests. This is the standard. Include a Retry-After header with either seconds-until-retry or an HTTP date. Optionally include a structured response body explaining which limit was hit (per-user vs per-IP vs per-tier) so client developers can debug. Don't use 503, 403, or 400 for rate limit responses; these have other meanings and confuse clients.

Should rate limits apply to authenticated and unauthenticated users the same?

No. Anonymous users get strict per-IP limits because that's the only signal you have. Authenticated users get per-user limits, usually higher. The two layers compose: an anonymous user is bounded by per-IP; an authenticated user is bounded by per-user (which is usually more permissive). This is also why mixing them up causes the shared-NAT problem from Failure 04.

How do I rate limit by API endpoint cost?

Weighted token bucket. Cheap endpoints consume one token; expensive endpoints consume many (proportional to their cost). The user's bucket holds a budget that drains based on what they actually do, not just request count. GitHub's GraphQL API works this way: a single query might consume one or many points depending on what it accesses.
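A weighted bucket differs from a plain token bucket only in how much each request spends. In the sketch below the endpoint names and their costs are hypothetical; real costs would come from measuring what each endpoint consumes.

```python
# Hypothetical per-endpoint costs; real values would come from measurement.
ENDPOINT_COST = {"GET /profile": 1, "GET /search": 5, "POST /export": 20}


class WeightedBucket:
    def __init__(self, capacity: float, refill_per_sec: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity  # start full
        self.last_refill = now

    def allow(self, endpoint: str, now: float) -> bool:
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last_refill = now
        cost = ENDPOINT_COST.get(endpoint, 1)  # unknown endpoints cost 1
        if self.tokens < cost:
            return False
        self.tokens -= cost
        return True


bucket = WeightedBucket(capacity=25, refill_per_sec=1.0)
results = [
    bucket.allow("POST /export", now=0.0),  # 25 -> 5
    bucket.allow("GET /search", now=0.0),   # 5 -> 0
    bucket.allow("GET /profile", now=0.0),  # rejected, bucket empty
    bucket.allow("GET /profile", now=1.0),  # one token refilled, allowed
]
```

One export and one search exhaust the budget that would otherwise have covered 25 profile reads.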

What's the right limit value?

Derive it from system capacity, not from convention. If your backend handles 10K writes per second and you expect 1000 concurrent users, each user can write at 10 per second sustained. Set the limit there. Round numbers (100 per minute) are easier to communicate but should be backed by capacity math, not pulled from thin air. Revisit limits when you scale capacity or when traffic patterns shift.

Do I need rate limiting at all on internal APIs?

Yes, but for different reasons. External rate limits prevent abuse and enforce business rules. Internal rate limits prevent one service from accidentally taking down another (a deploy bug, an infinite retry loop). Internal limits are usually more permissive and configured by capacity rather than tier. Service-mesh-level limits (Envoy, Linkerd) handle this with minimal application code.

How do I rate limit websockets and long connections?

Differently. Per-connection limits (messages per second on a single websocket) instead of per-request limits. Connection-establishment limits (no more than N new connections per minute per user). Often combined with per-user-across-all-connections limits to prevent one user from holding many connections to bypass per-connection limits. The load balancing deep-dive covers connection-level concerns more broadly.

What about rate limiting for AI/LLM API calls?

Increasingly important. LLM calls cost real money per token; uncontrolled usage is a direct cost runaway. Rate limit by token count, not just call count (a 100K-token call is much more expensive than a 100-token call). Set tier-based daily and monthly limits. Reject requests that would exceed the budget rather than allowing them and hoping it works out. The dedicated AI infrastructure deep-dive covers this in more detail.
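Budgeting by token count rather than call count can be sketched as a per-user daily ledger. The class below is an in-memory illustration with assumed names. It reserves an estimated token count before the call is made, which is one way to "reject rather than hope"; a real system would also reconcile the estimate against actual usage afterwards.

```python
from collections import defaultdict


class DailyTokenBudget:
    def __init__(self, daily_limit: int):
        self.daily_limit = daily_limit
        self.used = defaultdict(int)  # (user, day) -> tokens spent

    def try_spend(self, user: str, day: str, estimated_tokens: int) -> bool:
        """Reserve tokens up front; reject if the request would exceed the budget."""
        key = (user, day)
        if self.used[key] + estimated_tokens > self.daily_limit:
            return False
        self.used[key] += estimated_tokens
        return True


budget = DailyTokenBudget(daily_limit=100_000)
small = budget.try_spend("user-1", "2026-01-01", 100)         # allowed
large = budget.try_spend("user-1", "2026-01-01", 100_000)     # would exceed: rejected
next_day = budget.try_spend("user-1", "2026-01-02", 100_000)  # fresh budget
```

A call-count limit would treat the 100-token and 100K-token requests identically; the token ledger charges each for what it costs.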

Should I build rate limiting myself or use a managed service?

Managed for the easy cases. Cloudflare, AWS WAF, and API gateway services include rate limiting. Use them for edge-level and per-IP/per-API-key enforcement. Build your own only when you need application-specific semantics: per-resource limits, per-feature limits, dynamic tier adjustment. The build-vs-buy line is roughly: gateway-level rate limiting is bought; business-rule rate limiting is built.
