How do you enforce tenant‑level quotas without hot partitions?
Tenant-level quotas ensure that one customer cannot consume all of the shared capacity in a multi-tenant platform. The tricky part is enforcing them across a fleet of stateless services without funneling every request through the same counter, which becomes a single hot key. In interviews and in production, the winning pattern combines smart partitioning with local enforcement, so no single partition gets hammered while limits stay accurate.
Why It Matters
Quotas are guardrails for fairness, cost control, and abuse prevention. If you track usage with a single per-tenant key in a database or cache, a popular tenant can overload one partition and throttle everyone behind it. That hurts latency, violates service-level objectives, and inflates infrastructure spend. In a system design interview, showing that you can enforce quotas without hot partitions signals that you understand data modeling, partitioning, and admission control at scale.
How It Works Step by Step
This section gives a practical blueprint that balances accuracy and scalability.
- Define quota semantics: Decide what you limit. Examples include requests per second, writes per minute, concurrent jobs, or compute credits. Choose a time-window model such as fixed window, sliding window, or token bucket. Token bucket is often best for smooth burst handling.
- Create a time-window key: For each tenant and each window, derive a key such as tenant ID plus window ID, where the window ID is the epoch second or minute. This keeps counters small and self-expiring via a time to live.
- Stripe the counter per tenant: Instead of one key per tenant per window, create S stripes for that tenant and window. Each request picks one stripe using a hash of the request ID, server ID, or a random nonce, which spreads load across many partitions. A typical S is in the range of 32 to 128 for busy tenants.
- Give each stripe a local sub-quota: Set the sub-quota to the global quota divided by the number of stripes, plus a small slack. Each stripe admits requests while its local counter is under the sub-quota, which avoids cross-stripe reads on the hot path. You slightly under-admit in the worst case, a deliberate trade-off to avoid hot keys.
- Use a conditional increment on write: For the chosen stripe, attempt an atomic increment with a guard condition of count less than or equal to the sub-quota. If the condition fails, reject the request locally. No global read or write is required, so there is no single slammed key. A minimal sketch of this guard appears just after this blueprint.
- Add local token buckets at the edge: To cut write amplification, place a token bucket inside each gateway or service process. Refill it by leasing tokens from the control plane in batches, where each batch corresponds to a small slice of the tenant budget, for example one or two percent. Requests consume local tokens with zero network round trips. When tokens run low, the process fetches another lease using the same striped keys. The lease size adapts to observed traffic, which prevents any single stripe from getting hot.
- Reconcile occasionally: A background task sums the stripes for analytics or alerting. This is off the critical path; it can sample a subset of stripes or read all stripes at a low frequency. Because counters are window-scoped with a time to live, data rolls off by itself.
- Handle bursts with soft and hard guards: The soft guard is the local token bucket, which allows small bursts as long as you hold leased tokens. The hard guard is the stripe conditional increment, which prevents the global budget from being exceeded across the fleet. Together they give fast-path performance and strong control over the budget.
- Make it regional and failure aware: Enforce quotas per region with the same design. Optionally apply a global super budget that is lower than the sum of regional budgets to absorb clock skew, retries, and failover traffic. On a cache or database outage, fail open for low-risk tenants and fail closed for risky tenants, guided by policy.
- Tune with feedback loops: Watch per-tenant admission rates, rejection codes, and tail latency. Increase the stripe count for heavy tenants. Reduce the lease size for spiky tenants. Cap per-process outstanding leases to prevent token hoarding.
This blueprint keeps per request work local and cheap while avoiding any single hot partition. You get predictable latency with bounded global coordination.
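To make the window keys, striping, sub-quotas, and conditional increment concrete, here is a minimal Python sketch. It is an in-memory stand-in, not a production implementation: the StripedWindowCounter class, its key layout, and the single process lock are illustrative assumptions, and in a real system each (tenant, window, stripe) key would live in a partitioned cache or database that supports an atomic check-and-increment.

```python
import hashlib
import threading
import time
from collections import defaultdict


class StripedWindowCounter:
    """In-memory stand-in for a partitioned store. Each (tenant, window, stripe)
    key gets its own counter, so no single key absorbs a tenant's whole load."""

    def __init__(self, global_quota: int, stripes: int = 64,
                 window_seconds: int = 60, slack: int = 0):
        self.global_quota = global_quota
        self.stripes = stripes
        self.window_seconds = window_seconds
        # Each stripe's local share of the budget; a small slack can be added
        # to offset the under-admission caused by skew across stripes.
        self.sub_quota = global_quota // stripes + slack
        self._counts = defaultdict(int)
        self._lock = threading.Lock()  # stands in for per-key atomicity in a real store

    def _key(self, tenant_id: str, request_id: str) -> tuple:
        window_id = int(time.time()) // self.window_seconds
        # Hash the request ID to pick a stripe, spreading one tenant's traffic
        # across many partitions instead of one hot key.
        digest = hashlib.blake2b(request_id.encode(), digest_size=4).digest()
        stripe = int.from_bytes(digest, "big") % self.stripes
        return (tenant_id, window_id, stripe)

    def try_admit(self, tenant_id: str, request_id: str) -> bool:
        """Conditional increment: admit only while this stripe is under its sub-quota."""
        key = self._key(tenant_id, request_id)
        with self._lock:
            if self._counts[key] >= self.sub_quota:
                return False  # local reject; no cross-stripe read on the hot path
            self._counts[key] += 1
            return True

    def reconcile(self, tenant_id: str) -> int:
        """Off the hot path: sum the current window's stripes for analytics or alerting."""
        window_id = int(time.time()) // self.window_seconds
        return sum(count for (tid, wid, _), count in self._counts.items()
                   if tid == tenant_id and wid == window_id)


if __name__ == "__main__":
    limiter = StripedWindowCounter(global_quota=1000, stripes=64)
    admitted = sum(limiter.try_admit("tenant-42", f"req-{i}") for i in range(1200))
    # Admitted stays at or below the budget; a little under-admission is the trade-off.
    print("admitted:", admitted, "reconciled usage:", limiter.reconcile("tenant-42"))
```

The key shape and the guard condition carry over directly to a conditional write in a partitioned key-value store; only the storage call changes.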
Real World Example
Suppose you run an API that allows 1,000 requests per minute per tenant across 200 stateless gateways. A naive single counter per tenant per minute inside a key-value store sends every update to one partition and collapses under peak load. With the striped approach, each request lands on one of 64 stripes inside that minute. Gateways also run local token buckets with small leases of 20 tokens each. Under traffic, gateways admit locally and periodically top up, while stripe counters ensure the total budget is never exceeded. Even during a sudden spike, load spreads across many keys and many partitions, and no single counter becomes a choke point.
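Here is a small Python sketch of that setup, using the same numbers: a 1,000-per-minute tenant budget, 200 gateways, and leases of 20 tokens. The LeaseServer and GatewayBucket classes are assumptions made for illustration; in production the lease calls would hit the control plane and the striped counters rather than a shared in-process object.

```python
import threading
import time


class LeaseServer:
    """Illustrative control plane: hands out small token leases from a tenant's
    per-minute budget. A real system would back this with the striped counters."""

    def __init__(self, budget_per_minute: int):
        self.budget_per_minute = budget_per_minute
        self._remaining = budget_per_minute
        self._window = int(time.time()) // 60
        self._lock = threading.Lock()

    def lease(self, requested: int) -> int:
        with self._lock:
            window = int(time.time()) // 60
            if window != self._window:          # new minute, budget resets
                self._window = window
                self._remaining = self.budget_per_minute
            granted = min(requested, self._remaining)
            self._remaining -= granted
            return granted


class GatewayBucket:
    """Local token bucket inside one gateway process. Requests spend local tokens
    with no network hop; a new lease is fetched only when tokens run out."""

    def __init__(self, server: LeaseServer, lease_size: int = 20):
        self.server = server
        self.lease_size = lease_size
        self.tokens = 0

    def allow(self) -> bool:
        if self.tokens == 0:
            # Top up from the control plane in a small batch (20 tokens,
            # about 2 percent of a 1,000-per-minute budget).
            self.tokens = self.server.lease(self.lease_size)
        if self.tokens == 0:
            return False                        # budget exhausted fleet-wide
        self.tokens -= 1
        return True


if __name__ == "__main__":
    control_plane = LeaseServer(budget_per_minute=1000)
    gateways = [GatewayBucket(control_plane, lease_size=20) for _ in range(200)]
    admitted = sum(gateways[i % 200].allow() for i in range(4000))
    print("admitted this minute:", admitted)    # at most 1,000 for the whole fleet
```

Note how every admission decision is local, yet the fleet as a whole cannot overspend the minute's budget.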
Common Pitfalls and Trade-offs
- Under-admitting due to fixed sub-quotas. If traffic is skewed toward a few stripes, you may reject a bit early. Mitigate with power-of-two choices, where each request hashes to two stripes and picks the less loaded one (see the sketch after this list).
- Global strictness vs. latency. Perfect accuracy would require reading all stripes on every request, which defeats the purpose. The lease-plus-stripe guard is a good balance that is accurate enough for most quotas.
- Clock skew and window edges. Requests near window boundaries can double count or get rejected due to drift. Use monotonic server time, a short grace overlap between windows, and a small per-window carryover bucket.
- Token leakage during failures. If a process crashes with unused leased tokens, you can double spend. Add lease expiry and idempotent reconciliation so unused tokens return to the pool.
- Uneven tenant shapes. Some tenants are quiet most of the day and burst at the top of the hour. Track each tenant's traffic shape and adapt stripe count and lease size per tenant rather than using one size for all.
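The power-of-two-choices mitigation from the first pitfall can be sketched in a few lines. The stripe_counts argument is a hypothetical local view of per-stripe counts, introduced only for illustration; it can be slightly stale and still steer load away from the fuller stripe.

```python
import hashlib


def pick_stripe(request_id: str, num_stripes: int, stripe_counts: dict) -> int:
    """Power-of-two choices: hash the request to two candidate stripes and
    send it to whichever currently has the smaller count. This smooths out
    skew without reading every stripe."""
    digest = hashlib.blake2b(request_id.encode(), digest_size=8).digest()
    a = int.from_bytes(digest[:4], "big") % num_stripes
    b = int.from_bytes(digest[4:], "big") % num_stripes
    # stripe_counts may lag slightly behind the store; it only needs to be
    # good enough to bias traffic toward the emptier candidate.
    return a if stripe_counts.get(a, 0) <= stripe_counts.get(b, 0) else b


# Example: with 64 stripes, the request is steered to the emptier of its two candidates.
print(pick_stripe("req-7", 64, {3: 17, 41: 5}))
```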
Interview Tip
Explain the naive design first, a single counter per tenant, then say why it creates a hot partition. Propose stripes per tenant per window, local token buckets with leases, and periodic reconciliation. Call out the trade-off between strict accuracy and low latency, and show how you tune stripe count and lease size. Close with one sentence on failure policy and regional budgets.
Key Takeaways
- Avoid a single counter per tenant per window. Stripe it.
- Enforce locally with token buckets and lease small batches from a control plane.
- Use conditional increments per stripe to keep the global budget under control.
- Reconcile out of band and tune stripe count and lease size per tenant.
- Design for regional isolation, small grace windows, and a clear failure policy.
Comparison Table
| Approach | How it works | Pros | Cons | Best fit |
|---|---|---|---|---|
| Single counter per tenant | One key holds the window count | Simple to reason about | Hot partition, high latency spikes, write contention | Tiny traffic or internal tools |
| Striped counters | Many keys per tenant per window with local sub-quotas | Spreads load, removes hot keys | Slight under-admission in worst case, background sum needed | Most production quota systems |
| Edge token bucket with leases | Gateways admit from local tokens and fetch small batches | Zero hot-path reads, steady latency | Requires lease expiry and reconciliation | High-throughput APIs |
| Centralized limiter service | All requests call a shared limiter microservice | Strong consistency, easier auditing | Service can become a bottleneck, extra network hop | Strict compliance or low-volume systems |
| Sketch-based approximation | Uses count-min sketch to detect heavy tenants and escalate | Very memory-efficient for many tenants | Approximate, requires a second stage for strict limits | Platforms with large tenant counts |
FAQs
Q1. What is a hot partition?
A hot partition is a shard in a database or cache that receives a disproportionate share of traffic. For quotas this happens when all requests for a tenant update the same key.
Q2. How does token bucket differ from sliding window for quotas?
A token bucket allows controlled bursts by letting tokens accumulate over time, while a sliding window enforces a strict rate over the last N seconds. For user-facing APIs that need smooth behavior, a token bucket at the edge plus striped counters for budget control is a strong combination.
Q3. How many stripes should I use per tenant?
Start with 32 for moderate tenants and up to 128 for very large tenants, and use telemetry to raise or lower the count per tenant. More stripes reduce contention but increase background reconciliation cost.
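If it helps to make that guidance concrete, the heuristic below shows one way to pick a starting point from observed peak traffic. The thresholds are illustrative assumptions, not recommended values; tune them from your own telemetry.

```python
def choose_stripe_count(peak_requests_per_minute: int) -> int:
    """Illustrative sizing heuristic: start at 32 stripes, grow toward 128
    for very large tenants, and stick to powers of two for cheap hashing.
    The cutover points below are assumptions, not fixed rules."""
    if peak_requests_per_minute < 10_000:
        return 32
    if peak_requests_per_minute < 100_000:
        return 64
    return 128


print(choose_stripe_count(2_500))    # 32
print(choose_stripe_count(250_000))  # 128
```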
Q4. How do I enforce a global limit across many regions?
Give each region its own budget and keep a global super budget below the total of all regions. During failover, increase the affected region's budget temporarily. Use small leases and quick expiry to prevent double spending.
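A rough sketch of that budget relationship, with purely illustrative numbers and a hypothetical within_budgets helper. The regional check runs on the hot path, while the global check would compare against spend that background reconciliation has already aggregated, never a synchronous cross-region read.

```python
# Per-minute regional budgets for one tenant (illustrative numbers only).
REGIONAL_BUDGETS = {"us-east": 600, "eu-west": 400, "ap-south": 300}

# The global super budget sits below the sum of the regional budgets (1,300 here),
# leaving headroom in each region for retries, clock skew, and failover traffic.
GLOBAL_SUPER_BUDGET = 1_000


def within_budgets(region: str, regional_spend: dict) -> bool:
    """The regional check is enforced locally; the global check uses aggregated,
    slightly delayed spend produced by the background reconciliation."""
    regional_ok = regional_spend.get(region, 0) < REGIONAL_BUDGETS[region]
    global_ok = sum(regional_spend.values()) < GLOBAL_SUPER_BUDGET
    return regional_ok and global_ok


# Every region is under its own budget, but the aggregate has hit the global cap.
print(within_budgets("us-east", {"us-east": 550, "eu-west": 390, "ap-south": 100}))  # False
```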
Q5. How do I handle sudden bursts without false rejects?
Use a local token bucket with an adaptive lease size. During a burst, the process burns local tokens and fetches another small lease. The stripe sub-quota guard prevents the global budget from being exceeded even when many processes burst at once.
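One possible shape for adaptive lease sizing is sketched below. The two-second headroom and the two percent cap are assumptions chosen for illustration; the point is that the lease tracks observed traffic while staying a small slice of the remaining budget.

```python
def next_lease_size(recent_rate_per_sec: float, tokens_left_in_window: int,
                    min_lease: int = 5, max_lease: int = 50) -> int:
    """Illustrative adaptive lease sizing: lease roughly a couple of seconds'
    worth of observed traffic, but never more than a small slice of what remains
    in the tenant's window, so one process cannot hoard the budget."""
    wanted = int(recent_rate_per_sec * 2)        # about 2 seconds of headroom
    cap = max(1, tokens_left_in_window // 50)    # at most ~2 percent of what's left
    return max(min_lease, min(wanted, cap, max_lease))


print(next_lease_size(recent_rate_per_sec=30, tokens_left_in_window=800))  # 16
```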
Q6. Can I change tenant quotas dynamically?
Yes. Store the current quota in the control plane and push new values with a version. New leases use the new quota, and stripes for the next window pick up the updated sub-quotas automatically.
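A minimal sketch of the versioned quota record, assuming a simple last-writer-wins rule; the TenantQuota type and apply_update helper are hypothetical names used only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantQuota:
    """Illustrative control-plane record: the quota value travels with a version
    so gateways and stripe writers can tell stale settings from fresh ones."""
    tenant_id: str
    requests_per_minute: int
    version: int


def apply_update(current: TenantQuota, incoming: TenantQuota) -> TenantQuota:
    # Last-writer-wins by version: new leases and next-window sub-quotas are
    # derived from whichever record carries the higher version.
    return incoming if incoming.version > current.version else current


current = TenantQuota("tenant-42", requests_per_minute=1000, version=3)
pushed = TenantQuota("tenant-42", requests_per_minute=2000, version=4)
print(apply_update(current, pushed).requests_per_minute)  # 2000
```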
Further Learning
Level up your quota and admission control designs with the fundamentals of API rate limiting, partitioning, and caches in the course Grokking System Design Fundamentals. For a deeper blueprint driven approach to building scalable architecture, enroll in Grokking Scalable Systems for Interviews. If you want an interview ready path that ties these techniques into full system design answers, explore Grokking the System Design Interview.