How to Build HyperLogLog-Based Unique Counts at Scale?
HyperLogLog is a probabilistic data structure that gives very accurate estimates of the number of unique items in a stream while using tiny amounts of memory. Think of it as a smart sketch that compresses the essence of a huge set into a few kilobytes yet still answers how many distinct users or keys you have. In large services, exact counting with hash sets is often too slow and too memory-hungry. HyperLogLog keeps your counts fast and cheap, with a small error that you can control.
Why It Matters
Unique counting problems show up everywhere in scalable architecture and distributed systems. You measure daily active users, unique viewers of a show, distinct search queries, unique ad impressions, or the size of a join key before you build an index. In a system design interview, you will be asked to estimate resources, reason about accuracy, and propose merge friendly data structures. HyperLogLog shines because it supports constant time updates, near free union across shards, and predictable memory. It becomes the backbone for rollups in stream processors, feature analytics, and realtime dashboards where exact counts are overkill but consistency and repeatability still matter.
How It Works Step by Step
- Pick a strong 64-bit hash for the element id. Hashes should be uniform; common choices are MurmurHash or xxHash.
- Take the first p bits of the hash as the register index. That gives m = 2^p registers. Typical p values run from 12 to 16, which gives 4,096 to 65,536 registers.
- For the remaining bits, compute rho: one plus the count of leading zeros. This is a geometric signal of how rare the hash was.
- Update register i to the maximum of its current value and rho. Each register holds a small integer, often six bits in HLL++.
- To estimate cardinality, compute a harmonic mean over the registers with a bias-correction constant alpha: the estimate is proportional to m^2 divided by the sum over registers of 2^(-register value) (see the code sketch after this list).
- Apply small-range correction when the estimate is low relative to m. At low counts, linear counting over the number of zero registers is more accurate.
- Apply large-range correction when the estimate approaches the hash-space limit. This guards against overestimation near saturation.
- Merging is trivial: for each register, take the maximum across sketches. Because update and merge commute, you can aggregate in any order across shards or time buckets.
- Memory is fixed by m and the bits per register. With m = 16384 and six-bit registers you use roughly 12 KB, with a typical relative error around 0.8%, since the error is about 1.04 / √m.
- HLL++ adds a sparse mode that stores hashed items directly while the set is tiny, then switches to dense registers once a threshold is crossed, giving you the best of both worlds for small and large counts.
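To make these steps concrete, here is a minimal Python sketch of a dense HLL with add, estimate (including the small-range linear-counting correction), and register-wise-max merge. It uses SHA-256 truncated to 64 bits as a stand-in for Murmur or xxHash, and it omits large-range correction and the HLL++ refinements, so treat it as an illustration rather than a production implementation.

```python
import hashlib
import math

class HyperLogLog:
    """Minimal dense HLL sketch: add, estimate, merge."""

    def __init__(self, p: int = 14):
        self.p = p
        self.m = 1 << p                       # m = 2^p registers
        self.registers = bytearray(self.m)
        # bias-correction constant alpha (valid for m >= 128)
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def _hash(self, item: str) -> int:
        # 64-bit hash from SHA-256, standing in for Murmur/xxHash
        digest = hashlib.sha256(item.encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, item: str) -> None:
        h = self._hash(item)
        idx = h >> (64 - self.p)              # first p bits pick the register
        rest = h & ((1 << (64 - self.p)) - 1)
        # rho = 1 + leading zeros in the remaining 64 - p bits
        rho = (64 - self.p) - rest.bit_length() + 1
        if rho > self.registers[idx]:
            self.registers[idx] = rho

    def estimate(self) -> float:
        inv_sum = sum(2.0 ** -r for r in self.registers)
        e = self.alpha * self.m * self.m / inv_sum
        # small-range correction: fall back to linear counting
        zeros = self.registers.count(0)
        if e <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)
        return e

    def merge(self, other: "HyperLogLog") -> None:
        # union = register-wise max; both sketches must share the same p
        for i, r in enumerate(other.registers):
            if r > self.registers[i]:
                self.registers[i] = r

# usage: count ~100k synthetic ids; expect an estimate within about 1%
hll = HyperLogLog(p=14)
for i in range(100_000):
    hll.add(f"user-{i}")
print(round(hll.estimate()))
```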
Real World Example
Take Instagram-style story analytics. The product team wants unique viewers per creator per day, with filters by region and device. Events flow through a message bus. A stream job reads view events, deduplicates on a composite key of viewer id plus story id plus day, and updates one HyperLogLog per dimension tuple. During the day the job keeps all sketches in an in-memory store like Redis or a state store like RocksDB. Every five minutes it writes the register blobs to a column store or object storage partitioned by day. Dashboards query a service that merges the needed sketches on the fly, for example combining all story ids for a creator or unioning all regions for a global view. Backfills for missing hours are done by replaying logs and re-merging. Because merge is a register-wise max, you never lose correctness by reprocessing. The sketch below shows the Redis flavor of this pattern.
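As a concrete flavor of the serving path, Redis ships HyperLogLog natively through the PFADD, PFMERGE, and PFCOUNT commands. The key layout below (one sketch per creator, region, and day) is a hypothetical naming scheme for this example, not a prescribed schema.

```python
import redis

r = redis.Redis()

def record_view(creator_id: str, region: str, day: str, viewer_id: str) -> None:
    # one HLL sketch per (creator, region, day) tuple
    r.pfadd(f"uv:{creator_id}:{region}:{day}", viewer_id)

def global_unique_viewers(creator_id: str, regions: list[str], day: str) -> int:
    # PFMERGE is a register-wise max, so the union is cheap and order-free
    keys = [f"uv:{creator_id}:{region}:{day}" for region in regions]
    tmp = f"tmp:uv:{creator_id}:{day}"
    r.pfmerge(tmp, *keys)
    return r.pfcount(tmp)
```

Redis's sketches use a fixed 12 KB dense representation with an internal sparse encoding for small sets, which matches the memory figures discussed above.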
Common Pitfalls and Trade-offs
Bad hashing destroys accuracy. If the hash is not uniform, leading zeros are biased and the estimate drifts. Use a strong 64-bit hash and salt it if keys have structure, as in the snippet below.
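A minimal sketch of salting, assuming SHA-256 as the base hash; the salt constant is illustrative:

```python
import hashlib

SALT = b"metric:dau:v1"  # hypothetical per-metric salt

def hash64(key: str) -> int:
    # the salt prefix decorrelates structured keys such as sequential ids
    digest = hashlib.sha256(SALT + key.encode()).digest()
    return int.from_bytes(digest[:8], "big")
```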
Wrong p value. Too small an m gives high error; too large an m makes each counter heavy and can blow up memory across high-cardinality dimensions. Start with p = 14 as a balanced default and make it configurable per metric.
Using HLL for intersections. HyperLogLog gives you easy unions; intersections and set differences are not directly supported. If you need overlap rates, use Theta or KMV sketches, or MinHash for Jaccard estimation; see the sketch below.
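For overlap estimation, a minimal MinHash sketch looks like the following; it assumes salted SHA-256 as the hash family and k = 128 signature slots, where the Jaccard error shrinks roughly as 1/√k:

```python
import hashlib

def minhash_signature(items: set[str], k: int = 128) -> list[int]:
    """k-permutation MinHash via salted hashes (illustrative construction)."""
    sig = []
    for seed in range(k):
        salt = str(seed).encode()
        sig.append(min(
            int.from_bytes(hashlib.sha256(salt + x.encode()).digest()[:8], "big")
            for x in items
        ))
    return sig

def jaccard_estimate(sig_a: list[int], sig_b: list[int]) -> float:
    # the fraction of matching minimums estimates |A ∩ B| / |A ∪ B|
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Combined with a union estimate (for example from an HLL), the Jaccard value gives an approximate intersection: |A ∩ B| ≈ J × |A ∪ B|.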
Tiny counts without sparse mode. Plain dense HLL is biased for very small sets. Use the HLL++ sparse representation, or keep exact sets or bitmaps until a threshold is crossed.
Deletion and sliding windows. HLL does not support deletion. For sliding windows, keep time-partitioned sketches (for example, one per minute) and union the last N buckets, as sketched below. For heavy churn, consider exponential-decay windows with weighted unions.
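A minimal sketch of the time-bucketed approach, assuming dense HLL registers stored as bytearrays keyed by minute; expiry is simply dropping old keys:

```python
def union_registers(sketches: list[bytearray]) -> bytearray:
    # union of dense HLL sketches is an element-wise max over registers
    merged = bytearray(len(sketches[0]))
    for regs in sketches:
        for i, r in enumerate(regs):
            if r > merged[i]:
                merged[i] = r
    return merged

buckets: dict[int, bytearray] = {}  # hypothetical store: minute -> registers

def window_sketch(now_minute: int, window: int = 60) -> bytearray:
    # union the buckets for the last `window` minutes; gaps are skipped
    recent = [buckets[t] for t in range(now_minute - window + 1, now_minute + 1)
              if t in buckets]
    return union_registers(recent) if recent else bytearray()
```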
Key explosion. Attaching too many dimensions produces a sketch per key, and memory climbs fast. Use dimension caps, bucketing, or pre-approved tag sets, and monitor top-N key growth.
Opaque storage. Storing HLL state in a custom binary format without versioning locks you in. Keep a versioned register format and a metadata schema so you can upgrade p or the algorithm later.
Interview Tip
A common prompt is to design a service that reports unique daily active users across regions in near real time. A strong answer covers p selection and the error target, the sparse-to-dense transition, per-bucket storage for sliding windows, and why union across shards is a register-wise max. Bonus points if you explain why intersections need a different sketch and how you would validate error against exact counts on a held-out sample.
Key Takeaways
- HyperLogLog compresses huge sets into small, fixed memory with a predictable relative error of roughly 1.04 / √m.
- Merge is a register-wise max, which makes union across shards and time trivial.
- Use HLL++ to get sparse storage at small sizes plus bias correction.
- Choose p based on the target error and the cost per counter, and keep it configurable.
- Do not use HLL for intersections or deletions. Use time partitioning for sliding windows and other sketches for overlaps.
Comparison Table
| Method | Memory per Counter | Relative Error | Union Support | Intersection Support | Good For | Not Good For |
|---|---|---|---|---|---|---|
| HyperLogLog | KB-scale fixed | ≈1% with m ≈16k | Yes (register max) | No direct | Unique users, events, or keys at scale | Exact counts, deletes, small sets without sparse mode |
| HyperLogLog++ | Bytes to KB (sparse) | Same or better with bias fixes | Yes | No direct | Mixed small and large cardinalities | Intersections, deletions |
| Exact Hash Set | Grows with data size | 0% | Costly (set union) | Yes | Low-volume, accuracy-critical analytics | High-volume, real-time analytics |
| Bitmap | Max ID ÷ 8 bytes | 0% | Yes (bitwise OR) | Yes (bitwise AND) | Dense small ID spaces (bounded ranges) | Sparse or unbounded ID domains |
| KMV / Theta Sketch | Linear in k | ≈1 / √k | Yes | Yes (approximate) | Intersections, unions, overlap estimations | Very small memory per key when only union is needed |
| Linear Counting | Bits proportional to expected max count | Good for small cardinalities | Yes (bitwise OR) | Approximate (bitwise AND) | Small domains (short URL codes, small sets) | Large cardinalities |
FAQs
Q1. What error can I expect from HyperLogLog?
With m registers the typical relative error is about 1.04 / √m. For m = 16384 the error is near 0.8%.
Q2. How do I choose p and m for my counters?
Pick p so that m = 2^p fits your memory budget and error target. Many teams start with p = 14, which balances roughly 12 KB of memory against sub-1% error; the helper below tabulates nearby choices.
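A quick back-of-the-envelope helper, assuming six-bit dense registers:

```python
import math

# error ≈ 1.04 / sqrt(2^p); memory ≈ 2^p registers × 6 bits
for p in range(12, 17):
    m = 1 << p
    error = 1.04 / math.sqrt(m)
    memory_kb = m * 6 / 8 / 1024
    print(f"p={p}  m={m:>6}  error≈{error:.2%}  memory≈{memory_kb:.0f} KB")
```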
Q3. Can I get exact counts later from HyperLogLog?
No. The sketch is lossy. If you must audit exact values keep a sampled exact counter or raw logs for a small slice to validate estimates.
Q4. How do I handle sliding windows with HyperLogLog?
Store one sketch per small time bucket and union the buckets that fall in the window. For example keep one per minute and union the last sixty for an hour. Drop old buckets to expire data.
Q5. Can I query intersections with HyperLogLog?
Not directly. Use Theta or KMV sketches or MinHash to estimate overlaps. If you only need union stick with HyperLogLog.
Q6. What is the difference between HyperLogLog and HyperLogLog++?
HLL++ adds a sparse mode and improved bias correction. It uses very little memory when the set is small and switches to dense registers as it grows; the sketch below shows the shape of that transition.
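A minimal sketch of the sparse-to-dense handoff, assuming raw 64-bit hashes are kept exactly while the set is small; real HLL++ stores compressed (index, rho) pairs and uses tuned thresholds, so this is only the shape of the idea:

```python
class SparseToDenseHLL:
    def __init__(self, p: int = 14, threshold: int = 1024):
        self.p = p
        self.threshold = threshold
        self.sparse = set()               # exact hashes while the set is tiny
        self.registers = bytearray(1 << p)

    def _apply(self, h: int) -> None:
        idx = h >> (64 - self.p)          # first p bits pick the register
        rest = h & ((1 << (64 - self.p)) - 1)
        rho = (64 - self.p) - rest.bit_length() + 1
        if rho > self.registers[idx]:
            self.registers[idx] = rho

    def add_hash(self, h: int) -> None:
        if self.sparse is not None:
            self.sparse.add(h)            # small counts stay exact (zero bias)
            if len(self.sparse) > self.threshold:
                for stored in self.sparse:
                    self._apply(stored)   # replay into dense registers
                self.sparse = None        # dense mode from here on
        else:
            self._apply(h)
```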
Further Learning
If you want a solid foundation in cardinality sketches and when to use them, explore the primer in Grokking System Design Fundamentals.
For deeper treatment of streaming analytics, partitioning strategies, and accuracy trade-offs in large services, enroll in Grokking Scalable Systems for Interviews.
If you are preparing end-to-end for a system design interview and want guided practice with sketch design and data pipeline questions, check Grokking the System Design Interview.