
A Globally Unique Identifier (GUID) Generation System is a service that produces unique identifiers at scale for use across distributed systems. Its purpose is to generate IDs that never collide (no duplicates) even when issued from multiple data centers or services concurrently. These IDs can serve as database primary keys, event/message IDs, object handles, or any context where a distinct identifier is needed for each entity. The system supports two ID formats: opaque IDs and ordered IDs. Opaque IDs are essentially random or pseudo-random identifiers (like UUIDs) with no intrinsic ordering or meaning, useful for security or purely unique labeling. Ordered IDs are time-based or monotonic IDs that encode time or sequence information so that newer IDs are numerically or lexicographically larger than older ones, which is useful for sorting and roughly reflects creation order. “Global uniqueness” means any ID generated on any node or at any time will be different from all others, and web-scale implies the system can handle very high request rates (hundreds of thousands per second) and operate across many regions and services. Key terms include:
- UUIDs (Universally Unique Identifiers): A UUID (or GUID in Microsoft terminology) is a 128-bit value used to uniquely label objects in computer systems. Common versions (like UUIDv4) are generated via random or pseudo-random numbers. They are opaque identifiers – by design they carry no inherent timing or ordering information, just a statistically unique random value. With 2^128 possibilities, collisions are practically negligible (e.g. even generating 1 billion UUIDs per second for 100 years yields only ~50% chance of one collision). UUIDs require no coordination between generators, but are not naturally sorted and are relatively long.
- Snowflake ID: A 64-bit ordered ID format introduced by Twitter, composed of time, machine, and sequence bits. Many distributed ID schemes (e.g. Instagram’s ID, Discord’s IDs) are variants of this pattern.
- ULID/KSUID: Newer “lexicographically sortable” GUID formats that combine a timestamp with random bits. These produce ordered IDs (128-bit for ULID, 160-bit for KSUID) without requiring coordination.
- Monotonic: In this context, an ID sequence that, within a given scope, only increases (e.g. IDs generated later have higher value than earlier ones). Fully global monotonic ordering is hard to guarantee without a single source of time, but each generator or each timeline will produce monotonic sequences.
- Database Auto-Increment Keys: Many databases can auto-increment a numeric primary key. This yields unique, sequential IDs (easy to sort by creation order within that table). However, in a distributed system with multiple databases or shards, coordinating a single sequence becomes a challenge. Traditional auto-increment is essentially a single-node solution – using it globally would create a single point of failure and a bottleneck for high throughput, or else separate sequences per shard would produce duplicates across shards (unless partitioned ranges are pre-assigned to each shard).
Use Cases: This system would be used whenever multiple distributed components need to create unique references without a central bottleneck. For example, web servers generating database keys for new records, microservices tagging events with IDs for tracing, or IoT devices labeling sensor readings – all without risk of collision. By providing both opaque IDs (for unpredictable, secure identifiers) and time-ordered IDs (for sorted, timeline-friendly keys), the system covers a wide range of needs.
Functional Requirements:
- Generate Unique IDs on Demand: The system must provide an interface (e.g. a service API) that clients can call to obtain a new unique identifier at any time. Every ID returned must be globally unique – no two calls should ever produce the same ID, even if they come from different services or geographic regions.
- Support Time-Sortable and Opaque IDs: Clients should be able to request two formats of IDs: (1) a time-sortable ID (an ID which, when compared to others, reflects creation-time order), and (2) an opaque ID (an ID which carries no obvious relation to creation time or sequence). This could be exposed via two API endpoints or a parameter (for example, `GET /generate?type=time` vs `type=opaque`). Both types must still be globally unique across the entire system.
- Simple Client API: Provide easy-to-use APIs for services to request IDs – likely a REST or gRPC endpoint that returns a new ID (or even a batch of IDs) with low overhead. For instance, `POST /ids` could return a JSON body containing a new ID. Supporting batch requests (e.g. asking for 100 IDs in one call) could be useful to reduce call overhead for clients that need many IDs (a hedged client sketch follows this list).
- No Offline Client ID Generation: All ID generation will occur in the centralized service (clients won’t generate their own IDs offline). This simplifies the design – we don’t need to accommodate merging offline-generated IDs or resolving conflicts from offline nodes. Clients must be online and able to reach the service to get an ID.
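To make the client API concrete, the sketch below shows a service requesting a batch of time-sortable IDs over HTTP. The endpoint path, query parameters, and response shape are illustrative assumptions, not a final API contract:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class IdClientExample {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();

        // Hypothetical batch call: 100 time-sortable IDs in one request.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://id.example.com/ids?type=time&count=100"))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();

        HttpResponse<String> response =
                http.send(request, HttpResponse.BodyHandlers.ofString());

        // Assumed response body shape:
        // {"ids": ["1541757210912000000", "1541757210912000001", ...]}
        System.out.println(response.body());
    }
}
```

Returning the IDs as strings in the response body sidesteps precision issues in clients (such as JavaScript) that cannot represent 64-bit integers exactly.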
Non-Functional Requirements:
- High Throughput: The system should handle hundreds of thousands of ID requests per second. We target at least 100,000 IDs/sec sustained, with ability to scale beyond (e.g. peak bursts of 200k–500k/sec) by horizontal scaling. Each individual generator node should handle a large share of this (tens of thousands per second each), and we’ll run enough nodes to meet demand.
- Low Latency: ID generation should be very fast – each request ideally completes in a few milliseconds. We have an SLA of < 200 ms per request, meaning even at peak load the 99th percentile latency should stay under 200ms. Typical responses will likely be much quicker (a handful of ms) on average, since generation is computationally inexpensive. Keeping latency low ensures the ID service isn’t a bottleneck for client operations.
- High Availability & Fault Tolerance: The service must be reliable and highly available – target 99.99% uptime or better. It should have no single point of failure: if a server or even an entire datacenter goes down, ID generation should continue unaffected (perhaps with clients failing over to another region’s service). Redundancy and failover mechanisms are essential.
- Global Consistency: Ensure global uniqueness of IDs across all datacenters/regions. The design must coordinate or partition ID space such that two different regions never produce colliding IDs. We also want consistency in format – e.g. an ID generated in Europe should have the same structure and length as one from the US region.
- Scalability: The system should scale horizontally. We should be able to add more ID generator instances (in existing or new regions) to increase capacity without major reconfiguration. If traffic doubles, we can deploy more nodes or even introduce new bit partitions (if needed) to handle more IDs. The architecture should accommodate growth in QPS and number of client services smoothly.
- Precision and Ordering Guarantees: The time-sortable IDs should preserve insertion order at least within reasonable bounds (i.e. if one ID is created after another, it should usually have a higher sortable value). Small anomalies due to distributed clocks are acceptable, but generally the ordering should hold. The system’s design (especially for time-sortable IDs) should minimize clock skew issues and handle them gracefully (e.g. if clocks drift or jump, avoid generating duplicates or significant disorder).
- Maintain ID Size Constraints: Many use-cases prefer IDs that are not too large. The time-sortable IDs will be numeric (we aim for 64-bit, which is efficient for databases). Opaque IDs might be larger (e.g. 128-bit UUID format), but we should avoid excessively large identifiers to reduce storage and indexing overhead.
Before diving into design, let’s estimate the load and storage implications:
- Request Volume: Assume ~100k ID generation requests per second as a baseline, which is 8.6×10^9 IDs per day (100,000 * 86,400). If we plan for peaks of 200–300k/sec, that’s on the order of 1.7×10^10 to 2.6×10^10 IDs/day. Over a year, this could be trillions of IDs. The system must handle this volume continuously. Each ID itself is small, but the sheer count is large.
- ID Size and Format: A time-sortable ID will be a 64-bit number (8 bytes). In decimal string form it might be up to 19–20 digits. An opaque ID we might implement as a 128-bit UUID (16 bytes), typically represented as a 36-character string (including hyphens). These sizes are modest per ID. For example, the 63 usable bits of our 64-bit scheme can represent up to 9.22×10^18 values, which is plenty for the foreseeable number of IDs (even 10^10 per day for 10 years is only 3.65×10^13 total). A 128-bit space (~3.4×10^38 possibilities) is astronomically large, effectively ensuring uniqueness.
- Data Storage: The ID generation service does not need to permanently store each ID it generates – it only computes and returns it. We avoid any requirement to log every ID (which would be massive data). The only persistent data might be small metadata: for example, if using a database in segment mode (discussed later), we’d store the current max ID for each segment. Or if using a coordinator for machine IDs, it might store assigned IDs. These are tiny (bytes or a few integers). Thus, storage costs are minimal. We trade off storage for compute; it’s acceptable since generating an ID is cheap.
- Memory and State: Each generator instance might hold some in-memory state, like the last timestamp and sequence number it used, and its assigned machine/region ID. This is on the order of a few bytes per instance. Even with 1000 instances, that’s trivial memory. No large in-memory data structures are required, since we’re not caching large datasets – just computing on the fly.
- Network Throughput: At 100k IDs/sec, if each response is, say, ~50 bytes (ID plus protocol overhead), that’s ~5 MB/sec outbound from the service cluster – easily handled by modern network interfaces. Even 500k/sec would be ~25 MB/sec, which is high but can be distributed across nodes and regions. Internally, a few coordination messages (heartbeats to a coordinator or DB queries) will be much lower volume than the ID traffic.
- Read/Write Ratio: The workload is overwhelmingly write-like (generating new IDs). There is almost no concept of “reading” existing IDs from storage – an ID is generated and delivered, not stored for lookup. We might occasionally validate an ID format (e.g., if a client asks “is this a valid ID string?”) but that’s just computation, not reading from a database. If using a DB for segments, those involve writes (incrementing counters). Overall, it’s a write-heavy, insert-only scenario. This simplifies consistency concerns: we don’t have to manage complex read queries or multi-row transactions, just ensure each generation action is unique and fast.
Architecture Overview: We will build a dedicated ID Generation Service that runs in multiple regions and multiple instances for scalability. Here are the main components and their interactions in our design:
- Client Services: These are the various microservices or applications that need unique IDs (for new database records, objects, etc.). Instead of generating IDs themselves, they will call our ID generator service whenever they require a new ID.
- API Gateway / Load Balancer: Clients will reach out to a well-known endpoint for the ID service. A global load balancer or API gateway will route requests to an ID generator instance. This could be done via DNS (directing to a regional endpoint) or via an anycast IP or cloud load balancer that finds a nearby healthy instance. The gateway ensures even load distribution and directs traffic to the closest datacenter for low latency.
- ID Generation Service Nodes: These are the worker nodes (servers or containers) that actually create IDs. They run our ID generation algorithm (which we’ll detail in the next section). We will have multiple instances per region, each aware of a unique node ID (so they don’t produce colliding IDs). They are essentially stateless in that any node can handle any request – no persistent per-request data – but each has some configuration (its node identifier and algorithm state). The service is likely implemented as a lightweight service in a high-performance language (Java, Go, C++, etc.), optimized for fast atomic operations and time retrieval.
- Coordinator for Node IDs: To guarantee uniqueness across datacenters and nodes, we’ll use a small coordination service. A coordination component (such as Apache ZooKeeper or etcd or a lightweight consensus service) will assign each generator node a unique identifier (consisting of region ID and machine ID bits). For example, when a generator node starts up, it contacts the coordinator to obtain a free “worker ID”. The coordinator keeps track of which IDs are in use to avoid duplication. This ensures no two live nodes ever use the same ID space. (If the coordinator goes down temporarily, existing nodes continue using their IDs, but no new nodes can join – we’ll run the coordinator as a robust cluster to minimize downtime.)
- (Optional) Database for Segment Allocation: Note: In our final design we favor the Snowflake approach without a central DB. But if we were to support an alternative mode (segment-based ID blocks), a relational database or key-value store might be used to store and increment counters. In that mode, each request to the DB yields a range of IDs which the service can then hand out. This DB would be a critical component (and potential bottleneck), so in our chosen approach we avoid it except as a backup or hybrid solution.
Request Flow:
- Client Request: A client needing a new ID makes a request to the ID generation service’s API (for example, an HTTP `GET /generate_id?type=time` or `type=opaque`). This request goes to the load balancer or API gateway.
- Routing: The load balancer forwards the request to one of the available ID generator nodes in the nearest region (using health checks and possibly considering client location). For global clients, DNS might resolve to a local region’s endpoint. For a truly global service, we might also allow cross-region calls if one region is overloaded (though normally each region can handle its local traffic).
- ID Generation: On the chosen generator node, the service code checks the request type:
- If time-sortable ID is requested, the node will use the Snowflake-like algorithm: it reads the current timestamp in milliseconds, and combines it with its configured region/machine ID and an internal sequence counter to generate the next ID. (Detailed algorithm in next section.)
- If opaque ID is requested, the node will use a random ID generator: e.g. call a secure random number generator to produce 128 bits, or use a library call to generate a UUIDv4. This does not require any shared state (each call is independent). In either case, the computation is local and very fast – just a couple of arithmetic and bit operations or random bytes generation. There is no additional network call on this critical path (the service node doesn’t need to ask any other node to get an ID in our design).
- Response: The service node returns the generated ID to the client, typically as a numeric string (for time-sortable) or a UUID-formatted string (for opaque). For example, it might return `{"id": 1541757210912000000}` or `{"id": "c2f3a8d0-7e5b-4efb-b1a2-5d91c1e315b8"}` depending on type. The client receives this response and then uses the ID (e.g. as a key in its database insert). The whole round-trip is quick; even with network overhead, it should be well under the 200 ms SLA – often a few ms within the same region.
- Failure Handling: If the request fails (e.g. the chosen node was unresponsive), the client or gateway can retry, potentially hitting a different node. The ID service is idempotent in the sense that a retried request will just produce a new ID (there’s no harm in unused IDs – an ID that was generated but not used is just lost, with no conflict). Therefore, retries are safe (a minimal client-side retry sketch follows below). Clients could even preemptively request a batch of IDs to have spares on hand in case of transient issues (this is an optional client-side optimization).
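Because an unused ID causes no conflict, clients can wrap the call in a straightforward retry loop. The sketch below is generic over any ID-fetching call (such as the HTTP request shown earlier); the attempt count and backoff values are arbitrary illustrative choices:

```java
import java.util.concurrent.Callable;

public final class IdRetry {
    // Wraps any ID-fetching call (e.g. the HTTP request shown earlier) in a retry loop.
    // A failed attempt only discards an ID that was never handed out, so retrying is safe.
    public static String requestIdWithRetry(Callable<String> requestId, int maxAttempts)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return requestId.call(); // the load balancer may route each attempt to a different node
            } catch (Exception e) {
                last = e;
                Thread.sleep(50L * attempt); // small linear backoff before the next attempt
            }
        }
        throw last; // all attempts failed; surface the last error to the caller
    }
}
```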
Multi-Region Uniqueness: We ensure uniqueness across regions by assigning a distinct region identifier to each region’s cluster of ID nodes. This region ID forms part of the generated ID (for time-sortable IDs) or influences the ID (for opaque, we might not include it explicitly, but we can simply rely on randomness being globally unique). For the Snowflake IDs, for example, we use some of the machine-id bits to encode the region. For instance, a 10-bit machine ID could be split into a 5-bit datacenter ID and 5-bit host ID. Thus, Region A might be datacenter 5 and Region B datacenter 6, etc. This means even if two machines in different regions generate an ID at the exact same millisecond with the same sequence number, the datacenter portion will differ, making the 64-bit IDs distinct. The coordinator service can manage this by pre-allocating a range of machine IDs to each region (e.g. region 1 gets IDs 0–31 for its nodes, region 2 gets 32–63, etc., if using 5-bit region). Alternatively, the region ID can be explicitly one field in the ID. In any case, no coordination between regions is needed per request – the design inherently prevents overlap.
Example Flow: Suppose Service X in the US-East region needs a new ID for a user profile creation. It calls the ID API, the US-East load balancer routes to an ID node (with, say, machine ID = region 1, node 3). The node’s Snowflake algorithm takes the current timestamp (e.g. 2025-05-21 18:30:00.123 UTC, which becomes a 41-bit value relative to the epoch), combines it with its datacenter=1 and worker=3 bits, and sequence (say 5 if it already made 5 IDs this millisecond), producing a 64-bit ID like 0x1E2C... (some big integer). This is returned in, say, decimal form. Meanwhile, a client in EU region might be calling nearly simultaneously; their ID node has datacenter=2, maybe sequence 1, etc., yielding a different 64-bit number. Even if the timestamp part was identical, the datacenter bits differ so the IDs are unique. Later, when these IDs are stored in a database, sorting them will put the US-East one before or after the EU one depending on timestamp (if timestamps were truly equal to the ms, their relative order in numeric comparison will be determined by datacenter bits – so not strictly by true time, but such ties are extremely rare and only within the same millisecond window).
In summary, the high-level design uses distributed ID generator nodes that operate mostly independently, coordinated only by a lightweight service to assign unique node IDs. Clients interact with it through a simple network API. This design is stateless in the request/response path, enabling easy load balancing and horizontal scaling.
Let’s dive deeper into how the ID generation works, and the rationale behind chosen algorithms. We will compare a few strategies and then describe our chosen solution in detail.
5.1 Comparing ID Generation Approaches
- UUID (Random 128-bit IDs): Generating a UUID (v4) is straightforward – each ID is 16 random bytes. The advantage is no coordination or global state needed: any node can generate IDs independently and the probability of collision is astronomically low. UUIDs are widely used in distributed systems for this reason. They are also opaque (no sequence or time info can be inferred). However, they are 128 bits long, which is overkill for many uses and can impact storage or indexing (twice the size of a 64-bit number). They also don’t provide ordering: one cannot sort by UUID and get chronological order (unless using newer versions like UUIDv7 which include time, but v4 is fully random). In contexts where order or smaller size matters, UUID v4 is less ideal.
- Database Auto-Increment / Central Counter: One simple approach is to have a single, centralized counter (for example, a row in a SQL database that we `UPDATE` to increment and return the new value for each ID). This guarantees uniqueness and natural ordering. Some systems have used this (e.g. using a dedicated “ID server” or a DB table). The huge downside is scalability: every ID request hits the same database or primary node, creating a bottleneck. At 100k IDs/sec, a single DB sequence likely cannot keep up (due to transaction log and lock contention). Even if it could, it becomes a single point of failure – if that DB is down, no IDs can be generated. This approach also incurs high latency as each request is a DB write. It fails the high-throughput and HA requirements unless heavily optimized (caching sequences, etc., essentially evolving into the segment approach below).
- Segment (Batch) Allocation via Database: This is a compromise approach: use a database or persistent store to allocate chunks or segments of IDs to each generator node. For example, a node might ask the DB: “give me the next block of 10,000 IDs”. The DB atomically increases its counter by 10,000 and returns a range (say 1,000,000–1,009,999). The node can then serve IDs 1,000,000 upward from memory without further DB calls until it exhausts that block, at which point it requests a new block (a hedged sketch of this allocation query appears after this comparison). This drastically reduces the DB contention (only one query per 10k IDs instead of per ID) and thus can scale much better – effectively the throughput is limited by how fast nodes can consume blocks and occasionally hit the DB. It also improves latency for most requests (since serving from memory is fast). However, the system still relies on the availability of that central DB for replenishing ID ranges. If the DB goes down or a network partition occurs when blocks need renewal, the service will eventually stall once all nodes use up their current ranges. Also, assigning fixed segments per node introduces waste: if a node crashes with half its segment unused, those IDs might never be issued (to avoid duplicates, we wouldn’t reassign that half segment). This waste is usually acceptable, though, in exchange for simplicity and uniqueness. The segment approach offers ordered IDs (if a single global counter is used, all IDs are increasing globally), but if multiple nodes in different regions get segments, those segments could be interleaved in time (e.g., node A’s segment covers IDs 1–10000 and node B has 10001–20000, but node B might generate some of its later IDs before node A finishes its range, slightly mixing the order of actual creation). Without careful orchestration, time ordering isn’t strict globally, though each node generates in order within its segment.
- Distributed GUID Services (Snowflake and its variants): Twitter’s Snowflake algorithm is a well-known solution designed for high scale. It avoids any per-ID central bottleneck by embedding uniqueness factors into the ID itself. In Snowflake, a 64-bit ID is composed of: a timestamp, a machine identifier, and a sequence number. Each generator node can independently create IDs as long as it has a unique machine ID and the clocks are in sync. Twitter’s original implementation used 41 bits for time, 10 bits for machine (often split into 5-bit datacenter + 5-bit node), and 12 bits for sequence. This yields:
- 2^41 ≈ 2.2 trillion timestamp values (enough for 69 years at millisecond precision).
- 2^10 = 1024 unique machine IDs (up to 1024 nodes generating in parallel).
- 2^12 = 4096 IDs per node per millisecond.
This design can generate IDs extremely fast. One Snowflake node can output 4096 IDs in a single millisecond before it has to wait for the next millisecond. That’s up to ~4 million IDs/second on one machine theoretically, far above our requirements. In practice, Twitter and others rarely hit that limit, but it gives headroom. Snowflake IDs are time-sortable (mostly – if two different nodes generate IDs, a slightly slower clock could cause some slight disorder globally, but within one node it’s strictly ordered by time and sequence). The downside is that Snowflake requires coordination to assign machine IDs uniquely and needs all nodes to have reasonably synchronized clocks. There’s also a predictability issue: since the ID increases with time and sequence, someone observing IDs might infer approximate timestamps or system load (this is a security consideration; Snowflake IDs are not cryptographically random). Twitter’s implementation used ZooKeeper to manage machine ID assignment and to handle clock issues (e.g., if a clock went backward, Snowflake could halt ID generation until recovery to avoid duplicates). Despite complexity, Snowflake is highly scalable and has no single point of failure in the ID generation path – nodes can work independently. Many companies (Instagram, Discord, etc.) have adopted similar schemes, sometimes tweaking the bit allocations for their needs.
- Other Approaches: There are other ID schemes like MongoDB’s ObjectID (a 96-bit ID with time, machine ID, and counter), or CUID/NanoID (client-generated, highly random IDs often as strings), and the upcoming UUIDv7 (128-bit time-ordered UUID). MongoDB’s ObjectIDs are interesting – they include a 4-byte timestamp (to seconds precision), a 5-byte random host identifier, and a 3-byte counter. They are larger than 64 bits and only roughly ordered (seconds, not milliseconds). CUID and NanoID are more for low-collision use at modest scale (often used for front-end or offline generation); they prioritize uniqueness and some randomness, potentially with longer string representations. For our scale, these aren’t as commonly used in backend distributed systems as Snowflake or UUID due to either length or throughput considerations.
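For reference, the block-allocation query from the segment approach can be sketched as a single transactional update-then-read against a hypothetical `id_segments` table (table name, columns, and step size are illustrative; our chosen design does not use this path):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public final class SegmentAllocator {
    // Atomically reserves the next block of `step` IDs for one tag and returns the
    // first ID of the block. Assumes a hypothetical table:
    //   CREATE TABLE id_segments (biz_tag VARCHAR(64) PRIMARY KEY, max_id BIGINT NOT NULL);
    public static long allocateBlock(Connection conn, String bizTag, long step) throws SQLException {
        boolean oldAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);
        try (PreparedStatement update = conn.prepareStatement(
                     "UPDATE id_segments SET max_id = max_id + ? WHERE biz_tag = ?");
             PreparedStatement select = conn.prepareStatement(
                     "SELECT max_id FROM id_segments WHERE biz_tag = ?")) {
            update.setLong(1, step);
            update.setString(2, bizTag);
            update.executeUpdate();              // advance the counter by one block
            select.setString(1, bizTag);
            try (ResultSet rs = select.executeQuery()) {
                rs.next();
                long newMax = rs.getLong(1);     // end of the reserved block (inclusive)
                conn.commit();
                return newMax - step + 1;        // first ID of the block; the node serves up to newMax
            }
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        } finally {
            conn.setAutoCommit(oldAutoCommit);
        }
    }
}
```

The node would keep the returned range in memory and request another block shortly before exhausting it, which is the prefetching behavior described above.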
Why Snowflake + UUID (hybrid) for our design? Based on the above, we choose a hybrid approach to meet all requirements:
- For time-sortable IDs, we adopt the Snowflake-style strategy. It directly addresses the need for ordered, unique IDs at massive scale. With 64-bit IDs, storage and bandwidth overheads are low, and the IDs can be used as database keys efficiently. The generation is distributed and extremely fast, as needed for >100k IDs/sec. We will handle the coordination (machine ID assignment) and clock synchronization issues as manageable trade-offs for these benefits. The Snowflake approach meets the functional need for time-sortable identifiers and the non-functional needs for throughput and partition tolerance (no central bottleneck).
- For opaque IDs, we will use a secure random 128-bit ID (essentially a UUID v4). This satisfies the requirement that some IDs carry no timestamp or ordering info. By using a well-established UUID approach, we get extremely high uniqueness reliability without coordination. Each ID generator node can create opaque IDs independently, and the probability of collision between any two nodes’ outputs is so low it can be ignored in practice. This method also yields non-sequential, unguessable IDs suitable for public exposure (e.g. in URLs or API keys).
Importantly, both methods can be provided by the same service – for example, our ID generator node can implement two code paths: one for Snowflake IDs and one for UUIDs. This way, we maintain one system but support both formats. Many systems actually do use a combination depending on context (e.g., use time-based IDs internally but expose opaque IDs to external clients for security).
5.2 Snowflake ID Algorithm Design
For the time-sortable ID generation, we’ll follow the Snowflake format with some customization:
- ID Bit Structure: We use a 64-bit unsigned integer for each ID. We will not use the most significant bit (bit 63) so that the ID fits in a signed 64-bit space if needed (Snowflake sets the top bit to 0). The remaining 63 bits are allocated as follows:
- Timestamp – 41 bits: This is the number of milliseconds since a custom epoch (a start time we define, e.g., Jan 1, 2025, or Twitter’s epoch of Nov 2010). 41 bits gives us ~2.2 trillion ms of range, which is about 69 years. For example, if we start at 2025, we’d be safe until ~2094. The timestamp ensures that as time increases, the ID’s highest bits increase.
- Region + Machine ID – 10 bits: We carve these 10 bits into a region identifier and machine identifier within the region. For instance, 5 bits for Region (up to 2^5 = 32 regions) and 5 bits for Machine (up to 32 nodes per region). Or 4 bits region (16 regions) and 6 bits machine (64 nodes each), depending on expected deployments. In our design, 32 regions is likely plenty (even very large systems might have on the order of dozens of datacenters). This field ensures different nodes produce distinct IDs even if their clocks and sequence overlap. The exact split can be adjusted; what’s important is the 10-bit combination is unique for each generator node. The coordinator service guarantees this uniqueness by assigning those IDs. (If we ever needed more than 1024 total nodes, we’d have to enlarge this field or introduce a second-tier coordination – but 1024 should suffice given the throughput each node can handle.)
- Sequence – 12 bits: This is a counter that each node uses to differentiate IDs generated within the same millisecond. It increments for each ID and resets to 0 when the millisecond timestamp changes. With 12 bits, a node can generate 2^12 = 4096 IDs in one millisecond before the counter would overflow. In the rare case that a node hits this limit (meaning it tried to generate >4096 IDs in one ms, which is >4 million IDs/sec rate on one machine), the algorithm will block until the next millisecond tick before continuing. This throttle ensures no duplicates – it will not reuse sequence numbers within the same timestamp. In practice, 4096 IDs/ms per node is an enormous rate, so hitting this will be infrequent (if it does become frequent, it’s a sign we should add more generator nodes to spread load, or split one node’s load into multiple processes with distinct machine IDs).
- ID Format Example: Using this structure, an example 64-bit ID in binary looks like `[timestamp: 41 bits][region+machine: 10 bits][seq: 12 bits]`. For illustration, suppose the 41-bit timestamp (since epoch) is `101010...` (some value), region ID = 3 (`00011` in 5 bits), machine ID = 5 (`00101` in 5 bits), and sequence = 37 (`000000100101` in 12 bits). These bits concatenate to form the 64-bit number. In a real example, Snowflake ID 1922298559865028608 (which is a tweet ID) breaks down to timestamp, machine, sequence as documented. Because of this composition, one can extract the time and other info if they know the format, which is why we categorize it as non-opaque. But it serves the ordering and scaling purpose well.
- Machine ID Coordination: We will run a ZooKeeper (or similar) cluster that manages the 10-bit machine IDs. When an ID generator node starts, it will create an ephemeral node in ZooKeeper like `/id-generators/<region>/<machine_id>`. ZooKeeper can be configured to auto-assign an ID or the node can attempt to create nodes with incremental IDs until it finds a free one. Another approach is to have the node supply its region and ask ZooKeeper for a free machine slot in that region. For example, region 1 nodes can use IDs 0–31. ZooKeeper ensures two live nodes don’t get the same ID by keeping track of active ephemeral nodes. If a node dies, its ZK ephemeral node is removed, freeing up that ID for reuse. We must be careful with reuse: ideally, we don’t immediately reuse a machine ID of a node that just died until we’re sure its clock won’t cause issues. In practice, since the node is down, it won’t generate new IDs, so reuse is safe if the new node’s clock is >= the last timestamp of the old node. We might add a small buffer or record of last timestamp to avoid edge cases (more on clock issues below).
  - Failure scenario: If ZooKeeper itself fails, existing ID nodes can continue to generate IDs with their already-assigned IDs (no impact). We just wouldn’t be able to start new nodes or reassign IDs until ZK is back. To mitigate that, we run ZK as a 3-node ensemble across different servers (and possibly across regions) so that it’s highly available. The ID generator is not very sensitive to short ZK outages, as it rarely changes state after initial assignment.
- Clock Synchronization and Handling Skew: All Snowflake-like systems depend on system clocks moving forward reasonably in sync. We will ensure that all generator nodes run NTP (Network Time Protocol) daemons to keep their clocks accurate to a few milliseconds. Minor clock differences mean one node’s IDs might appear slightly out-of-order compared to another’s in wall-clock time, but it won’t cause duplicates. The real risk is if a clock jumps backward on a node (e.g., an NTP correction or VM pause causes time to go back). Suppose node A’s clock goes 100ms backwards – suddenly its timestamp may duplicate a range it had already used. To prevent collisions:
- Each node’s algorithm will track the last timestamp used. If the current system time is less than the last timestamp, the generator will pause until the clock catches up beyond the last timestamp. For small skews, this is quick. For larger jumps, this essentially means the service on that node is stalled for that duration (worst case, an operator alert might be needed if the machine’s clock is really off). This pause prevents it from generating IDs that could duplicate previously generated ones.
- In practice, significant backward jumps are rare if NTP is configured with small adjustments. If a large jump does occur (say someone manually changed the clock), the safe approach might be to restart the service with a new machine ID (treat that node as a new instance) so it doesn’t overlap its past timeline.
- We could also record the last timestamp to stable storage on shutdown and refuse to start if time is behind that, but given we prefer stateless, a simpler runtime check is sufficient.
- Clock moving forward too fast (jumping ahead) is less dangerous for uniqueness (it’ll just produce very large timestamp values, which are still unique). But it could create a gap in the time ordering compared to other nodes. We don’t specifically prevent that, but NTP usually smooths adjustments to avoid big leaps. If a node’s clock jumped far into the future and generated IDs, then reverts, subsequent IDs from that node will have lower timestamps than it previously produced – which could break the sorting property slightly (those future IDs would appear as outliers). However, they’d still be unique. This situation is unlikely, but in extreme cases, manual intervention (like configuring NTP to step slowly or restarting the service after a big time correction) can manage it.
- Concurrency and Throughput on a Node: The generation algorithm on a single node can be made very fast. It essentially does: read clock (time in ms); if same as last time -> increment sequence; if new time -> reset sequence = 0. Then compose bits into the ID. This can be done in a few CPU instructions (a minimal sketch of this routine appears at the end of this section). We will implement this carefully to handle concurrent calls on a multi-threaded server:
- Use a lock or atomic compare-and-swap on the timestamp and sequence values. Since this is very low-level, the overhead is minimal. Alternatively, some implementations use a single thread or coroutine to generate IDs sequentially to avoid locking and just feed an internal queue – but given the simplicity, even a locked section that does a couple of comparisons and increments can handle millions of ops/sec in modern CPUs. Our target ~100k/sec per node is easily handled with a lock (that’s only 100k lock operations per second, trivial for a CPU).
- If throughput per node needs to be higher, we could partition sequence ranges among threads (e.g., thread 1 uses sequence 0–2047 and thread 2 uses 2048–4095 within the same ms). But this adds complexity and likely isn’t needed unless a single machine must push the absolute limits. With multiple nodes available, we prefer to scale out rather than overly optimize one node’s multi-threading beyond necessity.
- The node will likely run as a stateless service (maybe an HTTP server or RPC server). Each incoming request triggers the ID generation routine. This routine should be extremely fast (microseconds), so the majority of request latency will be network overhead. We can thus handle many requests per second per thread. If needed, the service can pipeline or batch internally, but likely one ID per request is fine given the throughput.
- Integration with the API: The generator node will format the 64-bit ID into the response. Often, systems return Snowflake IDs as decimal strings (Twitter, for example, exposes tweet IDs as large decimal numbers). We can do the same, or return it as a JSON number. Clients (especially in languages that can handle 64-bit ints) can treat it as an integer. Some databases might store it as a BIGINT. We must ensure that any client in JavaScript (which loses precision beyond 53 bits for numbers) gets it as a string to avoid precision issues. That’s more of an API detail, but worth noting for implementation.
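To make the algorithm concrete, here is a minimal single-node sketch of the generator described above. The custom epoch, bit widths, and the way the datacenter/worker IDs are supplied (constructor arguments here; in production they would come from the ZooKeeper coordinator) are assumptions consistent with this section, not a final implementation:

```java
public final class SnowflakeIdGenerator {
    // Bit layout: [timestamp 41][datacenter 5][worker 5][sequence 12], top bit unused.
    private static final long EPOCH_MS = 1735689600000L; // assumed custom epoch: 2025-01-01T00:00:00Z
    private static final long DATACENTER_BITS = 5, WORKER_BITS = 5, SEQUENCE_BITS = 12;
    private static final long MAX_SEQUENCE = (1L << SEQUENCE_BITS) - 1;              // 4095
    private static final long WORKER_SHIFT = SEQUENCE_BITS;                          // 12
    private static final long DATACENTER_SHIFT = SEQUENCE_BITS + WORKER_BITS;        // 17
    private static final long TIMESTAMP_SHIFT = DATACENTER_SHIFT + DATACENTER_BITS;  // 22

    private final long datacenterId; // 0..31, one value per region
    private final long workerId;     // 0..31, assigned by the coordinator (e.g. ZooKeeper)
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    public SnowflakeIdGenerator(long datacenterId, long workerId) {
        if (datacenterId < 0 || datacenterId > 31 || workerId < 0 || workerId > 31) {
            throw new IllegalArgumentException("datacenterId and workerId must fit in 5 bits");
        }
        this.datacenterId = datacenterId;
        this.workerId = workerId;
    }

    public synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now < lastTimestamp) {
            // Clock moved backwards: wait it out rather than risk duplicates.
            now = waitUntil(lastTimestamp);
        }
        if (now == lastTimestamp) {
            sequence = (sequence + 1) & MAX_SEQUENCE;
            if (sequence == 0) {
                // Sequence exhausted within this millisecond: spin to the next tick.
                now = waitUntil(lastTimestamp + 1);
            }
        } else {
            sequence = 0;
        }
        lastTimestamp = now;
        return ((now - EPOCH_MS) << TIMESTAMP_SHIFT)
                | (datacenterId << DATACENTER_SHIFT)
                | (workerId << WORKER_SHIFT)
                | sequence;
    }

    private static long waitUntil(long targetMs) {
        long now = System.currentTimeMillis();
        while (now < targetMs) {
            now = System.currentTimeMillis();
        }
        return now;
    }
}
```

Consistent with the locking discussion above, a plain `synchronized` method is sufficient here: even at the 100k IDs/sec target, the critical section is only a handful of comparisons and shifts.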
5.3 Opaque ID (UUID) Generation Design
For the opaque IDs, our design is simpler:
- We will use a 128-bit UUID version 4 style generation. This essentially means: generate 16 random bytes using a CSPRNG (cryptographically secure pseudo-random number generator) and set the proper bits for the UUID version and variant (UUID v4 requires certain bits to indicate it’s random). This yields an identifier such as `f47ac10b-58cc-4372-a567-0e02b2c3d479` (the canonical 8-4-4-4-12 hex digit format with hyphens). We can also choose to output it as a 32-character hex string without hyphens, depending on preference. The exact format doesn’t affect uniqueness (a minimal generation sketch follows this list).
- Uniqueness and Collision: The chance of two 128-bit random values colliding is so incredibly low that we consider it impossible for practical purposes. As a cited estimate, generating 1 billion random UUIDs per second for 100 years gives a ~50% chance of one collision. Our system generating maybe 10^10 per day is nowhere near that scale. Thus, we don’t need any coordination or checking for duplicates. We trust the large address space. (We will use a good source of randomness; most languages provide a secure RNG suitable for UUIDs, or even library functions to directly get a UUID.)
- Performance: Generating a random 128-bit number is very fast. The bottleneck might be converting it to string or formatting with hyphens, but that’s also trivial at our scale. Modern CPUs can generate millions of random numbers per second. Even with locking around a RNG, it’s not a problem to do 100k/sec. If needed, we can use thread-local RNG instances to avoid contention. The latency added per request is negligible (cryptographically secure RNG might be slightly slower than a normal RNG, but still microsecond-level per generation).
- Statelessness: Each call is independent. We don’t have to remember anything between calls. That means any node can handle an opaque-ID request at any time. We don’t even need the coordination service for this mode except to ensure the same nodes aren’t also accidentally generating the same random (which is statistically implausible anyway). There is no sequence or time to manage.
- Security & Opaqueness: The generated ID is effectively random gibberish to an outsider. It does not reveal the time or the origin. This is ideal for cases where IDs might be visible in URLs or through APIs and we don’t want users to infer how many items exist or when something was created. (For example, if user IDs were sequential, a malicious actor could scrape and guess valid IDs; using opaque random IDs mitigates that, at the cost of not being able to sort by creation without an external timestamp field).
- API Integration: The service will return the opaque ID likely as a string in UUID format, for example: `{"id": "de305d54-75b4-431b-adb2-eb6b9e546014"}`. Clients will treat it as an opaque token. If they need to store creation time, they must store that separately because the ID itself doesn’t convey it. In some cases, we might decide to use UUIDv7 (which is time-based) or ULID for opaque IDs if we wanted slight ordering, but that would leak time, making it not truly opaque. So we stick to purely random for the opaque variant. (We can document to clients that if they require ordering, they should use the time-sortable IDs; if they require non-guessable IDs, use opaque.)
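For the opaque path, a minimal Java sketch follows. The JDK’s `UUID.randomUUID()` already produces a CSPRNG-backed v4 UUID; the manual variant is included only to show where the version and variant bits are set:

```java
import java.security.SecureRandom;
import java.util.UUID;

public final class OpaqueIdGenerator {
    private static final SecureRandom RANDOM = new SecureRandom();

    // Simplest path: the JDK's built-in v4 generator (random, CSPRNG-backed).
    public static String newOpaqueId() {
        return UUID.randomUUID().toString(); // e.g. "de305d54-75b4-431b-adb2-eb6b9e546014"
    }

    // Manual variant showing the version/variant bit twiddling a v4 UUID requires.
    public static UUID newOpaqueIdManually() {
        byte[] bytes = new byte[16];
        RANDOM.nextBytes(bytes);
        bytes[6] = (byte) ((bytes[6] & 0x0F) | 0x40); // version 4 in the high nibble of byte 6
        bytes[8] = (byte) ((bytes[8] & 0x3F) | 0x80); // IETF variant in the top two bits of byte 8
        long msb = 0, lsb = 0;
        for (int i = 0; i < 8; i++)  msb = (msb << 8) | (bytes[i] & 0xFF);
        for (int i = 8; i < 16; i++) lsb = (lsb << 8) | (bytes[i] & 0xFF);
        return new UUID(msb, lsb);
    }
}
```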
5.4 Data Storage and State Management
Our chosen design deliberately minimizes persistent storage:
- No ID database: We are not storing each generated ID in any database or log (beyond transient logs perhaps for debugging). This means the system doesn’t build up a storage burden proportional to IDs generated. We avoid the complexity of a distributed DB or cache to store IDs. The uniqueness is guaranteed by design rather than by checking against stored records. This is crucial for scalability (imagine trying to insert 100k keys/sec into a DB – not feasible without huge infrastructure). Instead, uniqueness is pre-guaranteed by the algorithm (via time + machine + sequence uniqueness, or via randomness space).
- Coordinator storage: The only storage of note is in the coordinator (ZooKeeper/etcd). ZooKeeper will store small znodes for each active worker (e.g., a few bytes each with maybe the worker’s address or ID). This is on the order of at most a thousand entries, which is tiny. ZK keeps this in memory and its own transaction log on disk (which is minimal here). The coordinator state is important to preserve (so that if everything restarts, we don’t double-assign IDs). But since it’s only a few entries, even a backup or reload is trivial. In worst case, if coordinator data is lost and all nodes restart, we could temporarily risk duplicate assignments – to prevent that, we’d either have a convention (like each node configured with an ID manually in that scenario) or require coordinator recovery. Generally, we will treat coordinator data with care (e.g., run it with persistence on disk, even though we use ephemeral nodes, we might also maintain a mapping record).
- Monitoring and Logging: We will have logs for operations and possibly an audit trail for issued IDs (at least for a short window) in memory to detect any anomaly (like duplicate detection). But logging every single ID to disk is not practical. Instead, we might log counts per second or significant events (like if the sequence had to roll over or if a clock went backward event happened). These logs help in debugging but are not part of the functional data path.
- Data for ID Validation: If needed, our service could offer a method to parse/validate IDs (e.g., given a purported ID, confirm if it’s a valid format and maybe decode timestamp). This doesn’t require stored data – it’s purely algorithmic (e.g., check length or bit patterns). For Snowflake IDs, one can decode the timestamp by shifting right 22 bits (in our 64-bit scheme) and adding the epoch, etc. That could be a utility API but not a major component.
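As an example of such purely algorithmic decoding, the sketch below splits a Snowflake-style ID back into its fields using the bit layout from section 5.2 (the epoch constant and field widths are the same assumptions as in that sketch):

```java
public final class SnowflakeIdDecoder {
    private static final long EPOCH_MS = 1735689600000L; // must match the generator's assumed epoch

    // Splits a 64-bit ID into its fields: [timestamp 41][datacenter 5][worker 5][sequence 12].
    public static String describe(long id) {
        long timestampMs = (id >>> 22) + EPOCH_MS; // shift past 10 machine bits + 12 sequence bits
        long datacenterId = (id >>> 17) & 0x1F;    // next 5 bits
        long workerId = (id >>> 12) & 0x1F;        // next 5 bits
        long sequence = id & 0xFFF;                // low 12 bits
        return String.format("time=%d ms, datacenter=%d, worker=%d, seq=%d",
                timestampMs, datacenterId, workerId, sequence);
    }
}
```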
5.5 Failure Recovery Details
- Generator Node Failure: If an ID generator instance crashes or is removed, any IDs it already issued remain valid (they’re just numbers, already used by clients). The coordinator will detect the session loss (e.g., ZooKeeper notices the ephemeral znode gone) and free up that machine ID. Our system can automatically spin up a new generator instance (if using an orchestrator like Kubernetes or auto-scaling group) to replace it. The new instance will register and likely get the same machine ID if it’s the next available in that region. If it does get the same ID, we rely on the clock handling to avoid duplicate IDs. (We assume the new instance’s clock is current; if by chance the old instance had a slightly ahead clock, the worst case is the new one might produce IDs slightly lower than the old’s last ones, but since the old one is offline, duplicates still won’t occur – those lower IDs were never generated before). To be extra safe, we could configure that a restarting node uses a different machine ID than a recently failed one, but that may reduce our ID space if not managed. Given synchronized clocks, this isn’t a major concern.
- Datacenter Outage: If an entire region’s datacenter goes down, those nodes stop generating. Clients can be configured to failover – e.g., the load balancer could route requests to a secondary region. Since region is encoded in IDs, an ID from a different region will still be unique. There’s no dependency between regions, so other regions continue normally. When the down region comes back, its nodes might start again (with time caught up) and generate IDs with their region bits, which will be higher than anything they generated before the outage (because the timestamp will be later). We should ensure the epoch and time bits make sense – if a region is down for a long time, it just wasn’t producing IDs, which is fine. There’s no “hole” problem globally, since IDs don’t need to be contiguous – uniqueness is enough.
- ZooKeeper Failure: As mentioned, if the coordinator cluster (ZK) fails completely, no new nodes can join, and if it’s down for long, an existing node that loses its session might be unable to re-establish it (which in the worst case could cause it to stop if it was programmed to require an active ZK session; we could design the generator to continue with its last known ID while ZK is down, to be more robust, as long as it doesn’t detect a conflict). We mitigate this by having a reliable ZK setup. Also, because ZK only manages node IDs and not each request, even a slow ZK doesn’t affect generation throughput. The system is mostly decoupled from ZK at runtime (especially if all nodes are already assigned). We might implement a cache such that if ZK is temporarily unreachable, a node just continues using its assigned ID until it can reconnect (this avoids it shutting down unnecessarily). Once ZK recovers, it can reconcile any changes. This is an engineering detail but ensures high availability.
- Duplicate ID Safeguards: Although our logic guarantees uniqueness in theory, we can add runtime checks in debug mode – e.g., each node could keep a sliding window of recent IDs it generated (or just the last generated ID) to assert that new IDs are always larger (for time-sortable) or not equal to the last (for random). This is mainly to catch any bug. In production at scale, storing even a million recent IDs per node is too much overhead, but storing the last one or last timestamp is fine. We rely on the mathematical guarantees for the rest.
- Scaling Beyond 1024 Nodes: If we ever needed more generator nodes concurrently (say the service became so popular we needed more than 1024 instances to handle load, or more than 32 regions, etc.), we have a few options:
- Increase the machine ID bits (at expense of reducing timestamp bits, shortening the time range). For example, 12 bits for machine and 10 for sequence (instead of 10/12) would allow 4096 nodes but only 1024 IDs/ms per node. If we had so many nodes, likely per-node throughput isn’t a problem, so that trade-off could work.
- Or deploy a second independent Snowflake service with a different epoch or different ID namespace (like prefix the IDs differently). But that complicates clients having to coordinate which service to call.
- Given each node can handle millions per second, needing >1024 concurrently active nodes is unlikely unless each is limited by other factors (like network or CPU). In practice, we likely won’t hit this limit. It’s an available headroom that 1024 nodes * millions IDs/sec is vastly above our target 100k/sec.
In conclusion, our detailed design chooses the Snowflake algorithm to fulfill the time-sortable, high-throughput ID needs, and a UUID-like random approach for opaque IDs. This covers both functional requirements. By carefully managing the bit schema, machine ID assignment, and clock behavior, we ensure global uniqueness and reliability. The design is battle-tested by industry (Twitter’s Snowflake and its derivatives) which underscores its viability. Meanwhile, using UUIDs for opaque IDs leverages a well-understood standard for uniqueness.
Our system is designed to scale horizontally and remain performant under heavy load. Here are key strategies and considerations for scalability:
- Sharding and Partitioning: The beauty of the Snowflake approach is that it inherently shards the ID generation by machine. Each node is a shard (identified by its machine ID bits) that can work in parallel. We don’t have to manually partition requests; any incoming request can go to any node (for opaque IDs) or any specific region’s node (for time-sortable, we prefer local region node for latency). The assignment of unique node IDs via coordinator is essentially partitioning the ID space – each node handles a disjoint subset of IDs (distinguished by its machine bits). If we need to add more capacity, we add more nodes (up to the limit of machine ID space) and each new node takes up its slice of the ID space. This is linear scalability with respect to number of nodes. We also partition by region: each region’s cluster uses its region ID space. That way, adding a new region (if the service expands to new datacenters) is also just an extension of the partition scheme. For the segment DB approach (if we had used it), scaling might involve sharding the database (e.g., having separate counters for different “business tags” or using multiple sequence tables). But since we’re favoring no central DB, we avoid that complexity.
- Load Balancing: Ensuring even distribution of requests prevents any single node from becoming a bottleneck. The load balancer (or a smart client library if we had one) should distribute calls roughly evenly among the healthy nodes in a region. Since all nodes do the same work and are stateless, a simple round-robin or least-connections strategy works. If one node starts lagging (maybe local issue), the LB should mark it unhealthy and bypass it. We’ll also use health checks on each node (a simple ping or test ID generation) to detect failures quickly. Because request processing is lightweight, typically CPU is the main factor – we might monitor CPU usage on nodes and adjust the number of nodes if they’re running too hot. If one region has disproportionate load, we can either overprovision nodes there or even route some requests to a less loaded region as a fallback (with a slight latency cost). The system is flexible: any node can theoretically serve any request, so geographic affinity is only for performance, not correctness.
- Caching and Prefetching: For our chosen design, there isn’t much to cache – generation is computation, not a data fetch. One area of “prefetch” is in the segment allocation strategy: nodes would prefetch the next ID block before the current one runs out, to hide the DB latency. Since we aren’t primarily using that approach, this isn’t needed. However, one analogous idea: if a client knows it will need many IDs, it could request a batch (say 100 IDs in one call) and cache them client-side. This reduces calls per ID. Our service can support a batch API easily. Batching 100 IDs still only takes a millisecond or so to generate on a node (just loop or use vectorized operations), and reduces network roundtrips for the client by 100x. This is an optional performance optimization at the application level.
- Sharded ID Ranges (if using DB): In a scenario where a centralized DB counter was unavoidable, one trick is to shard by key-space. For example, odd IDs could be generated by one DB and even IDs by another – doubling throughput without collisions. Or more generally, mod N sharding, where each of N sequences feeds a portion of IDs (like ID = sequence_value * N + shard_index). However, combining results from such sharded sequences loses global sorting (an ID from shard 0 might be higher than one from shard 1 even if created earlier, depending on the multiplier). Since we have Snowflake, we don’t need this, but it’s worth noting as a theoretical scaling method. Our design avoids multi-master DB issues entirely by not using the DB for generation in the critical path.
- Sharding by Business Context: If some services use far more IDs than others, another approach (within a segment strategy) is to have separate sequences for separate contexts. For instance, one could maintain different counters for “user IDs”, “order IDs”, etc., each scaling independently. But again, with Snowflake/UUID, we don’t need separate counters – one unified service can generate for all contexts without risk of collision. If needed, we could prefix IDs by type, but that’s not required since all IDs are globally unique anyway. Instead, we might rely on the client to know what an ID refers to (or use an opaque prefix like the Cronofy example where they prefix “user_” or “order_” to an ID to indicate type – that’s an application-layer detail outside our core system).
- Sharding the Coordinator: ZooKeeper itself can handle 1000+ nodes easily, as it’s only storing small data. But for safety, we distribute ZK’s ensemble nodes across different racks/zones. If we reached tens of thousands of generator nodes (which as discussed is unlikely), ZK might become a chokepoint (in maintaining that many ephemeral nodes and heartbeats). In such a scenario, we could partition machine ID assignment by region (e.g., each region runs its own small coordinator for its machines, since regions are already disjoint in ID space). This localizes the coordination traffic. The global uniqueness still holds because region IDs differ. So effectively each region’s ZK manages up to, say, 64 machines (for a 6-bit machine ID example) – trivial for ZK. This way we wouldn’t have one giant ZK cluster for all 1000+ nodes, but multiple smaller ones.
- Shadowing and Redundancy: We can run extra standby nodes that aren’t actively serving but can quickly take over if load increases or a node fails. However, given the ease of horizontal scaling, it might be simpler to just always run N+1 nodes beyond expected capacity. The coordinator doesn’t mind some unused IDs; they’re just not assigned.
- Avoiding Single Points of Failure (SPOFs): We have addressed many SPOFs:
- No single database in the critical path (Snowflake eliminates that).
- The coordinator (ZK) is replicated and not needed per request, so it’s not a runtime SPOF.
- The load balancer can be redundant (multiple HA proxy instances or cloud LB service which is inherently redundant).
- Each region is independent, so a failure in one region doesn’t bring down others.
- If an entire region is down, the system can still serve from other regions (global availability, though clients might have higher latency).
- We should ensure the ID epoch (start time) is consistent across all nodes – this is a configuration setting. If one node had a different epoch, its IDs could clash or be out of order. This is just a deployment detail (use same config everywhere, and perhaps include epoch in the code). That’s a “consistency of configuration” SPOF to watch out for – solved by automation and verification in deployment.
- Sharding Sequence Within a Node (micro-optimization): As mentioned, if one node has multiple threads frequently contending on the sequence counter, we could assign half the sequence range to one thread, half to another, to reduce lock contention. This effectively treats each thread as a sub-generator (with an implicit extra bit for thread ID). However, given our throughput target and that languages like Java can atomically increment an `AtomicInteger` millions of times per second, this complexity likely isn’t necessary. If profiling shows lock contention at some extreme load, we can revisit this optimization.
- Sharding for Opaque IDs: For random IDs, no sharding needed because collision probability is already negligible. Each node does its own thing. If we were extremely paranoid, we could incorporate node ID into the random (e.g., generate 120 random bits and then append 8 bits of machine ID). This would guarantee no two machines ever output the same value (because their last 8 bits differ) and still leaves 120 random bits – practically the same collision probability. But this isn’t standard UUID and not necessary. Standard UUIDv1/v2 include MAC address for uniqueness; in v4 it’s purely random. We trust the randomness, but adding machine ID wouldn’t hurt except making the IDs slightly less opaque (someone could figure out which machine from those bits if they knew how we structured it). For simplicity, we’ll stick to standard random.
- Latency considerations: The 200ms SLA is quite generous for what is basically a local computation plus network hop. In a single region, we expect ~1-5 ms responses. The 99th percentile might rise if there is a pause (e.g., if a node hits a sequence rollover and waits a millisecond, or if a GC pause in a managed language stalls a thread briefly). But even worst-case 10–20ms should cover it. Cross-region failovers might approach 100+ ms due to network distance, but those would be rare. We will monitor latency and ensure garbage collection or other system overhead doesn’t degrade it (possibly by using languages or settings that minimize stop-the-world pauses, since any pause halts ID generation temporarily). The design inherently is real-time; we’re not doing heavy I/O per request, so meeting latency is mostly about keeping the system healthy under load.
- Sharding Logs/Monitoring: At high QPS, logging every request is impossible. We will sample logs or aggregate counts. For monitoring throughput, each node can maintain a counter of IDs generated and perhaps export metrics (like IDs/sec, sequence utilization, etc.). This helps ensure we know if we approach any limits (for example, if we start seeing sequence values consistently hitting 4090+ in some milliseconds, it indicates that node is at capacity and we should add nodes). Similarly, monitoring the drift of clocks could be done by comparing timestamps in IDs from different nodes occasionally (but that’s advanced; more simply NTP monitoring and alerts on any big adjustments suffice).
- Trade-offs and Choices Recap: Our chosen design trades off a bit of complexity in coordination and clock management for a huge gain in performance and scalability. We deliberately do not provide absolute global ordering of IDs – strict ordering would require funneling through a single sequence or a global consensus each time, which is incompatible with the throughput needs. Instead, we get approximate ordering (by timestamp) which is sufficient for most use-cases (e.g., sorting by ID gives you almost-sorted by creation time, with rare inversions if clocks skew by a bit). We also trade the theoretical possibility of UUID collisions for the practicality of no central coordination – which is a good trade given the probabilities involved. By offering two types of IDs, we give flexibility: Snowflake IDs for when sorting or smaller size is needed, and UUIDs for when non-guessability is paramount. The cost is maintaining two code paths, but that’s minor. Each ID type has its domain of optimal use, and using the wrong one can have downsides (e.g., using Snowflake IDs in URLs where someone could guess an ID and fetch data might be a security risk, or using UUIDs as DB keys can bloat indexes). The system design covers both, so the architects of various services can choose accordingly.
In summary, this design is comprehensive and robust: it can easily scale to the required throughput and beyond by adding nodes (thanks to the distributed nature of ID generation) and meets the <200ms latency goal by keeping operations local and lightweight. We avoid single failure points by eliminating central databases in the hot path and using redundancy for the coordinator and nodes. The result is a globally unique ID generation system that provides two flavors of IDs while maintaining consistency, performance, and reliability at scale. Each component’s role is well-defined, and the overall system follows proven patterns used by high-scale platforms like Twitter and others, giving us confidence in its viability for our requirements.