On this page
Understanding the Database Index
The Problem with Randomness
The Consequence: Fragmentation
The Pro Move: Sortable Identifiers
How Sortable IDs Work
Why This Architecture Wins
Modern Implementations
Conclusion
From UUID to Snowflake: Understanding Database Fragmentation


When you begin designing the architecture for a new software system, you face an immediate technical decision. You must determine how to uniquely identify every record in your database.
In a traditional, single-server environment, this is a straightforward task. You rely on the database to manage a counter for you.
The first record is 1, the second is 2, and the third is 3. This method, known as auto-increment, is efficient and easy to implement.
However, modern software development rarely stays on a single server.
As you scale, you might have multiple database nodes or several distinct services creating data simultaneously.
In this distributed environment, a simple counter fails.
You cannot have two different servers generating the identifier "100" at the same moment. They would need to coordinate with each other constantly to avoid duplicates, which slows down the entire system.
To solve this, developers often turn to the UUID (Universally Unique Identifier).
A UUID is a 128-bit label that is statistically guaranteed to be unique. You can generate one on any server, at any time, without checking with a central authority. It solves the coordination problem perfectly.
But there is a significant catch.
While UUIDs are excellent for uniqueness, they can be disastrous for database performance.
If you use a standard UUID as your Primary Key, you are likely introducing a hidden bottleneck that will slow down your write speeds and waste system resources as your data grows.
Understanding the Database Index
To understand why UUIDs cause performance degradation, you must first understand how a relational database stores data physically on the disk.
Most common databases, such as MySQL, PostgreSQL, and SQL Server, use a data structure called a B+ Tree to manage their indexes.
A B+ Tree is designed for speed. It allows the database to find, insert, and delete records with minimal effort.
The defining characteristic of a B+ Tree is that it is ordered.
The database does not store your records in the order they arrived. It stores them in the order of their Primary Key. This is often referred to as a Clustered Index.
The physical arrangement of the data on the hard drive mirrors the logical order of the keys.
When you use an auto-incrementing integer, the database is efficient. It receives the key "100", writes it to the disk, and then receives "101".
Because "101" is numerically greater than "100", the database simply appends the new record to the end of the current data file.
This is a Sequential Write. It is the fastest operation a disk drive can perform.
The database fills up one page of storage, seals it, and moves on to the next one.
The Problem with Randomness
Standard UUIDs (specifically Version 4) are completely random.
There is no pattern or sequence to them. An ID generated now might be alphanumerically smaller or larger than an ID generated five minutes ago.
When you force a B+ Tree to index random data, you break the sequential optimization.
Imagine your database has millions of rows. The B+ Tree is massive and perfectly sorted. You now ask the database to insert a new, completely random UUID.
The database looks at this new key and realizes it does not belong at the end of the file. Mathematically, this random value belongs somewhere in the middle of the existing data.
The database must now perform a series of expensive operations:
-
Traverse the Tree: It must search through the index to find the exact specific leaf node where this random value belongs.
-
Load the Page: It must pull that specific page of data from the disk into the computer's RAM.
-
Page Splitting: This is the critical failure point. Often, the page where the new key belongs is already full. The database cannot fit the new row in. To make space, it must perform a "split." It takes the full page, allocates a new empty page, and moves half of the existing data over.
-
Rebalancing: After the split, the database must update the parent nodes in the tree to point to the new locations.
The Consequence: Fragmentation
This process of constant splitting leads to Database Fragmentation.
Because pages are split in the middle to accommodate random inserts, you end up with thousands of data pages that are only half-full.
Your database now occupies significantly more disk space than the actual data requires.
Furthermore, this destroys your memory efficiency.
Databases rely on a "Buffer Pool" (a RAM cache) to hold frequently accessed data.
When you write sequentially, the database only needs to keep the most recent page in RAM.
When you write randomly, the database must constantly load old, random pages from the disk into RAM to modify them. This pushes valuable data out of the cache and forces the disk drive to work much harder.
This results in Random I/O, which is significantly slower than Sequential I/O.
As your table size increases, your write latency will degrade, and your CPU usage will spike.
The Pro Move: Sortable Identifiers
We are now faced with a conflict.
- Requirement A: We need the distributed uniqueness of a UUID.
- Requirement B: We need the sequential performance of an integer.
The solution used by large-scale systems is to create a hybrid identifier.
We need an ID that is globally unique but also roughly sorted by time.
This architectural pattern is often called K-Sortable IDs or Time-Based IDs.
The most famous implementation of this concept is Twitter Snowflake.
How Sortable IDs Work
The logic behind a Sortable ID is to construct the 64-bit or 128-bit key using distinct components, rather than generating a purely random string.
The structure typically follows this order:
- The Timestamp (Leading Bits): The first and most significant part of the ID is the current time in milliseconds.
- The Machine ID (Middle Bits): The next part is a unique number assigned to the specific server generating the ID.
- The Sequence Number (Ending Bits): The final part is a simple local counter that resets every millisecond to handle collisions on a single machine.
By placing the timestamp at the beginning of the ID, we satisfy the B+ Tree.
In a binary system, comparison starts from the most significant bit (the left) to the least significant bit (the right).
Because time always moves forward, an ID generated at 10:00 AM will always be numerically larger than an ID generated at 9:59 AM, regardless of which server created it.
Why This Architecture Wins
When you insert these sortable IDs into your database, the B+ Tree sees a stream of data that is always increasing.
The new record might not be strictly sequential (like 1, 2, 3), but it is always "newer" than the previous data. The database can append this new record to the end (or very near the end) of the index.
This restores the Sequential Write performance. The database fills up a page, closes it, and moves to the next. It eliminates the need for expensive Page Splits. It keeps your storage pages full and compact. It ensures your RAM cache remains efficient.
Modern Implementations
You do not always need to write a custom Snowflake generator. The industry has recognized this problem, and there are standard implementations available:
- UUID Version 7: Unlike the random Version 4, the new Version 7 standard is designed specifically for database keys. It embeds a Unix timestamp at the start of the UUID to ensure sortability.
- ULID: This is another popular format that combines a timestamp with random characters to create a sortable, unique string that is URL-safe.
Conclusion
In system design, choosing the path of least resistance often leads to technical debt. Using a standard random UUID is the default for many frameworks, but it ignores the physical reality of how databases function.
Here are the key takeaways:
- Indexes require order: Relational databases utilize B+ Trees, which are optimized for sequential data.
- Randomness is expensive: Inserting random keys forces the database to perform Page Splits, which increases disk I/O and CPU load.
- Fragmentation wastes resources: Random writes leave storage pages half-empty, bloating your database size and reducing cache efficiency.
- Sortable IDs are the solution: Using identifiers that start with a timestamp (like Snowflake or UUID v7) provides the uniqueness required for distributed systems while maintaining the high performance of sequential writes.
By shifting from random UUIDs to sortable IDs, you ensure your database architecture is scalable, efficient, and ready for high-volume traffic.
What our users say
Tonya Sims
DesignGurus.io "Grokking the Coding Interview". One of the best resources I’ve found for learning the major patterns behind solving coding problems.
Steven Zhang
Just wanted to say thanks for your Grokking the system design interview resource (https://lnkd.in/g4Wii9r7) - it helped me immensely when I was interviewing from Tableau (very little system design exp) and helped me land 18 FAANG+ jobs!
KAUSHIK JONNADULA
Thanks for a great resource! You guys are a lifesaver. I struggled a lot in design interviews, and this course gave me an organized process to handle a design problem. Please keep adding more questions.
Designgurus on Substack
Deep dives, systems design teardowns, and interview tactics delivered daily.
Access to 50+ courses
New content added monthly
Certificate of completion
$33.25
/month
Billed Annually
Recommended Course

Grokking the Advanced System Design Interview
37,294+ students
4.1
Grokking the System Design Interview. This course covers the most important system design questions for building distributed and scalable systems.
View Course