0% completed
Database indexes are designed to improve the speed and efficiency of data retrieval operations. They function by maintaining a separate structure that points to the rows in a table, allowing the database to look up data more quickly without scanning the entire table.
There are various types of database indexes, each with its unique characteristics and use cases. Understanding these different index types is crucial for optimizing the performance of database systems and ensuring efficient data retrieval.
In this section, we will explore several common types of database indexes.
Primary Index
A Primary Index is the index on a table’s primary key. The primary key is a column (or set of columns) that uniquely identifies each row. Most relational databases automatically create an index on the primary key to enforce uniqueness and allow fast lookup by the primary key. There can only be one primary index per table because a table has only one primary key. This index usually organizes the data by the primary key values, especially in systems where the primary key index is also the clustered index (e.g. InnoDB in MySQL or MS SQL Server by default).
Performance considerations: A primary index provides very efficient access to records by the primary key. Lookups and joins based on the primary key are fast because the index is typically a B-tree optimized for finding the exact key. Since the primary index is unique by definition, the database can stop searching as soon as it finds the one matching row. Maintaining a primary index has minimal overhead beyond any other index – inserts and updates to primary-key values (which are rare, since primary keys seldom change) will update the index. The primary index also helps enforce referential integrity (foreign keys often reference a primary key index for fast checks). In summary, the primary index improves read performance for identifying specific rows, with very little downside since every table usually needs a primary key.
Use cases: Use a primary index on the main identifier of the table. This is usually done automatically by declaring a PRIMARY KEY. For example, an Employees
table might use an EmployeeID
as the primary key; the database will create a primary index on EmployeeID
so that queries like SELECT * FROM Employees WHERE EmployeeID = 123
are very fast. Essentially every relational table that has a primary key will have a primary index – it’s the foundational index for uniquely identifying rows.
Example: Creating a table with a primary key will automatically create a primary index on that key. For instance, in SQL:
CREATE TABLE Customers ( CustomerID INT PRIMARY KEY, Name VARCHAR(100), Address VARCHAR(200) );
Here, CustomerID
is the primary key, and the database builds a primary index on it so that any lookup by a specific CustomerID
is efficient.
Unique Index
A Unique Index ensures that all values in the indexed column (or combination of columns) are distinct. In other words, it enforces a uniqueness constraint – no two rows can have the same key value. Database systems often automatically create a unique index when you define a UNIQUE constraint or a primary key (primary keys are by nature unique). You can also create additional unique indexes on other columns that must be unique (for example, a username or email address in a users table). A unique index functions like a regular index for lookup performance, but with the added rule that duplicate values are not allowed.
Performance considerations: Unique indexes speed up read queries just as non-unique indexes do, and in addition they guarantee data integrity by preventing duplicates. Reads (SELECTs) using a unique index are very efficient – and the database also knows that at most one row will match, which can slightly optimize certain plans. The trade-off is on writes: whenever a new row is inserted or an indexed column is updated, the database must check the unique index to ensure no duplicate value exists, which adds a small overhead. This means inserts/updates can be a bit slower due to the uniqueness check. However, this cost is usually worth it to maintain data correctness. Storage overhead for a unique index is similar to a normal index.
Use cases: Use unique indexes for any column that requires unique values. Besides primary keys, this includes candidate keys or business rules – e.g. an email address in a users table (to ensure no two users share the same email), or a social security number field that must be unique per person. Unique indexes are optimal when you frequently query by that column as well, because they both enforce integrity and accelerate lookups. Keep in mind that you can have multiple unique indexes per table (unlike primary index which is one).
Example: Creating a unique index (or constraint) on a column. Suppose we have a Users
table and want Email
to be unique:
-- Create a unique index on the Email column to prevent duplicates CREATE UNIQUE INDEX idx_users_email ON Users(Email);
After this, any SELECT
or join by Email
can use the index, and any attempt to insert a duplicate email will be blocked by the index. Thus, queries like SELECT * FROM Users WHERE Email = 'alice@example.com'
are fast, and the index ensures there’s at most one 'alice@example.com'
in the table.
Clustered Index
A Clustered Index determines the physical order of data in the table. In a clustered index, the table’s rows are stored on disk in the same order as the index key (essentially the index is the table). This means that there is no separate structure holding pointers to data – the index’s leaf nodes contain the actual data rows. Because of this, a table can have only one clustered index (the data can only be sorted one way on disk). Often, the primary key serves as the clustered index by default (e.g. in SQL Server or InnoDB), but it can be a different column if chosen. If a table has a clustered index, other indexes on that table are called non-clustered (or secondary) indexes.
Performance considerations: Clustered indexes are very efficient for range queries and sorting on the indexed key, because the data is already ordered on disk. For example, if you query a range of values (say, all records between two dates on a date-column clustered index), the database can retrieve that continuous block of rows directly from disk with minimal seeking. Similarly, retrieving data in sorted order on the cluster key is fast (no extra sorting step) because the rows are stored sorted. Clustered indexes thus shine for range scans, ordering, and grouping by the indexed column. They also tend to improve locality of reference – adjacent rows (by index order) are stored near each other, which can benefit I/O when reading many sequential rows.
However, clustered indexes have some trade-offs. Writing data can be slower if new rows need to be inserted in sorted order; inserting a row in the middle of existing data may require shifting rows or splitting data pages to maintain the order. Updates to the cluster key (if they occur) are expensive for the same reason (the row might move). Also, since there can be only one clustered index, you must choose it wisely based on query patterns. If you frequently need data sorted or ranged by a particular column, that’s a good candidate. If not, many systems default to clustering on the primary key. Additionally, non-clustered indexes on a clustered table use the clustered key as a pointer to locate data (instead of a direct physical pointer), which adds a bit of size to those indexes.
Use cases: Use a clustered index for columns on which you often retrieve ordered data or ranges of data. A classic use case is a date/timestamp column for a log or history table – clustering on the date makes it efficient to fetch a time range or the latest records. Similarly, if you always query a customer table sorted by LastName
, a clustered index on LastName
makes sense (as long as insert patterns are manageable). Many OLTP databases cluster on an auto-incrementing ID (primary key) because new inserts then append to the end (minimizing reorganization), and joins by ID are fast. In data warehouses, clustering on a date or other sequential key can benefit range queries. If a table is small, clustering is less critical, but on large tables the choice of clustered index can significantly impact performance for certain queries.
Example: In SQL Server (or other systems that support explicit clustered indexes), you can create one like so:
-- Cluster the Customer table by last name, then first name (alphabetical order) CREATE CLUSTERED INDEX idx_cust_name ON Customers(LastName ASC, FirstName ASC);
This physically sorts the Customers
table by last name and first name. Queries such as SELECT * FROM Customers ORDER BY LastName, FirstName
or range queries like “LastName BETWEEN 'A' and 'C'” will be very efficient because the data is stored in sorted order. Keep in mind that only one clustered index (here on LastName+FirstName) can exist – if the primary key is different, that primary key would be supported by a non-clustered unique index instead.
Non-Clustered Index
A Non-Clustered Index (also called a secondary index) is a standard index separate from the actual data storage. In a non-clustered index, the index structure (typically a B-tree) stores the indexed column values and pointers (references) to the actual table rows that contain the rest of the data. The table’s data is not sorted by this index; instead, the index is like a lookup table that can quickly find the locations of rows matching a given key. Non-clustered indexes are analogous to an index in a book – the book’s content is in its original order, and the separate index lists keywords with pointers (page numbers) to where those topics are found. A table can have multiple non-clustered indexes on different columns, since they don’t dictate physical order.
Performance considerations: A non-clustered index greatly speeds up queries that filter or join on the indexed column, by allowing the database to jump directly to matching records instead of scanning the whole table. For example, an index on LastName
in an Employees
table lets a query WHERE LastName = 'Smith'
quickly find all “Smith” entries. Non-clustered indexes are very flexible – you can create many of them to support different queries – but each additional index adds storage overhead and slows down writes (each insert/update/delete must update all relevant indexes). When a query uses a non-clustered index, the database will traverse the index to find matching key values, then follow the pointers to retrieve the full rows. If the index covers the query (contains all needed columns), it may not need to fetch the table rows (see Covering Index below); otherwise, there is an extra step (called a bookmark lookup or index-to-table lookup) to get the remaining data. This lookup is generally fast for a few records, but if a non-clustered index returns many matches, the benefit may diminish because each row retrieval incurs additional I/O.
Compared to clustered indexes, non-clustered indexes do not guarantee physical locality of data – the rows may be scattered – so range scans will fetch each matching row via pointers. If many rows qualify, a table scan might even be cheaper, which the optimizer will consider. Thus, non-clustered indexes are best for selective queries (that return a small percentage of the table). They are not helpful for queries that return a very large fraction of the table (in such cases, a full table scan is usually more efficient than hopping through the index).
Use cases: Virtually any column that is frequently used in WHERE
clauses, join conditions (ON
clauses), or used for sorting (ORDER BY
) is a candidate for a non-clustered index if it isn’t already the clustered (or primary) index. For example, if you often query an Orders
table by CustomerID
, and the table is not clustered by customer, you would create a non-clustered index on CustomerID
to speed up those lookups. Similarly, an index on LastName
in an Employees
table helps searches by last name, and an index on ProductCategory
in a Products
table accelerates filtering by category. You can have multiple non-clustered indexes per table, so you tailor them to your query patterns. Just be cautious with creating too many; each index will need maintenance on writes and consumes disk/RAM.
Example: Creating a non-clustered index on a single column is straightforward:
-- Create a non-clustered index on the LastName column of Employees CREATE INDEX idx_emp_lastname ON Employees(LastName);
This index allows fast searches by last name. For instance, SELECT * FROM Employees WHERE LastName = 'Doe'
will use the index to find pointers to all "Doe" records and then retrieve them. The Employees
data itself remains in its original order (perhaps clustered by another key or as inserted), but this index serves as a quick lookup structure for the LastName field.
Composite Index
A Composite Index (also called a multi-column index or concatenated index) is an index on two or more columns combined. The index treats the combined set of columns as a single search key. For example, an index on (LastName, FirstName)
in a table effectively sorts by LastName then FirstName together. Composite indexes can speed up queries that involve conditions on multiple columns because the query can use one index lookup instead of having to combine results from separate single-column indexes.
Performance considerations: A well-chosen composite index can significantly improve performance for multi-column queries. If a query has a WHERE
clause involving both column A and B, a composite index on (A, B) can directly locate the matching (A,B)
pairs. In contrast, using separate single-column indexes might require the database to perform an index intersection or do additional filtering. Composite indexes also help with sorting and grouping on multiple columns. They can even act as a covering index if they include all columns the query needs (either as index keys or included columns). Additionally, composite indexes can save space and maintenance cost by consolidating into one index what might otherwise require two or more indexes. For instance, instead of having one index on A and another on B, a composite on (A,B) might serve both needs if queries commonly use both columns together.
However, the order of columns in a composite index is crucial. Most databases use a leftmost prefix rule: the index can be used for a query that filters on the first column, or on the first and second, etc., but not solely on a later column without the earlier one. For example, an index on (A, B) can be used for a query on A alone, or A and B together, but typically not for a query on B alone (if A is not specified). (Some databases have optimizations like index skip scans, but those are exceptions.) This means if you create a composite index, you should list the columns in the order that matches your common query patterns. Composite indexes tend to be larger (since they store multiple values per entry) and will incur overhead on any operation that changes any of the indexed columns.
Use cases: Use composite indexes for queries that consistently involve multiple columns. A classic example is a table of addresses with columns (City, State); queries often filter by State and City together, so an index on (State, City) makes sense. Another example is an Orders
table where queries ask for orders by Customer within a certain Date – an index on (CustomerID, OrderDate) would efficiently support WHERE CustomerID = ? AND OrderDate BETWEEN ...
. Composite indexes are also useful for enforcing uniqueness across a combination of fields (e.g. an index on (FirstName, LastName) if you wanted to ensure no two people share the same full name in a table – though that’s a business requirement not commonly enforced, it illustrates the point). Always consider how the columns are used: if a composite index’s first column isn’t used alone in queries, it may not be helpful to have a separate single-column index on it, whereas if the second column is also frequently queried by itself, you might need an additional index just on that second column.
Example: Suppose we have an Orders
table frequently queried by Customer and Date. We can create a composite index:
-- Composite index on CustomerID and OrderDate CREATE INDEX idx_orders_cust_date ON Orders(CustomerID, OrderDate);
This single index can accelerate queries like SELECT * FROM Orders WHERE CustomerID = 42 AND OrderDate >= '2024-01-01' AND OrderDate < '2024-02-01';
because it will find the customer 42’s orders and then filter by the date range in one index traversal. It will also help a query WHERE CustomerID = 42
alone (using the leftmost CustomerID
part). However, a query filtering only by OrderDate
(without CustomerID) cannot use this index efficiently because OrderDate
is the second component. In summary, this composite index targets queries that specify a customer and a date or date range.
Full-Text Index
A Full-Text Index is a specialized index for text search. Unlike a normal index which treats an entire column value as a single entry, a full-text index breaks text data into words or tokens and indexes those for rapid substring and keyword searches. Full-text indexes are typically implemented as an inverted index: it maps each word (term) to a list of documents or rows that contain that word. This allows efficient search for documents containing one or more words, supporting features like phrase searches, boolean text searches, and relevancy ranking. Full-text indexing is essential for searching large text fields such as articles, product descriptions, documents, or comments, where queries might ask for "rows that contain the word X".
Performance considerations: Full-text indexes greatly improve the performance of text searches. For example, a query to find all articles containing the word “database” can use the full-text index to directly retrieve matches, instead of scanning every article’s text. Traditional B-tree indexes cannot efficiently handle LIKE '%keyword%'
searches because they are not optimized for arbitrary substring matches. A full-text index, by indexing individual words, can handle such queries in sub-linear time. However, maintaining a full-text index has a cost. Building the index initially can be resource-intensive (as it involves parsing all text, removing common stop-words, stemming words depending on language, etc.), and updates to text columns require updating the inverted index. In many systems, full-text index updates might be batch-processed or asynchronous to mitigate overhead. Full-text indexes also consume additional storage, as they store a lot of tokens (though often with compression). Another consideration is that full-text search engines typically operate with specific queries (e.g., using MATCH() ... AGAINST()
in MySQL or CONTAINS()
in SQL Server) – the query engine needs to use the full-text search mechanism explicitly.
Despite the overhead, for applications that need to search text, full-text indexes are the only practical way to achieve good performance. They often support advanced features: ranking results by relevance, ignoring common “noise” words, support for inflection or stemming (searching “run” also finds “running”), etc., which go beyond what a basic index does.
Use cases: Use a full-text index for columns containing large unstructured text that users need to search through. Typical use cases are: article or blog content, product descriptions (for e-commerce search), document repositories, comment or message text, etc. For example, a support knowledge base might have a full-text index on the article text so that a query for certain error keywords returns relevant articles quickly. Full-text search is also used in implementing search boxes on websites or applications – e.g., searching all fields of a customer profile for a keyword. If your queries are only looking for exact matches or simple prefixes, a regular index might suffice, but for sophisticated text queries (multiple keywords, any position in text, relevance scoring), full-text is ideal.
Example: In MySQL, you can create a full-text index on a text column like this:
-- Full-text index on the Content column of an Articles table (MySQL example) CREATE FULLTEXT INDEX idx_articles_content ON Articles(Content);
Once created, queries can use MATCH(Content) AGAINST('database')
to efficiently find articles that mention “database”. In SQL Server, full-text indexing involves creating a full-text catalog and index, and then using the CONTAINS
or FREETEXT
predicates for querying. The idea in all cases is the same: the text is tokenized and indexed by terms, enabling fast searches for those terms in the text data (far faster than wildcard LIKE queries on large text).
Hash Index
A Hash Index uses a hash table internally for indexing, rather than a tree structure. In a hash index, the index key (e.g., a value from a column) is run through a hash function which determines which "bucket" or slot the pointer to the row will be stored in. Hash indexes are excellent for equality lookups – i.e., queries that ask for records matching a specific value (using =
or IN
comparisons). The hash function computation leads directly to the location of the matching entries, giving average O(1) lookup time, which can be faster than traversing a B-tree (O(log N)) for very large datasets. However, hash indexes do not maintain any order among the keys, so they cannot be used for range scans (<
, >
, BETWEEN queries) or for sorting.
Performance considerations: For exact-match queries, a hash index is extremely fast. If you query WHERE AccountNumber = 12345
, a hash index on AccountNumber can compute the hash and jump directly to the matching entry (if any) in constant time. This makes hash indexes attractive for workloads that are heavily key-value oriented (many lookups by exact key). Additionally, the performance of hash indexes doesn’t degrade much even as the table grows, except that a very full hash table might have collisions (multiple values hashing to the same bucket) which the index must handle (commonly via linked list or overflow areas). In case of collisions, lookup becomes a bit slower because the index might need to check a few entries, but a good hash function minimizes this.
On the downside, because hash indexes lack ordering, they cannot be used for range queries or ORDER BY operations. If you do WHERE Name > 'M'
or ORDER BY Name
, a hash index on Name is useless – the database would have to scan the table because the hash index doesn’t store keys in sequence. Also, hash indexes typically only support simple equality, not even prefix matches (like LIKE 'John%'
cannot use a hash index effectively). Another consideration is that some databases do not support general hash indexes, or they have limitations. For example, PostgreSQL supports hash indexes but for many years they were not WAL-logged (meaning they weren’t crash-safe) until recent versions; even now, B-tree indexes are usually preferred unless you have a very specific need. MySQL’s Memory engine uses hash indexes by default for in-memory tables, because memory tables are often used as fast lookup caches. But InnoDB (the main storage engine) doesn’t allow user-created hash indexes explicitly – it relies on B-trees, sometimes using adaptive hashing behind the scenes for frequently accessed pages. So, the availability of hash indexes depends on the database system.
In terms of maintenance, updating a hash index on inserts and deletes is generally fast (compute a hash and insert or remove an entry). If the hash table grows too full, it might need resizing (rehashing), which can be an expensive operation, but that’s not frequent if sized well. Hash indexes also typically don’t provide any statistical information about data distribution (which query optimizers use for planning) beyond perhaps the number of entries, whereas B-trees inherently have some ordering structure that can be exploited.
Use cases: Use a hash index when you have a very large table and frequently need to retrieve or check existence of records by a specific key value, and you never need to do range queries on that key. This is common in key-value stores or certain NoSQL systems, but in SQL/RDBMS it’s less common since B-tree indexes handle most use cases well. That said, a hash index might be beneficial for something like an in-memory lookup table, or a table that serves as a cache of key-to-value mappings where each query is an exact key lookup. Another scenario is indexing cryptographic hashes or uniformly distributed identifiers – if you frequently look up by a long hash value, a hash index would directly target that value. In PostgreSQL, if you have a table where only equality searches are used on a particular column and performance is critical, you could consider a hash index for that column (after ensuring it’s supported and advantageous).
Example: In PostgreSQL, creating a hash index explicitly looks like:
-- PostgreSQL: Create a hash index on the CustomerCode column CREATE INDEX idx_customer_code_hash ON Customers USING HASH(CustomerCode);
This creates a hash-based index on CustomerCode
. Now a query SELECT * FROM Customers WHERE CustomerCode = 'ABC123'
will use the hash index to directly hash 'ABC123'
and find the matching row(s) very quickly. But a query with a range or inequality on CustomerCode
(e.g. CustomerCode < 'ABC999'
) cannot use this index. In MySQL (Memory engine), if you create a table ENGINE=MEMORY
, any index you create on it is a hash index by default, optimized for fast lookups in RAM. For disk-based tables, MySQL doesn’t allow specifying USING HASH
(as of current versions) – it sticks to B-trees. Always consider whether your queries are exclusively equality lookups; if so, and your DB supports it, a hash index can be a powerful tool for performance.
.....
.....
.....
Table of Contents
Contents are not accessible
Contents are not accessible
Contents are not accessible
Contents are not accessible
Contents are not accessible