what are primary, and secondary indexes in mysql?
In MySQL, indexes are used to improve the speed of data retrieval operations on a table. There are different types of indexes, with primary and secondary being the most common. Here's a breakdown:
Definition: The primary index is created automatically when you define a PRIMARY KEY on a table.
Uniqueness: It enforces uniqueness β no two rows can have the same primary key value.
One per table: A table can have only one primary index.
Clustered Index: In InnoDB (the default storage engine), the primary index is a clustered index. This means:
Table data is physically stored in the order of the primary key.
The index includes the actual row data.
sqlCREATE TABLE users ( id INT PRIMARY KEY, name VARCHAR(100) );
Here,
id
is the
primary index.
Definition: Any index that is not the primary index is called a secondary index.
Multiple allowed: You can have many secondary indexes on a table.
Non-unique or unique: Secondary indexes can be unique or non-unique.
Structure in InnoDB:
A secondary index does not store the full row.
It stores the indexed column(s) plus the primary key to locate the actual row data.
This means an extra lookup is needed to fetch the full row (called a bookmark lookup or index lookup).
sqlCREATE INDEX idx_name ON users(name);
Here,
idx_name
is a
secondary index
on the
name
column.
Feature | Primary Index | Secondary Index |
---|---|---|
Uniqueness | Must be unique | Can be unique or not |
Quantity per table | Only one | Multiple allowed |
InnoDB behavior | Clustered (stores row data) | Non-clustered (references PK) |
Use case | Row identification | Query optimization |
Let me know if you want to see how indexing affects performance or how to choose indexes.
Sure! Here's a complete example of a MySQL table that uses both a primary index and a secondary index.
Suppose you have a
users
table where:
id
is
the
primary key
(and thus has a
primary index).
email
is a field you often search on, so you create a
secondary index
for it.
sqlCREATE TABLE users ( id INT PRIMARY KEY, -- Primary Index (auto-created on PRIMARY KEY) name VARCHAR(100), email VARCHAR(100), created_at DATETIME ); -- Create a secondary index on the 'email' column CREATE INDEX idx_email ON users(email);
Primary Index (on
id
):
The table's data is
physically ordered
by
id
.
The index includes all column data (name
,
email
,
created_at
).
Secondary Index (idx_email
):
The index stores
email
+ the associated
id
(primary key).
To retrieve full row data, MySQL uses the primary key to perform a lookup in the clustered index.
sqlSELECT * FROM users WHERE id = 101;
Efficient, uses the clustered primary index.
sqlSELECT * FROM users WHERE email = 'alice@example.com';
Uses the
idx_email
index to find matching email(s), then uses
id
to
fetch full row data.
If
email
must be unique:
sqlCREATE UNIQUE INDEX idx_email_unique ON users(email);
This still counts as a secondary index, but with a uniqueness constraint.
Let me know if you'd like to see how to analyze query
performance using
EXPLAIN
.
Great question!
When we say a table is "physically ordered" by the primary key in InnoDB (MySQL's default storage engine), it means:
The rows on disk are stored in the order of the primary key values, not in the order you insert them.
This is known as a clustered index structure.
Table data is stored in the order of insertion.
Indexes point to the location (file offset) of the row.
Table data is stored inside the primary key index itself.
So, the primary key index is the table.
As a result, rows are automatically sorted by the primary key on disk.
Suppose you have this data:
sqlINSERT INTO users (id, name) VALUES (3, 'Alice'), (1, 'Bob'), (2, 'Charlie');
Even though you inserted in order 3 β 1 β 2, InnoDB stores them on disk like:
1 β Bob 2 β Charlie 3 β Alice
Because the
primary key is
id
, and it's stored in
sorted B-tree order.
Fast primary key lookups: No extra step to find the row.
Range scans are efficient:
sqlSELECT * FROM users WHERE id BETWEEN 2 AND 5;
Reads rows in order without jumping around on disk.
Secondary indexes reference the primary key, so having a good primary key design impacts overall performance.
Let me know if you want a visual of the B-tree or how InnoDB organizes data on disk!
Perfect! Let's go through both:
In InnoDB, indexes (both primary and secondary) are stored using a B-tree structure. Here's a simplified view:
id
)
Suppose you have a
users
table with:
sqlid | name ------------- 1 | Bob 2 | Charlie 3 | Alice
id
)
[2] / \ [1] [3] Leaf nodes: 1 β Bob 2 β Charlie 3 β Alice
The
B-tree nodes
are ordered by
id
.
Leaf nodes contain entire row data.
This is why it's called "clustered" β the table is the index.
email
)
Letβs say you create a secondary index:
sqlCREATE INDEX idx_email ON users(email);
Hereβs how that looks internally:
email
)
[bob@a.com] / \ [alice@a.com] [charlie@a.com] Leaf nodes: alice@a.com β id=3 bob@a.com β id=1 charlie@a.com β id=2
The
leaf nodes
store the indexed column (email
) + the
primary key (id
).
To get full row data, InnoDB does a
lookup in the primary index
using the stored
id
.
Type | Stores Row Data? | Ordered By | Efficiency |
---|---|---|---|
Primary Index | β Yes | Primary Key | Fastest for primary key lookups |
Secondary Index | β No (just points to PK) | Indexed column | Extra lookup needed to get row |
Would you like to see a real
EXPLAIN
plan to show how MySQL uses these indexes in practice?
Great observation! In a
B-tree,
the
root node
being 2
in
the example is
just a simplification
to show how values are
organized logically, not how they are inserted or which value is
"first."
Let me explain more clearly.
2
the
Root?
In a B-tree:
The root node contains pointers to child nodes, helping to navigate to the correct leaf.
The value in the root is often the "split point" used to direct the search.
It doesnβt mean
2
was
inserted first or is more important β it's just a
middle value used to balance the tree.
So this example:
[2] / \ [1] [3]
means:
If you're looking for a value β€ 2, go left.
If you're looking for a value > 2, go right.
The tree is balanced, which is the key property of a B-tree.
In reality:
MySQL doesn't just make
2
the
root randomly.
It builds the tree incrementally as you insert data.
As more rows are added and nodes overflow, pages (blocks of storage) are split, and intermediate nodes are created.
Thatβs why the actual root value can change over time as the table grows.
Concept | Meaning |
---|---|
Root node value | Just a logical division point, not necessarily the first inserted. |
B-tree property |
Keeps tree
balanced
for fast lookups (O(log n) ).
|
Clustered index | Leaf nodes contain full row data, ordered by primary key. |
Secondary index | Leaf nodes contain indexed column + primary key. |
Would you like a real example of B-tree growth with inserts or a diagram showing how a B-tree splits and rebalances?
Great question! MySQL (specifically InnoDB) and PostgreSQL use very different approaches to indexing and physical row storage, especially regarding the primary key and clustering. Here's a clear comparison:
Feature | MySQL (InnoDB) | PostgreSQL |
---|---|---|
Primary index | Clustered index β table data is stored in the primary key B-tree. | Non-clustered by default. Data is stored separately from indexes. |
Table structure | Data is part of the primary key index. | Table = heap (unordered rows). Indexes point to tuples. |
Row lookup via index | Secondary indexes use primary key for lookup. | Indexes use a TID (tuple ID) pointer to locate rows. |
Index type | B+Tree (for both primary & secondary indexes). | B+Tree by default (also supports GIN, GiST, BRIN, etc.). |
Clustering | Always clustered by primary key in InnoDB. |
Can be manually clustered (via
CLUSTER ), but itβs
not maintained automatically.
|
MVCC (multi-version concurrency control) | Versioning handled in undo log. | Versioning stored in each row (via hidden xmin/xmax columns). |
HOT updates | Not supported. | Supports HOT (Heap Only Tuples) updates β avoids index rewrites for some updates. |
Only one primary index, and it defines the physical order of rows.
Secondary indexes must use the primary key to find data (extra lookup).
Efficient for range queries on primary key but rigid structure.
Rows are not stored in index order.
Each index stores a pointer (TID) to the row's physical location.
More
flexible, but can lead to
bloat
(dead tuples, need for
VACUUM
).
Allows more index types and indexing strategies.
sql-- Same table in both systems CREATE TABLE users ( id SERIAL PRIMARY KEY, email TEXT ); CREATE INDEX idx_email ON users(email);
In MySQL InnoDB:
Table is physically ordered by
id
.
idx_email
contains
email + id
, and uses
id
to find the full row.
In PostgreSQL:
Table is unordered (heap).
idx_email
contains
email + TID
β directly points to the heap location.
You can do this manually:
sqlCLUSTER users USING idx_email;
But:
It's a one-time operation.
The table won't stay clustered as you insert/update/delete.
Unlike MySQL, clustering is not automatic or maintained.
MySQL (InnoDB): Strongly coupled indexes and data β faster PK lookups, but rigid.
PostgreSQL: More flexible storage and indexing β better for
complex queries and custom index strategies, but
requires maintenance (e.g.,
VACUUM
).
Let me know if you want performance examples or a deep dive into PostgreSQL's MVCC or index types like GIN or BRIN.
Absolutely! Letβs dive into how PostgreSQL handles indexes, from the basics to some of its advanced capabilities.
In PostgreSQL, indexes are separate structures from the actual table data (heap). When you query a table using an indexed column, the index is searched first to locate the TID (tuple ID), which then points to the row in the table.
sqlCREATE TABLE users ( id SERIAL PRIMARY KEY, name TEXT, email TEXT ); CREATE INDEX idx_email ON users(email);
This creates a
B-tree index
on email
.
PostgreSQL uses a B+ Tree structure by default for most indexes.
The index contains:
The
indexed column(s)
(e.g.,
email
)
A TID (tuple ID) which points to the exact row in the heap
Index (B-tree) ---------------------- email: alice@example.com β (TID = page 5, slot 2) email: bob@example.com β (TID = page 6, slot 1)
Then it uses the TID to retrieve the actual row in the heap.
PostgreSQL supports multiple index types, which is a key strength:
Index Type | Use Case |
---|---|
B-tree |
Default. Fast for equality and range queries
(= ,
< ,
> ,
BETWEEN ).
|
Hash | Fast for equality only. Rarely used. |
GIN | Great for full-text search and arrays. |
GiST | Spatial and geometric queries (PostGIS). |
BRIN | Efficient for very large tables with naturally ordered columns. |
SP-GiST | Used for non-balanced data like IPs, quadtrees. |
sqlEXPLAIN ANALYZE SELECT * FROM users WHERE email = 'alice@example.com';
This might return:
Index Scan using idx_email on users Index Cond: (email = 'alice@example.com')
This means:
PostgreSQL uses
idx_email
to find the TID(s).
Then it fetches the matching row(s) from the heap.
PostgreSQL uses MVCC (Multi-Version Concurrency Control), which:
Keeps multiple versions of a row in the heap.
Requires background maintenance (VACUUM
) to clean up old versions.
Indexes always point to one version of the row, but that version might be dead β this is why VACUUM is critical.
You can manually make a table match an index order:
sqlCLUSTER users USING idx_email;
This rewrites the table to match
idx_email
's order.
But it wonβt stay clustered over time unless you re-run it manually.
Feature | PostgreSQL Index Behavior |
---|---|
Index structure | Separate from heap (points via TID) |
Default index type | B-tree |
Maintains row order? | No (heap) |
Row versioning | MVCC with VACUUM |
Unique strengths | Many index types, customizable indexing logic |
Let me know if you'd like a demo of full-text search with GIN indexes, or an example of composite and partial indexes!
Excellent question! In PostgreSQL, what we commonly call secondary indexes (anything that's not the primary key) are handled in a powerful and flexible way β but very differently from MySQL's InnoDB.
In PostgreSQL, all indexes (including primary and unique constraints) are stored separately from the table data (heap). This means:
The primary key is not clustered by default.
Secondary indexes work the same way structurally as a primary index β theyβre just another index.
A secondary index in PostgreSQL:
Contains:
The indexed column values
A TID (tuple ID) β a physical pointer to the row in the heap (page number + row offset)
It does not store the full row.
The TID is used to fetch the actual row from the heap.
Suppose you have:
sqlCREATE TABLE users ( id SERIAL PRIMARY KEY, email TEXT, created_at TIMESTAMP ); CREATE INDEX idx_created_at ON users(created_at);
Now PostgreSQL stores
idx_created_at
like:
B-tree Index (idx_created_at) ----------------------------- '2023-01-01' β (page 12, slot 5) '2023-01-02' β (page 13, slot 3)
So when you run:
sqlSELECT * FROM users WHERE created_at = '2023-01-01';
PostgreSQL scans the index.
Finds the TID (e.g.,
page 12, slot 5
).
Uses the TID to fetch the full row from the heap.
This is known as an "Index Scan" + Heap Fetch.
If your query only needs columns in the index, and PostgreSQL verifies that the row is visible (not outdated), it can skip the heap fetch entirely.
sqlCREATE INDEX idx_email_only ON users(email); SELECT email FROM users WHERE email = 'test@example.com';
PostgreSQL may do an Index-Only Scan β faster, since it avoids heap access.
Note: This only works if PostgreSQL's visibility map shows the row is not dead (i.e., no VACUUM needed).
Feature | PostgreSQL Behavior |
---|---|
Index location | Stored separately from table (heap) |
Lookup method | Uses TID (tuple ID) to point to actual row |
Works the same as primary? | Yes, structurally identical |
Full row access needed? | Yes (unless it's an index-only scan) |
Maintains table order? | No β heap is unordered |
Clustered automatically? |
No (must use
CLUSTER , and itβs not maintained)
|
Would you like to see a real example with
EXPLAIN ANALYZE
to show how PostgreSQL executes queries with secondary
indexes?
Awesome! Letβs walk through a
realistic example
using PostgreSQLβs
EXPLAIN ANALYZE
to see how it handles
secondary indexes.
sqlCREATE TABLE users ( id SERIAL PRIMARY KEY, name TEXT, email TEXT, created_at TIMESTAMP ); -- Secondary index on `email` CREATE INDEX idx_email ON users(email);
Now letβs assume the table has many rows, and we want
to query by
email
:
sqlSELECT * FROM users WHERE email = 'alice@example.com';
EXPLAIN ANALYZE
:
sqlEXPLAIN ANALYZE SELECT * FROM users WHERE email = 'alice@example.com';
Index Scan using idx_email on users (cost=0.29..8.30 rows=1 width=72) Index Cond: (email = 'alice@example.com') Actual time: 0.015..0.017 rows=1 loops=1 Planning Time: 0.095 ms Execution Time: 0.031 ms
Line | Meaning |
---|---|
Index Scan |
PostgreSQL used the
secondary index
idx_email .
|
Index Cond |
It searched the index for
email = 'alice@example.com' .
|
Heap Fetch | Not shown explicitly, but it did one heap lookup using the TID from the index. |
Execution Time | Very fast β index usage was effective. |
Now letβs try a query that could avoid the heap fetch:
sql-- Rebuild the visibility map with vacuum to enable index-only scan VACUUM ANALYZE users; EXPLAIN ANALYZE SELECT email FROM users WHERE email = 'alice@example.com';
Index Only Scan using idx_email on users (cost=0.29..8.30 rows=1 width=32) Index Cond: (email = 'alice@example.com') Heap Fetches: 0
Heap Fetches: 0 β β PostgreSQL didn't touch the heap. It got everything it needed from the index!
Query Type | Index Used? | Heap Access? | Fastest? |
---|---|---|---|
SELECT *
|
β Yes | β Yes | No (needs heap fetch) |
SELECT email
|
β Yes | β Maybe | β If Index-Only Scan is possible |
Would you like a full script to try this yourself with test data, or a diagram of the planner in action?
Great follow-up! In PostgreSQL, when you index two different columns, there are two primary strategies:
(email, created_at)
)
sqlCREATE INDEX idx_email_created_at ON users(email, created_at);
PostgreSQL stores the index like this:
(email, created_at) β TID
B-tree Order Example:
('alice@example.com', '2023-01-01') β TID ('alice@example.com', '2023-01-02') β TID ('bob@example.com', '2023-01-01') β TID
Can be used for:
WHERE email = 'alice@example.com'
WHERE email = 'alice@example.com' AND
created_at = '2023-01-01'
WHERE email = 'alice@example.com' AND
created_at > '2023-01-01'
Cannot efficiently use:
WHERE created_at = '2023-01-01'
(email must come first)
PostgreSQL uses
left-most prefix
of the index.
If you want to search by
created_at
alone, this index
wonβt help
much.
sqlCREATE INDEX idx_email ON users(email); CREATE INDEX idx_created_at ON users(created_at);
Two distinct indexes:
idx_email: ('alice@example.com') β TID ('bob@example.com') β TID idx_created_at: ('2023-01-01') β TID ('2023-01-02') β TID
If your query is:
sqlSELECT * FROM users WHERE email = 'alice@example.com' AND created_at = '2023-01-01';
PostgreSQL may use:
Just one index (and filter on the other column)
Or do a Bitmap Index Scan: it combines both indexes efficiently!
sqlSELECT * FROM users WHERE email = 'alice@example.com' AND created_at = '2023-01-01';
EXPLAIN ANALYZE
Output:
Bitmap Heap Scan on users Recheck Cond: (email = 'alice@example.com' AND created_at = '2023-01-01') -> Bitmap Index Scan on idx_email -> Bitmap Index Scan on idx_created_at
PostgreSQL scans both indexes.
Intersects the results using a bitmap in memory.
Then fetches matching rows from the heap.
Approach | Benefit | Limitation |
---|---|---|
Composite Index | Very fast for queries matching index prefix | Useless if query skips left-most column |
Separate Indexes | More flexible for varying conditions | May be slower; might use Bitmap Scan |
Would you like a demo dataset and a full
EXPLAIN
plan comparing both strategies?