Primary vs Secondary Indexes

what are primary, and secondary indexes in mysql?

In MySQL, indexes are used to improve the speed of data retrieval operations on a table. There are different types of indexes, with primary and secondary being the most common. Here's a breakdown:


Primary Index

  • Definition: The primary index is created automatically when you define a PRIMARY KEY on a table.

  • Uniqueness: It enforces uniqueness β€” no two rows can have the same primary key value.

  • One per table: A table can have only one primary index.

  • Clustered Index: In InnoDB (the default storage engine), the primary index is a clustered index. This means:

    • Table data is physically stored in the order of the primary key.

    • The index includes the actual row data.

Example:

sql
CREATE TABLE users ( id INT PRIMARY KEY, name VARCHAR(100) );

Here, id is the primary index.


Secondary Index

  • Definition: Any index that is not the primary index is called a secondary index.

  • Multiple allowed: You can have many secondary indexes on a table.

  • Non-unique or unique: Secondary indexes can be unique or non-unique.

  • Structure in InnoDB:

    • A secondary index does not store the full row.

    • It stores the indexed column(s) plus the primary key to locate the actual row data.

    • This means an extra lookup is needed to fetch the full row (called a bookmark lookup or index lookup).

Example:

sql
CREATE INDEX idx_name ON users(name);

Here, idx_name is a secondary index on the name column.


Summary

Feature Primary Index Secondary Index
Uniqueness Must be unique Can be unique or not
Quantity per table Only one Multiple allowed
InnoDB behavior Clustered (stores row data) Non-clustered (references PK)
Use case Row identification Query optimization

Let me know if you want to see how indexing affects performance or how to choose indexes.

give me an example with both of them?

Sure! Here's a complete example of a MySQL table that uses both a primary index and a secondary index.


Example: Users Table

Suppose you have a users table where:

  • id is the primary key (and thus has a primary index).

  • email is a field you often search on, so you create a secondary index for it.

SQL Code:

sql
CREATE TABLE users ( id INT PRIMARY KEY, -- Primary Index (auto-created on PRIMARY KEY) name VARCHAR(100), email VARCHAR(100), created_at DATETIME ); -- Create a secondary index on the 'email' column CREATE INDEX idx_email ON users(email);

What Happens Internally (In InnoDB)

  • Primary Index (on id):

    • The table's data is physically ordered by id.

    • The index includes all column data (name, email, created_at).

  • Secondary Index (idx_email):

    • The index stores email + the associated id (primary key).

    • To retrieve full row data, MySQL uses the primary key to perform a lookup in the clustered index.


Query Examples

1. Use the Primary Index:

sql
SELECT * FROM users WHERE id = 101;

Efficient, uses the clustered primary index.

2. Use the Secondary Index:

sql
SELECT * FROM users WHERE email = 'alice@example.com';

Uses the idx_email index to find matching email(s), then uses id to fetch full row data.


Bonus Tip: Unique Secondary Index

If email must be unique:

sql
CREATE UNIQUE INDEX idx_email_unique ON users(email);

This still counts as a secondary index, but with a uniqueness constraint.


Let me know if you'd like to see how to analyze query performance using EXPLAIN.

What do you mean by physically ordered?

Great question!

When we say a table is "physically ordered" by the primary key in InnoDB (MySQL's default storage engine), it means:

The rows on disk are stored in the order of the primary key values, not in the order you insert them.

This is known as a clustered index structure.


πŸ” Let's break it down:

Without clustering (non-InnoDB engines, like MyISAM):

  • Table data is stored in the order of insertion.

  • Indexes point to the location (file offset) of the row.

With clustering (InnoDB):

  • Table data is stored inside the primary key index itself.

  • So, the primary key index is the table.

  • As a result, rows are automatically sorted by the primary key on disk.


πŸ“¦ Visual Example:

Suppose you have this data:

sql
INSERT INTO users (id, name) VALUES (3, 'Alice'), (1, 'Bob'), (2, 'Charlie');

Even though you inserted in order 3 β†’ 1 β†’ 2, InnoDB stores them on disk like:

1 β†’ Bob 2 β†’ Charlie 3 β†’ Alice

Because the primary key is id, and it's stored in sorted B-tree order.


πŸ”„ Why It Matters:

  • Fast primary key lookups: No extra step to find the row.

  • Range scans are efficient:

    sql
    SELECT * FROM users WHERE id BETWEEN 2 AND 5;

    Reads rows in order without jumping around on disk.

  • Secondary indexes reference the primary key, so having a good primary key design impacts overall performance.


Let me know if you want a visual of the B-tree or how InnoDB organizes data on disk!

give me both

Perfect! Let's go through both:


πŸ”’ 1. Visual of a B-Tree (Clustered Index in InnoDB)

In InnoDB, indexes (both primary and secondary) are stored using a B-tree structure. Here's a simplified view:

Clustered B-Tree for PRIMARY KEY (e.g., id)

Suppose you have a users table with:

sql
id | name ------------- 1 | Bob 2 | Charlie 3 | Alice

πŸ“‚ InnoDB Clustered Index (Primary Key: id)

[2] / \ [1] [3] Leaf nodes: 1 β†’ Bob 2 β†’ Charlie 3 β†’ Alice
  • The B-tree nodes are ordered by id.

  • Leaf nodes contain entire row data.

  • This is why it's called "clustered" β€” the table is the index.


🧭 2. Secondary Index Structure (e.g., on email)

Let’s say you create a secondary index:

sql
CREATE INDEX idx_email ON users(email);

Here’s how that looks internally:

πŸ“‚ Secondary Index B-Tree (email)

[bob@a.com] / \ [alice@a.com] [charlie@a.com] Leaf nodes: alice@a.com β†’ id=3 bob@a.com β†’ id=1 charlie@a.com β†’ id=2
  • The leaf nodes store the indexed column (email) + the primary key (id).

  • To get full row data, InnoDB does a lookup in the primary index using the stored id.


🧠 Summary:

Type Stores Row Data? Ordered By Efficiency
Primary Index βœ… Yes Primary Key Fastest for primary key lookups
Secondary Index ❌ No (just points to PK) Indexed column Extra lookup needed to get row

Would you like to see a real EXPLAIN plan to show how MySQL uses these indexes in practice?

why is 2 the root of the tree?

Great observation! In a B-tree, the root node being 2 in the example is just a simplification to show how values are organized logically, not how they are inserted or which value is "first."

Let me explain more clearly.


πŸ”’ Why Was 2 the Root?

In a B-tree:

  • The root node contains pointers to child nodes, helping to navigate to the correct leaf.

  • The value in the root is often the "split point" used to direct the search.

  • It doesn’t mean 2 was inserted first or is more important β€” it's just a middle value used to balance the tree.

So this example:

[2] / \ [1] [3]

means:

  • If you're looking for a value ≀ 2, go left.

  • If you're looking for a value > 2, go right.

  • The tree is balanced, which is the key property of a B-tree.


πŸ”„ How Does MySQL/InnoDB Build This Tree?

In reality:

  • MySQL doesn't just make 2 the root randomly.

  • It builds the tree incrementally as you insert data.

  • As more rows are added and nodes overflow, pages (blocks of storage) are split, and intermediate nodes are created.

That’s why the actual root value can change over time as the table grows.


🧠 Key Points:

Concept Meaning
Root node value Just a logical division point, not necessarily the first inserted.
B-tree property Keeps tree balanced for fast lookups (O(log n)).
Clustered index Leaf nodes contain full row data, ordered by primary key.
Secondary index Leaf nodes contain indexed column + primary key.

Would you like a real example of B-tree growth with inserts or a diagram showing how a B-tree splits and rebalances?

yes
How is this differenet than postgresql approach?

Great question! MySQL (specifically InnoDB) and PostgreSQL use very different approaches to indexing and physical row storage, especially regarding the primary key and clustering. Here's a clear comparison:


πŸ” Primary Storage and Indexing Differences

Feature MySQL (InnoDB) PostgreSQL
Primary index Clustered index β€” table data is stored in the primary key B-tree. Non-clustered by default. Data is stored separately from indexes.
Table structure Data is part of the primary key index. Table = heap (unordered rows). Indexes point to tuples.
Row lookup via index Secondary indexes use primary key for lookup. Indexes use a TID (tuple ID) pointer to locate rows.
Index type B+Tree (for both primary & secondary indexes). B+Tree by default (also supports GIN, GiST, BRIN, etc.).
Clustering Always clustered by primary key in InnoDB. Can be manually clustered (via CLUSTER), but it’s not maintained automatically.
MVCC (multi-version concurrency control) Versioning handled in undo log. Versioning stored in each row (via hidden xmin/xmax columns).
HOT updates Not supported. Supports HOT (Heap Only Tuples) updates β€” avoids index rewrites for some updates.

🧠 Key Conceptual Differences

πŸ”Έ MySQL InnoDB: Clustered Table

  • Only one primary index, and it defines the physical order of rows.

  • Secondary indexes must use the primary key to find data (extra lookup).

  • Efficient for range queries on primary key but rigid structure.

πŸ”Ή PostgreSQL: Heap Table + Indexes

  • Rows are not stored in index order.

  • Each index stores a pointer (TID) to the row's physical location.

  • More flexible, but can lead to bloat (dead tuples, need for VACUUM).

  • Allows more index types and indexing strategies.


πŸ§ͺ Example Scenario

sql
-- Same table in both systems CREATE TABLE users ( id SERIAL PRIMARY KEY, email TEXT ); CREATE INDEX idx_email ON users(email);
  • In MySQL InnoDB:

    • Table is physically ordered by id.

    • idx_email contains email + id, and uses id to find the full row.

  • In PostgreSQL:

    • Table is unordered (heap).

    • idx_email contains email + TID β€” directly points to the heap location.


πŸ”§ Clustered Indexing in PostgreSQL?

You can do this manually:

sql
CLUSTER users USING idx_email;

But:

  • It's a one-time operation.

  • The table won't stay clustered as you insert/update/delete.

  • Unlike MySQL, clustering is not automatic or maintained.


βœ… Summary

  • MySQL (InnoDB): Strongly coupled indexes and data β†’ faster PK lookups, but rigid.

  • PostgreSQL: More flexible storage and indexing β†’ better for complex queries and custom index strategies, but requires maintenance (e.g., VACUUM).

Let me know if you want performance examples or a deep dive into PostgreSQL's MVCC or index types like GIN or BRIN.

show me how posgres handles indexes

Absolutely! Let’s dive into how PostgreSQL handles indexes, from the basics to some of its advanced capabilities.


πŸ”Ή 1. Index Basics in PostgreSQL

In PostgreSQL, indexes are separate structures from the actual table data (heap). When you query a table using an indexed column, the index is searched first to locate the TID (tuple ID), which then points to the row in the table.

πŸ”Έ Example:

sql
CREATE TABLE users ( id SERIAL PRIMARY KEY, name TEXT, email TEXT ); CREATE INDEX idx_email ON users(email);

This creates a B-tree index on email.


πŸ” 2. How a PostgreSQL Index Works Internally

πŸ”Έ Structure:

  • PostgreSQL uses a B+ Tree structure by default for most indexes.

  • The index contains:

    • The indexed column(s) (e.g., email)

    • A TID (tuple ID) which points to the exact row in the heap

πŸ“‚ Visual:

Index (B-tree) ---------------------- email: alice@example.com β†’ (TID = page 5, slot 2) email: bob@example.com β†’ (TID = page 6, slot 1)

Then it uses the TID to retrieve the actual row in the heap.


🧠 3. PostgreSQL Index Types

PostgreSQL supports multiple index types, which is a key strength:

Index Type Use Case
B-tree Default. Fast for equality and range queries (=, <, >, BETWEEN).
Hash Fast for equality only. Rarely used.
GIN Great for full-text search and arrays.
GiST Spatial and geometric queries (PostGIS).
BRIN Efficient for very large tables with naturally ordered columns.
SP-GiST Used for non-balanced data like IPs, quadtrees.

πŸ“ˆ 4. Query Plan Example

sql
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'alice@example.com';

This might return:

Index Scan using idx_email on users Index Cond: (email = 'alice@example.com')

This means:

  • PostgreSQL uses idx_email to find the TID(s).

  • Then it fetches the matching row(s) from the heap.


πŸ”„ 5. VACUUM, TID, and MVCC

PostgreSQL uses MVCC (Multi-Version Concurrency Control), which:

  • Keeps multiple versions of a row in the heap.

  • Requires background maintenance (VACUUM) to clean up old versions.

  • Indexes always point to one version of the row, but that version might be dead β€” this is why VACUUM is critical.


βš™οΈ 6. Manual Clustering (Optional)

You can manually make a table match an index order:

sql
CLUSTER users USING idx_email;
  • This rewrites the table to match idx_email's order.

  • But it won’t stay clustered over time unless you re-run it manually.


βœ… Summary

Feature PostgreSQL Index Behavior
Index structure Separate from heap (points via TID)
Default index type B-tree
Maintains row order? No (heap)
Row versioning MVCC with VACUUM
Unique strengths Many index types, customizable indexing logic

Let me know if you'd like a demo of full-text search with GIN indexes, or an example of composite and partial indexes!

how it does handle secondary indexes?

Excellent question! In PostgreSQL, what we commonly call secondary indexes (anything that's not the primary key) are handled in a powerful and flexible way β€” but very differently from MySQL's InnoDB.


πŸ” What Are Secondary Indexes in PostgreSQL?

In PostgreSQL, all indexes (including primary and unique constraints) are stored separately from the table data (heap). This means:

  • The primary key is not clustered by default.

  • Secondary indexes work the same way structurally as a primary index β€” they’re just another index.


🧠 How PostgreSQL Handles Secondary Indexes

πŸ“¦ Structure:

A secondary index in PostgreSQL:

  • Contains:

    • The indexed column values

    • A TID (tuple ID) β†’ a physical pointer to the row in the heap (page number + row offset)

  • It does not store the full row.

  • The TID is used to fetch the actual row from the heap.


πŸ“‚ Example

Suppose you have:

sql
CREATE TABLE users ( id SERIAL PRIMARY KEY, email TEXT, created_at TIMESTAMP ); CREATE INDEX idx_created_at ON users(created_at);

Now PostgreSQL stores idx_created_at like:

B-tree Index (idx_created_at) ----------------------------- '2023-01-01' β†’ (page 12, slot 5) '2023-01-02' β†’ (page 13, slot 3)

So when you run:

sql
SELECT * FROM users WHERE created_at = '2023-01-01';
  • PostgreSQL scans the index.

  • Finds the TID (e.g., page 12, slot 5).

  • Uses the TID to fetch the full row from the heap.

This is known as an "Index Scan" + Heap Fetch.


πŸ” Index-Only Scans (if possible)

If your query only needs columns in the index, and PostgreSQL verifies that the row is visible (not outdated), it can skip the heap fetch entirely.

sql
CREATE INDEX idx_email_only ON users(email); SELECT email FROM users WHERE email = 'test@example.com';

PostgreSQL may do an Index-Only Scan β€” faster, since it avoids heap access.

Note: This only works if PostgreSQL's visibility map shows the row is not dead (i.e., no VACUUM needed).


βœ… Summary of Secondary Index Behavior in PostgreSQL

Feature PostgreSQL Behavior
Index location Stored separately from table (heap)
Lookup method Uses TID (tuple ID) to point to actual row
Works the same as primary? Yes, structurally identical
Full row access needed? Yes (unless it's an index-only scan)
Maintains table order? No β€” heap is unordered
Clustered automatically? No (must use CLUSTER, and it’s not maintained)

Would you like to see a real example with EXPLAIN ANALYZE to show how PostgreSQL executes queries with secondary indexes?

yes

Awesome! Let’s walk through a realistic example using PostgreSQL’s EXPLAIN ANALYZE to see how it handles secondary indexes.


πŸ§ͺ Example: Querying with a Secondary Index

1️⃣ Setup:

sql
CREATE TABLE users ( id SERIAL PRIMARY KEY, name TEXT, email TEXT, created_at TIMESTAMP ); -- Secondary index on `email` CREATE INDEX idx_email ON users(email);

Now let’s assume the table has many rows, and we want to query by email:

sql
SELECT * FROM users WHERE email = 'alice@example.com';

2️⃣ Run EXPLAIN ANALYZE:

sql
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'alice@example.com';

πŸ” Output (example):

Index Scan using idx_email on users (cost=0.29..8.30 rows=1 width=72) Index Cond: (email = 'alice@example.com') Actual time: 0.015..0.017 rows=1 loops=1 Planning Time: 0.095 ms Execution Time: 0.031 ms

🧠 What This Means:

Line Meaning
Index Scan PostgreSQL used the secondary index idx_email.
Index Cond It searched the index for email = 'alice@example.com'.
Heap Fetch Not shown explicitly, but it did one heap lookup using the TID from the index.
Execution Time Very fast β€” index usage was effective.

πŸ§ͺ Bonus: Index-Only Scan

Now let’s try a query that could avoid the heap fetch:

sql
-- Rebuild the visibility map with vacuum to enable index-only scan VACUUM ANALYZE users; EXPLAIN ANALYZE SELECT email FROM users WHERE email = 'alice@example.com';

πŸ” Possible Output:

Index Only Scan using idx_email on users (cost=0.29..8.30 rows=1 width=32) Index Cond: (email = 'alice@example.com') Heap Fetches: 0

Heap Fetches: 0 βœ… β†’ PostgreSQL didn't touch the heap. It got everything it needed from the index!


βœ… Summary

Query Type Index Used? Heap Access? Fastest?
SELECT * βœ… Yes βœ… Yes No (needs heap fetch)
SELECT email βœ… Yes ❌ Maybe βœ… If Index-Only Scan is possible

Would you like a full script to try this yourself with test data, or a diagram of the planner in action?

So how does a postgresql indexes for two differenet columns look like?

Great follow-up! In PostgreSQL, when you index two different columns, there are two primary strategies:


🎯 1. Composite Index (Multi-Column Index)

βœ… One index for multiple columns (e.g., (email, created_at))

sql
CREATE INDEX idx_email_created_at ON users(email, created_at);

πŸ” Internally:

PostgreSQL stores the index like this:

(email, created_at) β†’ TID

B-tree Order Example:

('alice@example.com', '2023-01-01') β†’ TID ('alice@example.com', '2023-01-02') β†’ TID ('bob@example.com', '2023-01-01') β†’ TID

πŸ”Ž How It Works:

  • Can be used for:

    • WHERE email = 'alice@example.com'

    • WHERE email = 'alice@example.com' AND created_at = '2023-01-01'

    • WHERE email = 'alice@example.com' AND created_at > '2023-01-01'

  • Cannot efficiently use:

    • WHERE created_at = '2023-01-01' (email must come first)

πŸ”‘ Key Rule:

PostgreSQL uses left-most prefix of the index.
If you want to search by created_at alone, this index won’t help much.


🧩 2. Two Separate Indexes

sql
CREATE INDEX idx_email ON users(email); CREATE INDEX idx_created_at ON users(created_at);

πŸ” Internally:

Two distinct indexes:

idx_email: ('alice@example.com') β†’ TID ('bob@example.com') β†’ TID idx_created_at: ('2023-01-01') β†’ TID ('2023-01-02') β†’ TID

πŸ”Ž How PostgreSQL Uses This:

  • If your query is:

    sql
    SELECT * FROM users WHERE email = 'alice@example.com' AND created_at = '2023-01-01';

    PostgreSQL may use:

    • Just one index (and filter on the other column)

    • Or do a Bitmap Index Scan: it combines both indexes efficiently!


πŸ€– Bitmap Index Scan: Combined Index Use

Example Query:

sql
SELECT * FROM users WHERE email = 'alice@example.com' AND created_at = '2023-01-01';

EXPLAIN ANALYZE Output:

Bitmap Heap Scan on users Recheck Cond: (email = 'alice@example.com' AND created_at = '2023-01-01') -> Bitmap Index Scan on idx_email -> Bitmap Index Scan on idx_created_at
  • PostgreSQL scans both indexes.

  • Intersects the results using a bitmap in memory.

  • Then fetches matching rows from the heap.


βœ… Summary

Approach Benefit Limitation
Composite Index Very fast for queries matching index prefix Useless if query skips left-most column
Separate Indexes More flexible for varying conditions May be slower; might use Bitmap Scan

Would you like a demo dataset and a full EXPLAIN plan comparing both strategies?