Lightnews — Scholar-powered news

rasib

@rasib.bsky.social

Will write the rest in Obsidian now :)

October 24, 2025 at 8:45 PM

rasib

@rasib.bsky.social

Sidenote:

This does make me wonder how a similar migration would look like in Postgres.

They use heap files so their primary and secondary index should be equally fast... But do heap files imply that even our initial naive approach would work just fine?

October 24, 2025 at 8:38 PM

rasib

@rasib.bsky.social

This made the migration go from taking about 3 days to less than 5 minutes!

October 24, 2025 at 8:36 PM

rasib

@rasib.bsky.social

The solution in our case was simple:

We queried the rows to get a list of all affected primary keys first, and then we ran the UPDATE query over this list.

October 24, 2025 at 8:36 PM

rasib

@rasib.bsky.social

Secondary indexes, on the other hand, store pointer to primary keys. They look kinda like this:

age=22 → primary_key=101
age=29 → primary_key=103
age=35 → primary_key=120

So filtering by secondary index still requires about O(k log n)

October 24, 2025 at 8:33 PM

rasib

@rasib.bsky.social

MySQL stores data in Index-organized table.

In an IOT, the primary index (B-Tree) also stores the data. So any operation on primary key requires only a single B-tree lookup.

October 24, 2025 at 8:29 PM

rasib

@rasib.bsky.social

This made the migration go from taking about 3 days to less than 5 minutes!

October 24, 2025 at 8:19 PM

rasib

@rasib.bsky.social

So by calling in a database expert, we fixed our script by first querying the rows first to get a list of all affected primary key, and then running the UPDATE query over this list.

October 24, 2025 at 8:15 PM

rasib

@rasib.bsky.social

We would know how dumb our approach was if we knew how MySQL worked.

October 24, 2025 at 8:10 PM

rasib

@rasib.bsky.social

When we ran the migration (5 am :) off peak hours), we saw CPU util to rise to consistent 99%, and logs showed us that expected completion time was around 72 hours. So we aborted almost immediately.

October 24, 2025 at 8:10 PM

rasib

@rasib.bsky.social

Now, I wonder how this would look like in Postgres. They use heap files so their primary and secondary index should be equally fast.

October 24, 2025 at 8:00 PM

rasib

@rasib.bsky.social

So by querying the rows first to get a list of all affected primary key, and then runing the UPDATE query over this list: we went from the migration taking us about 3 days to less than 5 minutes.

October 24, 2025 at 7:58 PM

rasib

@rasib.bsky.social

In an IOT, the primary B-Tree also stores the data. So any operation on primary key is requires only a B-tree lookup. Secondary indexes on the otherhand store pointer to primary keys kinda like this:

age=22 → primary_key=101
age=29 → primary_key=103
age=35 → primary_key=120

October 24, 2025 at 7:54 PM

rasib

@rasib.bsky.social

We would know how dumb our approach was if we knew MySQL stored data in Index-organized table (IOT).

October 24, 2025 at 7:54 PM

rasib

@rasib.bsky.social

We saw the migration script caused CPU util to go to 99% and was expected to take over 3 days to complete.

October 24, 2025 at 7:50 PM

rasib

@rasib.bsky.social

Wide Column Stores are conceptually multidimensional map with hierarchical indexes. But in storage instead of nesting, they are flat list with a row key sorted by column family.

October 24, 2025 at 6:22 PM

rasib

@rasib.bsky.social

Seems like column-layout also have better compression and cache utilization and they might be easier to vectorize (pandas, SIMD).

October 23, 2025 at 9:02 PM

rasib

@rasib.bsky.social

Storing metadata to map back to rows is important for other queries though. Every data point holding a key is probably too much duplication. Using offset based matching seems more reasonable.

October 23, 2025 at 8:30 PM

rasib

@rasib.bsky.social

Column-Oriented stores like ClickHouse are better at complex aggregates quires like finding trends, averages etc.

October 23, 2025 at 8:28 PM

rasib

@rasib.bsky.social

Column vs Row-Oriented DBMS difference is interesting. I didn't know MySQL and PostgreSQL try to store each row together in a page. I guess this means it's cheap to select * instead of specific columns since we have to read same number of pages anyway.

October 23, 2025 at 5:48 PM

rasib

@rasib.bsky.social

It was an fun watch. Got me excited enough to start your Crust of Rust series :)

October 15, 2025 at 4:26 PM

Add to Home Screen

Light up
your news

Add to Home Screen

Light upyour news

Sign in to Lightnews

Sign up to start reading

Connect Bluesky

Connect with Bluesky

Light up
your news