Lightnews — Scholar-powered news

Andy Pavlo

@andypavlo.bsky.social

TIL

November 10, 2025 at 4:34 PM

Reposted by Andy Pavlo

Sam Arch

@samarchdb.bsky.social

Great idea to compare plans across different systems using rows processed. A good yardstick, but slower sort-based plans from Postgres + MSSQL process fewer rows than faster hash-based plans from DuckDB. Postgres rows scanned also seem underreported. Nice to see some competition with ClickBench.

November 3, 2025 at 5:28 PM

Andy Pavlo

@andypavlo.bsky.social

I've been working on this for +15 years and I think we're finally there.
LLM reasoning agent decides what sub-agent to invoke based on circumstances. Those sub-agents could be using heuristics or ML. Our new holistic algorithm is not LLM-based but we use LLMs to reduce training: arxiv.org/abs/2510.17748

This is Going to Sound Crazy, But What If We Used Large Language Models to Boost Automatic Database Tuning Algorithms By Leveraging Prior History? We Will Find Better Configurations More Quickly Than ...

Tuning database management systems (DBMSs) is challenging due to trillions of possible configurations and evolving workloads. Recent advances in tuning have led to breakthroughs in optimizing over the...

arxiv.org

October 24, 2025 at 3:18 AM

Andy Pavlo

@andypavlo.bsky.social

This is to make up for not calling my biological daughter "DROP TABLE students; --". My wife didn't go for it when I tried.
twitter.com/weschow/stat...

Wes Chow on X: "@andy_pavlo @DeepGenes If there was someone who could have actually pulled it off... I guess there's always the next one! Congrats! https://t.co/Xet1zRlNd8" / X

x.com

October 23, 2025 at 3:43 PM

Andy Pavlo

@andypavlo.bsky.social

The company is officially called "SO-YOU-DONT-HAVE-TO INCORPORATED'); DROP TABLE companies; --".
A lot of websites and the IRS don't like that name though.
We will announce more about it later this year. You can sign up to be on the waitlist: sydht.ai

SO-YOU-DONT-HAVE-TO INCORPORATED'); DROP TABLE companies; --

SO-YOU-DONT-HAVE-TO is a next-generation automated PostgreSQL optimization platform based on agentic artificial intelligence.

sydht.ai

October 23, 2025 at 3:09 PM

Reposted by Andy Pavlo

Artem Krylysov

@artem.krylysov.com

MMAP is incredibly fast when the dataset fits in memory, but it slows to a crawl when it doesn't, especially if the workload is mostly random point lookups. Speaking as someone who built an MMAP-based key-value store before :) Obligatory paper from @andypavlo.bsky.social db.cs.cmu.edu/mmap-cidr2022/

October 11, 2025 at 3:39 PM

Andy Pavlo

@andypavlo.bsky.social

Good. You need to build your strength back up to prep for your next challenge as CS dept chair!

October 4, 2025 at 1:02 AM

Andy Pavlo

@andypavlo.bsky.social

What are you talking about? MapReduce is the opposite of "moving compute to the data". It was all about moving/pulling the data to compute in a shared-disk architecture. See this old paper: dl.acm.org/doi/10.1145/...

A comparison of approaches to large-scale data analysis | Proceedings of the 2009 ACM SIGMOD International Conference on Management of data

dl.acm.org

October 2, 2025 at 2:32 PM

Andy Pavlo

@andypavlo.bsky.social

There was a collaboration attempt between CMU, Tsinghua, Meta, CWI, Nvidia, Voltron, & SpiralDB. But then lawyers got involved and it fell apart. Everyone released their own format:
→ Meta Nimble: github.com/facebookincu...
→ CWI FastLanes: github.com/cwida/FastLa...
→ SpiralDB Vortex: vortex.dev

GitHub - facebookincubator/nimble: New file format for storage of large columnar datasets.

New file format for storage of large columnar datasets. - facebookincubator/nimble

github.com

October 1, 2025 at 1:49 PM

Andy Pavlo

@andypavlo.bsky.social

Our F3 files embed small WASM programs to decode data. If somebody creates a new encoding and the DBMS does not have a native impl, it can still read the data by running the embedded WASM and passing Arrow buffers. Our experiments show WASM is 15-20% slower than native. We use @spiraldb.com's Vortex encoding impls.

Overview of F3's decoding pipeline with WASM support.

October 1, 2025 at 1:49 PM
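
A rough sketch of the fallback idea from the post above, not F3's actual code or API. Everything here is hypothetical: the `NATIVE_DECODERS` registry, the `Column` type, and the toy `delta_decode` function, which is a plain Python callable standing in for a WASM program shipped inside the file. No real WASM runtime or Arrow buffers are involved.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Decoder = Callable[[bytes], List[int]]

# Hypothetical: decoders compiled into the reader itself ("native").
NATIVE_DECODERS: Dict[str, Decoder] = {
    "plain": lambda raw: list(raw),
}


@dataclass
class Column:
    encoding: str                                # encoding name recorded in the file
    payload: bytes                               # encoded column data
    embedded_decoder: Optional[Decoder] = None   # stand-in for an embedded WASM program


def decode(column: Column) -> Tuple[List[int], str]:
    """Prefer the native decoder; fall back to the one shipped with the file."""
    native = NATIVE_DECODERS.get(column.encoding)
    if native is not None:
        return native(column.payload), "native"
    if column.embedded_decoder is not None:
        return column.embedded_decoder(column.payload), "embedded"
    raise ValueError(f"no decoder available for encoding {column.encoding!r}")


def delta_decode(raw: bytes) -> List[int]:
    """Toy 'new' encoding: each byte is a delta from the running total."""
    values, total = [], 0
    for b in raw:
        total += b
        values.append(total)
    return values


known = Column("plain", bytes([1, 2, 3]))
unknown = Column("delta", bytes([10, 1, 1, 5]), embedded_decoder=delta_decode)
print(decode(known))    # decoded with the reader's own implementation
print(decode(unknown))  # decoded with the decoder carried by the file
```

A reader without the fallback would hit the `ValueError` branch for the `delta` column, which is the Parquet situation described in the next post down.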

Andy Pavlo

@andypavlo.bsky.social

One problem with Parquet is many implementations are not updated when the official spec improves. Everyone just uses the lowest version feature set. That means if Parquet adds a better data encoding scheme and a file uses it, many common reader libraries won't be able to retrieve the data.

Survey of the features used in public Parquet files.

October 1, 2025 at 1:49 PM

Andy Pavlo

@andypavlo.bsky.social

Shoot I don't know how I missed that when I was copy-pasting. It wasn't intentional. Sorry :-(

September 18, 2025 at 2:18 PM
