Andrew Lamb
andrewlamb1111.bsky.social
Apache {DataFusion PMC}, Database Internals
"if you want to go fast, go alone; If you want to go far, go together"
New Apache Parquet Community page is up: parquet.apache.org/community/
November 7, 2025 at 8:06 PM
If anyone wants to know why Xiangpeng Hao is a great mentor, they can read this response: github.com/XiangpengHao...
November 3, 2025 at 8:16 PM
I have heard from 3 people/projects in the last three days they are considering forks of iceberg-rust. I filed a ticket to see if we can figure out how to consolidate efforts: github.com/apache/icebe...
October 28, 2025 at 5:50 PM
Some Apache Parquet nerd humor for Friday afternoon

lists.apache.org/thread/36rdg...
October 24, 2025 at 8:24 PM
We made Apache Parquet metadata parsing 3x-9x faster in the latest release of the Rust implementation
arrow.apache.org/blog/2025/10...
October 24, 2025 at 9:55 AM
More Products built with Apache DataFusion: Palantir Foundry's Pipeline Builder

www.palantir.com/docs/foundry...
October 21, 2025 at 7:52 PM
Prateek Gaur and co at Snowflake reproduced the (great) results for the ALP encoding algorithm from CWI / Azim Afroozeh / Peter Boncz

ALP achieves ZSTD levels of compression and much faster decode. We are discussing adding it to @ApacheParquet: lists.apache.org/thread/tjtln...
October 17, 2025 at 1:05 PM
Our new Thrift parser in the Rust Apache Parquet implementation is a 🎁 that keeps on giving performance-wise 🚀 github.com/apache/arrow...

We are also working on a blog post that has a deeper explanation
October 10, 2025 at 6:52 PM
Yesterday I learned about the SpatialBench from Sedona github.com/apache/sedon...

Which they based on the tpchgen-rs project from @clflushopt.bsky.social github.com/clflushopt/t...

(BTW I am still looking for some more GitHub watchers on tpchgen-rs so I can get it on Homebrew)
October 9, 2025 at 5:38 PM
"It is not 100% clear to me how a new file format (or three) will drive additional ecosystem adoption :thinking:"

However, I absolutely think this adds to the pressure for Parquet to evolve.

Speaking of, anyone interested in helping add new encodings to parquet?
lists.apache.org/thread/djnbb...
October 1, 2025 at 7:21 PM
Apache DataFusion 50 is released. Read all about it here: datafusion.apache.org/blog/2025/09...
September 29, 2025 at 1:47 PM
So Cool -- jcsherin added full text indexes into Parquet files using the techniques from our blog

github.com/jcsherin/dat...
September 25, 2025 at 3:48 PM
We just published an easier-to-find list of all PMC members and committers of Apache DataFusion, and it is quite a cool list of people and affiliations if I do say so myself 🤗
datafusion.apache.org/contributor-...
September 19, 2025 at 3:26 PM
It was a great time on Monday at the @apachedatafusion.bsky.social meetup in NYC. We heard about distributed query plans, filter pushdown, geospatial support, and VegaFusion.

More deets here github.com/apache/dataf...
September 17, 2025 at 3:56 PM
Dynamic Filters for TopK and Join queries landing in DataFusion 50.0.0: datafusion.apache.org/blog/2025/09...
September 11, 2025 at 10:34 AM
What is LiquidCache in these slides: what-is-liquid-cache.xiangpeng.systems

BTW @xiangpeng.systems is looking for some early adopters who want to be on the bleeding edge. Hit me up if interested
September 10, 2025 at 8:19 PM
Recording of "Introduction to Variant in @ApacheParquet": www.youtube.com/watch?v=nlOJ...

Here are the slides: docs.google.com/presentation...
September 6, 2025 at 9:54 AM
We want `brew install tpchgen-cli` to work, but that requires the project to be "popular" enough according to homebrew (30 forks, 30 watchers and 75 stars)

We just need 6 more forks and 24 watchers. Can you help us out?

github.com/clflushopt/t...

Deets: github.com/clflushopt/t...
September 5, 2025 at 1:16 PM
Thanks to @clflushopt.bsky.social, make massive TPCH datasets with tpchgen-cli 2.0:

SF1000 (1TB raw, 220GB in @ApacheParquet) in less than 10 mins (6m45s) on an aging laptop

Try it now:

pip install tpchgen-cli
tpchgen-cli --scale-factor 1000 --parts 100 --format=parquet

github.com/clflushopt/t...
September 4, 2025 at 12:51 PM
It is a common misconception that Parquet requires (slow) reparsing metadata and is limited to built in indexing structures.

Using External Indexes, Metadata Stores, Catalogs and Caches to Accelerate Queries on Parquet with @apachedatafusion.bsky.social

datafusion.apache.org/blog/2025/08...
August 15, 2025 at 12:48 PM
"EDB claimed the new engine, which pushes queries to open source @apachedatafusion.bsky.social , returned queries 30x faster than standard Postgres while tiering offloads cold transactional data to storage is up 18x more cost-efficient."
www.theregister.com/2025/06/20/e...
July 30, 2025 at 12:12 PM
@apachedatafusion.bsky.social 48.0.0 release. Spark Compatible functions, ORDER BY ALL, FFI for aggregates and window functions: datafusion.apache.org/blog/2025/07...
July 16, 2025 at 12:47 PM
It is a common misconception that Apache Parquet files are restricted to basic statistics. Footer metadata and offset-based addressing permit user-defined index structures today.

@apachedatafusion.bsky.social blog from Qi Zhi, Jigao Luo and myself explains how: datafusion.apache.org/blog/2025/07...
July 14, 2025 at 1:30 PM
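
A rough sketch of the offset-based addressing mentioned in the post above, not the blog's actual implementation. It relies on one fact of the Parquet layout: a file ends with a 4-byte little-endian footer length followed by the magic bytes PAR1. Everything else is an assumption for illustration: the helper name footer_span, the toy buffer, and the fake footer and index payloads are hypothetical, and a real file would hold Thrift-encoded metadata in the footer and would record the index offset in the footer's key-value metadata rather than in a Python variable.

```python
import struct

MAGIC = b"PAR1"  # magic bytes at the start and end of a Parquet file


def footer_span(data: bytes) -> tuple[int, int]:
    """Return (offset, length) of the footer in a Parquet-style byte buffer."""
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a Parquet-style buffer")
    # The 4 bytes before the trailing magic hold the footer length (little-endian).
    (length,) = struct.unpack("<I", data[-8:-4])
    offset = len(data) - 8 - length
    if offset < 4:
        raise ValueError("footer length exceeds buffer size")
    return offset, length


# Toy buffer (hypothetical payloads, NOT real Parquet contents).
body = b"column-chunks"      # stands in for the column data
index = b"my-custom-index"   # stands in for a user-defined index
footer = b"fake-footer"      # stands in for Thrift-encoded metadata
toy = MAGIC + body + index + footer + struct.pack("<I", len(footer)) + MAGIC

# Assumption: a real writer would store this offset in footer key-value metadata.
index_offset = len(MAGIC) + len(body)

offset, length = footer_span(toy)
recovered_footer = toy[offset:offset + length]
recovered_index = toy[index_offset:index_offset + len(index)]
```

Because readers locate everything through offsets, a reader that does not know about the extra index bytes simply never visits them, which is why such an index can sit inside the file without breaking existing readers.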
Belated DataFusion 47.0.0 release blog: datafusion.apache.org/blog/2025/07...
July 11, 2025 at 11:10 AM
Apache Iceberg compaction in Rust via github.com/nimtable/ice... (based on @apachedatafusion.bsky.social). Thanks to RisingWave CEO Yingjun Wu for the shout-out
July 10, 2025 at 10:02 PM