Andrew Lamb
@andrewlamb1111.bsky.social
Apache DataFusion PMC, Database Internals
Here is a nice examination of the benefits of building new systems using the extensibility of @apachedatafusion.bsky.social vs other systems. www.bauplanlabs.com/post/duck-hu...
Duck Hunt: moving Bauplan from DuckDB to DataFusion
Bauplan's journey from DuckDB to Apache DataFusion: how switching SQL engines doubled query performance on Iceberg lakehouses while enabling greater hackability
www.bauplanlabs.com
November 11, 2025 at 3:41 PM
Reposted by Andrew Lamb
Excited to be one of the attendees and present our work on the DataFusion-powered SedonaDB alongside a great lineup of talks! If you're in the Boston area come and say hi!
We are holding the next Apache DataFusion meetup next Wednesday Nov 12 in Boston. lu.ma/w9pw5rce
Boston Apache DataFusion Meetup · Luma
Join us for an evening of talks, panel discussion, and community discussion about Apache DataFusion and its growing role in modern data infrastructure. This…
lu.ma
November 4, 2025 at 6:33 PM
"if you want to go fast, go alone; If you want to go far, go together"
New Apache Parquet Community page is up: parquet.apache.org/community/
November 7, 2025 at 8:06 PM
We are holding the next Apache DataFusion meetup next Wednesday Nov 12 in Boston. lu.ma/w9pw5rce
Boston Apache DataFusion Meetup · Luma
Join us for an evening of talks, panel discussion, and community discussion about Apache DataFusion and its growing role in modern data infrastructure. This…
lu.ma
November 4, 2025 at 6:05 PM
If anyone wants to know why Xiangpeng Hao is a great mentor, they can read this response: github.com/XiangpengHao...
November 3, 2025 at 8:16 PM
New version of Rust Apache Arrow and Apache Parquet is out -- includes a new metadata parser, a new Avro reader, and geometry and variant support 🤯 arrow.apache.org/blog/2025/10...
Apache Arrow Rust 57.0.0 Release
The Apache Arrow team is pleased to announce that the v57.0.0 release of Apache Arrow Rust is now available on crates.io (arrow and parquet) and as source download. See the 57.0.0 changelog for a full...
arrow.apache.org
October 31, 2025 at 10:26 AM
I have heard from three people/projects in the last three days that they are considering forks of iceberg-rust. I filed a ticket to see if we can figure out how to consolidate efforts: github.com/apache/icebe...
October 28, 2025 at 5:50 PM
Apache DataFusion's policy for AI assisted contribution:
AI is great, but not AI dumps: maintainers could finish the task faster by using AI directly, and the submitters gain little knowledge when acting as a pass through AI proxy.
datafusion.apache.org/contributor-...
Introduction — Apache DataFusion documentation
datafusion.apache.org
October 27, 2025 at 12:51 PM
October 24, 2025 at 8:24 PM
We made Apache Parquet metadata parsing 3x-9x faster in the latest release of the Rust implementation
arrow.apache.org/blog/2025/10...
October 24, 2025 at 9:55 AM
Reposted by Andrew Lamb
Today's Future Data Systems Seminar Speaker: Ian Cook (@ian.columnar.tech) will present @columnar.tech's work on Apache Arrow's database connectivity API (ADBC). ADBC is available in modern DBMSs. Zoom talk open to public at 4:30pm ET. YouTube video available after: db.cs.cmu.edu/events/futur...
[Future Data] Where We're Going, We Don't Need Rows: Columnar Data Connectivity with ADBC - Carnegie Mellon Database Group
ADBC (Arrow Database Connectivity) is Apache Arrow’s answer to ODBC and JDBC:...
db.cs.cmu.edu
October 20, 2025 at 11:38 AM
More Products built with Apache DataFusion: Palantir Foundry's Pipeline Builder
www.palantir.com/docs/foundry...
October 21, 2025 at 7:52 PM
Prateek Gaur and co at Snowflake reproduced the (great) results for the ALP encoding algorithm from CWI / Azim Afroozeh / Peter Boncz
ALP achieves ZSTD levels of compression and much faster decode. We are discussing adding it to @ApacheParquet: lists.apache.org/thread/tjtln...
October 17, 2025 at 1:05 PM
The talk on Vortex @db.cs.cmu.edu youtube.com/watch?v=zyn_... is a great one.
I think it would also be interesting to hear a counterpoint about
Apache Parquet that explains actual technical details of that format, the Cathedral vs Bazaar management, options with metadata, etc.
Vortex: LLVM for File Formats (Will Manning)
YouTube video by CMU Database Group
youtube.com
October 15, 2025 at 12:57 PM
Our new Thrift parser in the Rust Apache Parquet implementation is a 🎁 that keeps on giving performance-wise 🚀 github.com/apache/arrow...
We are also working on a blog post that has a deeper explanation
October 10, 2025 at 6:52 PM
Yesterday I learned about the SpatialBench from Sedona github.com/apache/sedon...
Which they based on the tpchgen-rs project from @clflushopt.bsky.social github.com/clflushopt/t...
(BTW I am still looking for some more GitHub watchers on tpchgen-rs so I can get it on Homebrew)
October 9, 2025 at 5:38 PM
BTW if anyone wants a good intro to database storage / log-structured storage (aka LSM trees), the @db.cs.cmu.edu lecture this fall is a good one: www.youtube.com/watch?v=2_sT...
#05 - Log-Structured Database Storage ✸ SingleStore Database Talk (CMU Intro to Database Systems)
YouTube video by CMU Database Group
www.youtube.com
October 7, 2025 at 1:32 PM
It starts: github.com/clflushopt/t...
@clflushopt.bsky.social is going to make the world's fastest TPC-DS generator
GitHub - clflushopt/tpcdsgen: WIP (out of tree) Rust implementation of TPC-DS generators.
WIP (out of tree) Rust implementation of TPC-DS generators. - clflushopt/tpcdsgen
github.com
October 2, 2025 at 11:48 AM
Apache DataFusion 50 is released. Read all about it here: datafusion.apache.org/blog/2025/09...
September 29, 2025 at 1:47 PM
Cloudflare's distributed R2 SQL engine is a pretty good exemplar of how to build a serverless database to process petabytes in seconds using Apache DataFusion and Apache Parquet
blog.cloudflare.com/r2-sql-deep-...
R2 SQL: a deep dive into our new distributed query engine
R2 SQL provides a built-in, serverless way to run ad-hoc analytic queries against your R2 Data Catalog. This post dives deep under the Iceberg into how we built this distributed engine, from its metad...
blog.cloudflare.com
September 26, 2025 at 10:29 AM
Reposted by Andrew Lamb
I cannot say enough about DataFusion...in order to build an engine that considers spatial types at every level we needed to customize types, functions, optimizer rules, joins, Parquet pruning, and more. DataFusion not only made this possible but documented even the most obscure bits. So cool!
"Introducing SedonaDB: A single-node analytical database engine with geospatial as a first-class citizen"
Built in Rust with Apache DataFusion
sedona.apache.org/latest/blog/...
Introducing SedonaDB: A single-node analytical database engine with geospatial as a first-class citizen - Apache Sedona
Apache Sedona is a cluster computing system for processing large-scale spatial data. Sedona extends existing cluster computing systems, such as Apache Spark, Apache Flink, and Snowflake, with a set of...
sedona.apache.org
September 25, 2025 at 1:35 AM
So cool -- jcsherin added full-text indexes to Parquet files using the techniques from our blog
github.com/jcsherin/dat...
September 25, 2025 at 3:48 PM
"Introducing SedonaDB: A single-node analytical database engine with geospatial as a first-class citizen"
Built in Rust with Apache DataFusion
sedona.apache.org/latest/blog/...
Introducing SedonaDB: A single-node analytical database engine with geospatial as a first-class citizen - Apache Sedona
Apache Sedona is a cluster computing system for processing large-scale spatial data. Sedona extends existing cluster computing systems, such as Apache Spark, Apache Flink, and Snowflake, with a set of...
sedona.apache.org
September 24, 2025 at 9:20 PM
We just published an easier-to-find list of all PMC members and committers of Apache DataFusion, and it is quite a cool list of people and affiliations if I do say so myself 🤗
datafusion.apache.org/contributor-...
September 19, 2025 at 3:26 PM
And we are also adding Geometry to the Rust Parquet implementation. Huge thanks to @kylebarron.dev github.com/apache/arrow...
[EPIC] [Parquet] Implement Geometry and Geography type support in Parquet · Issue #8373 · apache/arrow-rs
Is your feature request related to a problem or challenge? Please describe what you are trying to do. Parquet recently adopted Geometry and Geography types: apache/parquet-format@master/Geospatial....
github.com
September 17, 2025 at 6:49 PM