Lightnews — Scholar-powered news

Andrew Lamb

@andrewlamb1111.bsky.social

"if you want to go fast, go alone; If you want to go far, go together"
New Apache Parquet Community page is up: parquet.apache.org/community/

November 7, 2025 at 8:06 PM

Andrew Lamb

@andrewlamb1111.bsky.social

If anyone wants to know why Xiangpeng Hao is a great mentor, they can read this response: github.com/XiangpengHao...

November 3, 2025 at 8:16 PM

Andrew Lamb

@andrewlamb1111.bsky.social

I have heard from 3 people/projects in the last three days that they are considering forks of iceberg-rust. I filed a ticket to see if we can figure out how to consolidate efforts: github.com/apache/icebe...

October 28, 2025 at 5:50 PM

Andrew Lamb

@andrewlamb1111.bsky.social

Some Apache Parquet nerd humor for Friday afternoon

lists.apache.org/thread/36rdg...

October 24, 2025 at 8:24 PM

Andrew Lamb

@andrewlamb1111.bsky.social

We made Apache Parquet metadata parsing 3x-9x faster in the latest release of the Rust implementation
arrow.apache.org/blog/2025/10...

October 24, 2025 at 9:55 AM

Andrew Lamb

@andrewlamb1111.bsky.social

More Products built with Apache DataFusion: Palantir Foundry's Pipeline Builder

www.palantir.com/docs/foundry...

October 21, 2025 at 7:52 PM

Andrew Lamb

@andrewlamb1111.bsky.social

Prateek Gaur and co at Snowflake reproduced the (great) results for the ALP encoding algorithm from CWI / Azim Afroozeh / Peter Boncz

ALP achieves ZSTD levels of compression and much faster decode. We are discussing adding it to @ApacheParquet: lists.apache.org/thread/tjtln...

October 17, 2025 at 1:05 PM

Andrew Lamb

@andrewlamb1111.bsky.social

Our new Thrift parser in the Rust Apache Parquet implementation is a 🎁 that keeps on giving performance-wise 🚀 github.com/apache/arrow...

We are also working on a blog post that has a deeper explanation

October 10, 2025 at 6:52 PM

Andrew Lamb

@andrewlamb1111.bsky.social

Yesterday I learned about the SpatialBench from Sedona github.com/apache/sedon...

Which they based on the tpchgen-rs project from @clflushopt.bsky.social github.com/clflushopt/t...

(BTW I am still looking for some more github watchers on tpchgen-rs so I can get it on homebrew)

October 9, 2025 at 5:38 PM

Andrew Lamb

@andrewlamb1111.bsky.social

"It is not 100% clear to me how a new file format (or three) will drive additional ecosystem adoption :thinking:"

However, I absolutely think this adds to the pressure for Parquet to evolve.

Speaking of, anyone interested in helping add new encodings to parquet?
lists.apache.org/thread/djnbb...

October 1, 2025 at 7:21 PM

Andrew Lamb

@andrewlamb1111.bsky.social

Apache DataFusion 50 is released. Read all about it here: datafusion.apache.org/blog/2025/09...

September 29, 2025 at 1:47 PM

Andrew Lamb

@andrewlamb1111.bsky.social

So Cool -- jcsherin added full text indexes into Parquet files using the techniques from our blog

github.com/jcsherin/dat...

September 25, 2025 at 3:48 PM

Andrew Lamb

@andrewlamb1111.bsky.social

We just published an easier-to-find list of all PMC members and committers of Apache DataFusion, and it is quite a cool list of people and affiliations if I do say so myself 🤗
datafusion.apache.org/contributor-...

September 19, 2025 at 3:26 PM

Andrew Lamb

@andrewlamb1111.bsky.social

It was a great time on Monday at the @apachedatafusion.bsky.social meetup in NYC. We heard about distributed query plans, filter pushdown, geospatial support, and VegaFusion.

More deets here github.com/apache/dataf...

September 17, 2025 at 3:56 PM

Andrew Lamb

@andrewlamb1111.bsky.social

Dynamic Filters for TopK and Join queries landing in DataFusion 50.0.0: datafusion.apache.org/blog/2025/09...

September 11, 2025 at 10:34 AM

Andrew Lamb

@andrewlamb1111.bsky.social

What is LiquidCache in these slides: what-is-liquid-cache.xiangpeng.systems

BTW @xiangpeng.systems is looking for some early adopters who want to be on the bleeding edge. Hit me up if interested

September 10, 2025 at 8:19 PM

Andrew Lamb

@andrewlamb1111.bsky.social

Recording of "Introduction to Variant in @ApacheParquet ": www.youtube.com/watch?v=nlOJ...

Here are the slides: docs.google.com/presentation...

September 6, 2025 at 9:54 AM

Andrew Lamb

@andrewlamb1111.bsky.social

We want `brew install tpchgen-cli` to work, but that requires the project to be "popular" enough according to homebrew (30 forks, 30 watchers and 75 stars)

We just need 6 more forks and 24 watchers. Can you help us out?

github.com/clflushopt/t...

Deets: github.com/clflushopt/t...

September 5, 2025 at 1:16 PM

Andrew Lamb

@andrewlamb1111.bsky.social

Thanks to @clflushopt.bsky.social, make massive TPCH datasets with tpchgen-cli 2.0:

SF1000 (1TB raw, 220GB in @ApacheParquet) in less than 10 mins (6m45s) on an aging laptop

Try it now:

pip install tpchgen-cli
tpchgen-cli --scale-factor 1000 --parts 100 --format=parquet

github.com/clflushopt/t...

September 4, 2025 at 12:51 PM

Andrew Lamb

@andrewlamb1111.bsky.social

It is a common misconception that Parquet requires (slow) reparsing of metadata and is limited to built-in indexing structures.

Using External Indexes, Metadata Stores, Catalogs and Caches to Accelerate Queries on Parquet with @apachedatafusion.bsky.social

datafusion.apache.org/blog/2025/08...

August 15, 2025 at 12:48 PM

Andrew Lamb

@andrewlamb1111.bsky.social

"EDB claimed the new engine, which pushes queries to open source @apachedatafusion.bsky.social , returned queries 30x faster than standard Postgres while tiering offloads cold transactional data to storage is up 18x more cost-efficient."
www.theregister.com/2025/06/20/e...

July 30, 2025 at 12:12 PM

Andrew Lamb

@andrewlamb1111.bsky.social

@apachedatafusion.bsky.social 48.0.0 release. Spark Compatible functions, ORDER BY ALL, FFI for aggregates and window functions: datafusion.apache.org/blog/2025/07...

July 16, 2025 at 12:47 PM

Andrew Lamb

@andrewlamb1111.bsky.social

It is a common misconception that Apache Parquet files are restricted to basic statistics. Footer metadata and offset-based addressing permit user-defined index structures today.

@apachedatafusion.bsky.social blog from Qi Zhi, Jigao Luo and myself explains how: datafusion.apache.org/blog/2025/07...

July 14, 2025 at 1:30 PM

Andrew Lamb

@andrewlamb1111.bsky.social

Belated DataFusion 47.0.0 release blog: datafusion.apache.org/blog/2025/07...

July 11, 2025 at 11:10 AM

Andrew Lamb

@andrewlamb1111.bsky.social

Apache Iceberg compaction in Rust via github.com/nimtable/ice.... (based on @apachedatafusion.bsky.social ). Thanks to RisingWave CEO Yingjun Wu for the shout-out

July 10, 2025 at 10:02 PM
