Lightnews — Scholar-powered news

Reposted

DuckDB

@duckdb.org

The PyData Amsterdam 2025 keynote “Minus Three Tier: Data Architecture Turned Upside Down” by @hannes.muehleisen.org is out now.

www.youtube.com/watch?v=DxwD...

KEYNOTE: Hannes Mühleisen - Data Architecture Turned Upside Down | PyData Amsterdam 2025

YouTube video by PyData

www.youtube.com

October 31, 2025 at 2:05 PM

Reposted

Andy Pavlo

@andypavlo.bsky.social

New database leaderboard from Yellowbrick ranks the quality of DBMS optimizer estimates and plans. They only evaluate TPC-H for now and report results for Postgres + DuckDB + MSSQL: sql-arena.com/components/p...
Repo: github.com/sql-arena/db...
LinkedIn Group: www.linkedin.com/groups/15775...

SQL Arena Planner Ranking (November 2025)

November 3, 2025 at 5:07 PM

Reposted

CMU Database Group

@db.cs.cmu.edu

Today's Future Data Systems Seminar Speaker: Ian Cook (@ian.columnar.tech) will present @columnar.tech's work on Apache Arrow's database connectivity API (ADBC). ADBC is available in modern DBMSs. Zoom talk open to public at 4:30pm ET. YouTube video available after: db.cs.cmu.edu/events/futur...

[Future Data] Where We're Going, We Don't Need Rows: Columnar Data Connectivity with ADBC - Carnegie Mellon Database Group

ADBC (Arrow Database Connectivity) is Apache Arrow’s answer to ODBC and JDBC:... Read More +

db.cs.cmu.edu

October 20, 2025 at 11:38 AM

Reposted

CMU Database Group

@db.cs.cmu.edu

Today's Future Data Systems Seminar Speaker: Will Manning (@willmanning.com) will present @spiraldb.com's Vortex file format. Vortex is now a @linuxfoundation.org project. Zoom talk open to public at 4:30pm ET. YouTube video available after: db.cs.cmu.edu/events/futur...

[Future Data] Vortex: LLVM for File Formats - Carnegie Mellon Database Group

Apache Parquet revolutionized columnar storage after its initial release in 2013, but... Read More +

db.cs.cmu.edu

October 13, 2025 at 11:10 AM

xevix.bsky.social

@xevix.bsky.social

Processing 100 TB of CSV files on a single machine is insane: a little over 1 hr per query, even on a powerful AWS instance. Question heavily the need for complex systems when this is what’s possible now. Can’t wait for the full write-up. Incredible work.

duckdb.org/2025/10/09/b...

Benchmark Results for DuckDB v1.4 LTS

DuckDB v1.4 LTS is both fast and scalable. In in-memory mode, it is the fastest system on ClickBench. In disk-based mode, it can run complex analytical queries on a dataset equivalent to 100 TB CSV fi...

duckdb.org

October 10, 2025 at 2:12 PM

xevix.bsky.social

@xevix.bsky.social

Taking the DuckDB hoodie on a trip. Not exactly Amsterdam but I’ve heard they like columnar databases here too.

October 4, 2025 at 12:06 PM

xevix.bsky.social

@xevix.bsky.social

Congrats to DuckDB team on LTS release w/ many great improvements! Hidden among them you can now use Hive filtering with read_blob, and SHOW TABLES FROM specific db w/o USE.

DuckDB @duckdb.org · Sep 16

📈 DuckDB 1.4.0 is out! This is our first LTS release which comes with *one year of community support*. It also supports database encryption, the MERGE SQL statement and Iceberg writes.

For more details, read the announcement blog post at
duckdb.org/2025/09/16/a...

September 16, 2025 at 4:25 PM

Reposted

DuckDB

@duckdb.org

📈 DuckDB 1.4.0 is out! This is our first LTS release which comes with *one year of community support*. It also supports database encryption, the MERGE SQL statement and Iceberg writes.

For more details, read the announcement blog post at
duckdb.org/2025/09/16/a...

September 16, 2025 at 11:55 AM

xevix.bsky.social

@xevix.bsky.social

I tried loading eBird data (1.5B rows CSV ZIP) using DuckDB for fun, inspired by a ClickHouse blog post and a bit of curiosity. Both did well: DuckDB was slightly faster at querying and Parquet ingest, while ClickHouse has native zip support and is optimized for ingest and multitenancy. xevix.medium.com/ebird-in-duc...

eBird in DuckDB

I saw this post by the Clickhouse team which was doing a cool test of the eBird dataset from Cornell University, and wondered how DuckDB…

xevix.medium.com

September 2, 2025 at 1:15 AM

Reposted

PVLDB

@pvldb.bsky.social

Vol:18 No:8 → Saving Private Hash Join
👥 Authors: Laurens Kuiper, Paul Gross, Peter Boncz, Hannes Mühleisen
📄 PDF: https://www.vldb.org/pvldb/vol18/p2748-kuiper.pdf

August 3, 2025 at 6:00 AM

xevix.bsky.social

@xevix.bsky.social

Is there too much duplicated effort in data tools? I sometimes wonder about this.

xevix.medium.com/data-tool-co...

Data Tool Component Sharing

There are many partly overlapping tools in the data world, which is what inspired things like Calcite to have modular components for…

xevix.medium.com

August 29, 2025 at 8:08 PM

xevix.bsky.social

@xevix.bsky.social

Compiling DuckDB on Windows 11 (ARM) using UTM VM on macOS to debug Windows compile issues. It's a shame msvc doesn't exist outside of Windows, mingw/clang don't work the same and cross-compiling is tricky. Compiling takes 5-10 mins (instead of 1-2 mins native), but it works 🎉!

August 25, 2025 at 9:30 PM

xevix.bsky.social

@xevix.bsky.social

Stretching DuckDB w/ Common Crawl, ~1.7B rows, ~300 parquet files. ~2-3s for single-column aggregations, ~2-3 mins to SUMMARIZE the data, peaking at ~12-14GB memory usage. Not exactly real-time, but the fact you can do this on a laptop with no server setups or Spark pipelines is still amazing.

August 15, 2025 at 3:10 AM

xevix.bsky.social

@xevix.bsky.social

Neat little hack to get Hive partition list in DuckDB, useful for an overview. Might be neat to have built-in. gist.github.com/xevix/04f33d...

August 12, 2025 at 8:14 PM
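
For readers who cannot open the gist, here is a minimal sketch of one way to get a Hive partition overview. It is not the gist's method (the link above is truncated and its contents are not reproduced here). It parses `key=value` path segments in plain Python; the function name `hive_partitions` and the sample paths are hypothetical.

```python
from collections import Counter


def hive_partitions(paths):
    """Count files per Hive partition, parsed from key=value path segments."""
    counts = Counter()
    for path in paths:
        # Keep only directory segments; the last segment is the file name.
        segments = path.split("/")[:-1]
        keys = tuple(
            tuple(seg.split("=", 1)) for seg in segments if "=" in seg
        )
        counts[keys] += 1
    return counts


# Hypothetical file listing, e.g. collected with glob or an object-store listing.
files = [
    "data/region=eu/day=2025-08-01/part-0.parquet",
    "data/region=eu/day=2025-08-01/part-1.parquet",
    "data/region=us/day=2025-08-02/part-0.parquet",
]

for partition, n_files in sorted(hive_partitions(files).items()):
    print(dict(partition), n_files)
```

The same overview could presumably be produced inside DuckDB from the file names it reads, but that query is left out here because the gist's exact approach is not shown in the post.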

xevix.bsky.social

@xevix.bsky.social

Added an Automator quick action to run sqlfluff for formatting SQL in browser fields, used here in the DuckDB UI. Only needs sqlfluff, optionally configure rules. Would be cool to get built-in one day, but works for now.

June 29, 2025 at 7:46 PM

xevix.bsky.social

@xevix.bsky.social

Apache Drill allowed storing metadata in an RDBMS, Iceberg scaling data, Arrow scaling columnar memory, Parquet columnar storage, Spark distributed compute, DuckDB single-node compute. DuckLake scales metadata and storage w/ compute on single node. MotherDuck distributes compute.

June 26, 2025 at 11:17 PM

Reposted

Andy Pavlo

@andypavlo.bsky.social

Shots fired by @firebolthq.bsky.social with their new on-prem executable (www.firebolt.io/blog/introdu...). They have dethroned the Umbra system by The Germans™ at @tum.de in the ClickBench rankings: benchmark.clickhouse.com

June 24, 2025 at 11:10 PM

xevix.bsky.social

@xevix.bsky.social

Vibe coding NOAA GHCN weather visualization from scratch w/ Claude Code and DuckDB MCP. There's Evidence and other vis tools but I don't want a pre-cached set of data, I want it to query live. Cool that this can be put together w/o writing my own HTML/CSS, as a web backend dev 😅

June 21, 2025 at 10:58 PM

xevix.bsky.social

@xevix.bsky.social

Sped up Hive-partitioned query in DuckDB by directly adding Hive keys to the filepath rather than filtering in WHERE. 2-level Hive partition w/ 300 level-1, 100 level-2 partitions. 300ms -> 20ms filtering a single level-1 partition which is fantastic.

June 21, 2025 at 8:08 PM
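
A minimal sketch of the two query shapes the post above contrasts, assuming a two-level layout such as `data/region=.../day=.../*.parquet`. The post does not show its actual paths or queries, so the root `data`, the key names `region` and `day`, and the helper `hive_glob` are all hypothetical. The code only builds SQL strings; `read_parquet` with `hive_partitioning = true` is DuckDB's reader option for such layouts.

```python
def hive_glob(root, levels, pinned=None, pattern="*.parquet"):
    """Build a glob that pins known Hive keys into the path and wildcards the rest."""
    pinned = pinned or {}
    parts = [root.rstrip("/")]
    for key in levels:
        parts.append(f"{key}={pinned.get(key, '*')}")
    parts.append(pattern)
    return "/".join(parts)


LEVELS = ["region", "day"]  # hypothetical partition keys, level 1 then level 2

# Variant 1: glob every partition, then filter on the partition column in WHERE.
all_files = hive_glob("data", LEVELS)
filter_in_where = (
    f"SELECT count(*) FROM read_parquet('{all_files}', hive_partitioning = true) "
    "WHERE region = 'eu'"
)

# Variant 2: put the level-1 key into the path so only that directory is globbed.
one_region = hive_glob("data", LEVELS, {"region": "eu"})
filter_in_path = (
    f"SELECT count(*) FROM read_parquet('{one_region}', hive_partitioning = true)"
)

print(filter_in_where)
print(filter_in_path)
```

With the key in the path, the glob only has to expand one level-1 directory instead of all of them, which is a plausible reason for the speedup the post reports; the post itself does not state the cause.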

xevix.bsky.social

@xevix.bsky.social

Good in-depth demonstration of the power of R-tree and H3 with benchmarks.

www.architecture-performance.fr/ap_blog/spat...

Spatial queries in DuckDB with R-tree and H3 indexing - Architecture et Performance

www.architecture-performance.fr

June 19, 2025 at 7:47 AM

xevix.bsky.social

@xevix.bsky.social

Veo 3 to generate a duck writing SQL. Bias is still an issue, but pretty amusing.

June 8, 2025 at 1:22 AM

xevix.bsky.social

@xevix.bsky.social

Been trying to use Cursor+Gemini Pro 2.5 in agent mode to add a new feature. It's gone into a lot of loops, and then fell into full-on desperation 😂. Laughing pretty hard at its attempts. I think software engineering is safe for a bit longer.

June 7, 2025 at 3:10 AM

xevix.bsky.social

@xevix.bsky.social

Tested Claude Desktop DuckDB MCP on 275 years of weather parquet data on my local filesystem, asking for analysis, and it concluded human-caused global warming unprompted 😶‍🌫️. Cool that it could tell when to check the filesystem and when to use DuckDB for SQL querying.

June 5, 2025 at 11:27 PM

xevix.bsky.social

@xevix.bsky.social

New text to SQL benchmark from BirdSQL folks, with complex BI tasks etc. Latest models already doing very well at ~40% success. Mindblowing how many hours of work will be saved once T2S is widespread.

livesqlbench.ai

LiveSQLBench

A Dynamic and Contamination-Free Benchmark for Evaluating LLMs on Real-World Text-to-SQL Tasks

livesqlbench.ai

May 31, 2025 at 12:45 AM

xevix.bsky.social

@xevix.bsky.social

What does DuckLake enable that we can't do today? Basic Datalake without Snowflake, without Iceberg's issues. Extreme scalability not needed for most people, but organized cloud-backed SQL querying is useful for a lot of people. Multiuser via e.g. Postgres and S3 is a low barrier to entry in 2025.

DuckDB @duckdb.org · May 27

Today we're launching DuckLake, an integrated data lake and catalog format powered by SQL. DuckLake unlocks next-generation data warehousing where compute is local, consistency central, and storage scales till infinity. DuckLake is an open standard and we implemented it in the "ducklake" extension.

May 28, 2025 at 2:19 AM
