Jack Vanlightly
@vanlightly.bsky.social
Researcher, advisor, writer, formal verification eng @ Confluent.
Everything data (dist sys, databases, messaging, data eng/analytics).

https://jack-vanlightly.com, https://www.hotds.dev
Credit: ESO/B. Tafresh
Stream-order vs batch-order in Iceberg:
* Flink wants temporal locality.
* Spark wants value locality.

Same table, conflicting physics.

New post: jack-vanlightly.com/blog/2025/11...
How Would You Like Your Iceberg Sir? Stream or Batch Ordered? — Jack Vanlightly
Today I want to talk about stream analytics, batch analytics and Apache Iceberg. Stream and batch analytics work differently but both can be built on top of Iceberg, but due to their differences there...
jack-vanlightly.com
November 5, 2025 at 2:52 PM
Three KIPs (1150, 1176, 1183) all target Kafka’s cross-AZ replication costs, but there is a wider question at stake.

My new post explains the KIPs and the trade-offs between reusing old abstractions and embracing stateless compute over S3.

jack-vanlightly.com/blog/2025/10...
A Fork in the Road: Deciding Kafka’s Diskless Future — Jack Vanlightly
“The Kafka community is currently seeing an unprecedented situation with three KIPs (KIP-1150, KIP-1176, KIP-1183) simultaneously addressing the same challenge of high replica...
jack-vanlightly.com
October 22, 2025 at 12:51 PM
New post: why I’m not a fan of “zero-copy” Iceberg tables for Apache Kafka.
From a systems design view, it trades storage savings for coupling and complexity.
Sometimes, duplication is cheaper than coupling.
jack-vanlightly.com/blog/2025/10...
Why I’m not a fan of zero-copy Apache Kafka-Apache Iceberg — Jack Vanlightly
Over the past few months, I’ve seen a growing number of posts on social media promoting the idea of a “zero-copy” integration between Apache Kafka and Apache Iceberg. The idea is that Kafka topics cou...
jack-vanlightly.com
October 15, 2025 at 1:39 PM
Why don’t Iceberg or Delta Lake have secondary indexes?
Because analytics workloads and OLTP workloads optimize for opposite I/O patterns.

See my dive into data layout, pruning, and what “indexing” really means in open table formats: jack-vanlightly.com/blog/2025/10...
Beyond Indexes: How Open Table Formats Optimize Query Performance — Jack Vanlightly
My career in data started as a SQL Server performance specialist, which meant I was deep into the nuances of indexes, locking and blocking, execution plan analysis and query design. These days I’m mor...
jack-vanlightly.com
October 8, 2025 at 1:01 PM
New deep dive: Understanding Apache Fluss

I spent August reverse-engineering Fluss, Alibaba’s new table storage engine for Flink (partially forked from Kafka). This post covers its architecture, tiering, and how it tackles changelogs & low-latency state.

jack-vanlightly.com/blog/2025/9/...
Understanding Apache Fluss — Jack Vanlightly
This is a data system internals blog post. So if you enjoyed my table formats internals blog posts, or writing on Apache Kafka internals or Apache BookKeeper internals, you might enjoy thi...
jack-vanlightly.com
September 2, 2025 at 12:57 PM
New blog post: A Conceptual Model for Storage Unification.

The post defines what storage unification means, sets out the terminology, and evaluates the building blocks and approaches for achieving it.

jack-vanlightly.com/blog/2025/8/...
A Conceptual Model for Storage Unification — Jack Vanlightly
Object storage is taking over more of the data stack, but low-latency systems still need separate hot-data storage. Storage unification is about presenting these heterogeneous storage systems and form...
jack-vanlightly.com
August 21, 2025 at 1:16 PM
In a future of autonomous AI agents, we can't limit ourselves to error prevention and error detection; we must also include remediation.

jack-vanlightly.com/blog/2025/7/...
Remediation: What happens after AI goes wrong? — Jack Vanlightly
If you’re following the world of AI right now, no doubt you saw Jason Lemkin’s post on social media reporting how Replit’s AI deleted his production database, despite it being told not to touch an...
jack-vanlightly.com
July 28, 2025 at 12:17 PM
Science moves slowly because wrong theories waste decades. Engineering is careful because failures kill people. Software moves fast because mistakes are cheap; the expensive error isn't making the wrong choice, it's taking too long to make any choice. jack-vanlightly.com/blog/2025/7/...
The Cost of Being Wrong — Jack Vanlightly
A recent LinkedIn post by Nick Lebesis caught my attention with this brutal take on the difference between good startup founders and coward startup founders. I recommend you read the entire thing ...
jack-vanlightly.com
July 22, 2025 at 3:09 PM
Where does reliability begin, and where does it end? In distributed business architectures, the answer is responsibility boundaries. New post: jack-vanlightly.com/blog/2025/7/...
Responsibility Boundaries in the Coordinated Progress model — Jack Vanlightly
Building on my previous work on the Coordinated Progress model, this post examines how reliable triggers not only initiate work but also establish responsibility boundaries. Where a reliable tri...
jack-vanlightly.com
July 15, 2025 at 2:16 PM
ChatGPT thought it was Tuesday, so I made fun of it and it admitted it was Wednesday. So I made fun of it again, and it admitted it was...Wednesday. But sure, AI agents are gonna steal my job 🤔
July 3, 2025 at 4:22 PM
ChatGPT has hallucinated so many times for me today. It has invented scientific terms that don't exist and has been quite liberal with plausible answers based on what sounds reasonable, but without any real-world justification. When challenged, it admits its mistake.
June 24, 2025 at 6:32 PM
My musical evolution continues, discovered deep hypnotic drone music today. No drugs required 😄 The Hypnus Records label is great.
June 13, 2025 at 2:33 PM
How do you reliably distribute work across microservices, stream processors, durable execution, event-driven architectures, orchestration and now AI agents?

Coordinated Progress is a four-part series that explores the common structure behind reliable distributed systems.

jack-vanlightly.com/blog/2025/6/...
Coordinated Progress – Part 1 – Seeing the System: The Graph — Jack Vanlightly
At some point, we’ve all sat in an architecture meeting where someone asks, “Should this be an event? An RPC? A queue?”, or “How do we tie this process together across our microservices? Should it ...
jack-vanlightly.com
June 11, 2025 at 2:29 PM
I took a break from social media and my blog for a couple of months. ND burnout. But I'm tentatively back, probably just to post my writing here for now. HOTDS is on pause. Getting back to writing is therapeutic though. I'll post something this week that I've been working on.
June 9, 2025 at 11:23 AM
Another Humans of the Data Sphere is out, with issue 10! In this issue people are talking fsyncs, tips for running ClickHouse at scale, the problems with MCP and more. Plus I dig up a classic paper from 1962. www.hotds.dev/p/humans-of-...
Humans of the Data Sphere Issue #10 April 4th 2025
Your biweekly dose of insights, observations, commentary and opinions from interesting people from the world of databases, AI, streaming, distributed systems and the data engineering/analytics space.
www.hotds.dev
April 4, 2025 at 4:15 PM
Proud to have contributed formal verification (TLA+) for three key improvements in Kafka 4.0:

✅ KIP-966: Strengthens the replication protocol.
✅ KIP-996: Introduces PreVote for more stable KRaft leadership.
✅ KIP-848: Delivers more efficient, predictable rebalancing.
April 3, 2025 at 4:00 PM
Wow, I just discovered gamma wave music. Wrote non-stop for three hours.
March 25, 2025 at 1:01 PM
Any Principal Engineers out there with ADHD or creative wiring — who don’t thrive in the tasks of project coordination, alignment meetings, and people management, but thrive on strategy, system design, writing, and shaping direction through ideas? Curious how you navigate the role.
March 21, 2025 at 1:52 PM
A new disaggregated log replication survey post is out. How does the combination of Apache Pulsar with Apache BookKeeper divide and conquer the responsibilities of log replication? jack-vanlightly.com/blog/2025/3/...
Log Replication Disaggregation Survey - Apache Pulsar and BookKeeper — Jack Vanlightly
In this latest post of the disaggregated log replication survey, we’re going to look at the Apache BookKeeper Replication Protocol and how it is used by Apache Pulsar to form topic partitions. Raft ...
jack-vanlightly.com
March 13, 2025 at 12:53 PM
I think I have an issue with tabs: the count has grown to 371. My workstation is struggling to open Chrome after a restart now.
March 12, 2025 at 7:18 PM
Another Humans of the Data Sphere is out, with issue #9! In this issue, we also look at whether software engineers can learn from mechanical engineering, and at table formats as a form of virtualization. www.hotds.dev/p/humans-of-...
Humans of the Data Sphere Issue #9 March 11th 2025
Your biweekly dose of insights, observations, commentary and opinions from interesting people from the world of databases, AI, streaming, distributed systems and the data engineering/analytics space.
www.hotds.dev
March 11, 2025 at 3:35 PM
A new log replication disaggregation survey post is out!
The Kafka Replication Protocol:
🔹Separation of control plane from data plane.
🔹Role separation with minimal coupling.
🔹Kafka’s alignment with Paxos roles.
jack-vanlightly.com/blog/2025/2/...
Log Replication Disaggregation Survey - Kafka Replication Protocol — Jack Vanlightly
In this post, we’re going to look at the Kafka Replication Protocol and how it separates control plane and data plane responsibilities. It’s worth noting there are other systems that separate concerns...
jack-vanlightly.com
February 21, 2025 at 3:16 PM
Spotify is so bad at recommendations, but ChatGPT is pretty good at them. I give it a song and it lists the different characteristics of the song, then makes a set of recommendations based on those characteristics.
February 19, 2025 at 5:02 PM
The first post in the survey of disaggregated log replication systems is out! It looks at Neon's serverless Postgres write-path, which weaves consensus from heterogeneous components, based on MultiPaxos.

jack-vanlightly.com/blog/2025/2/...
Log Replication Disaggregation Survey - Neon and MultiPaxos — Jack Vanlightly
Over the next series of posts, we'll explore how various real-world systems and some academic papers have implemented log replication with some form of disaggregation. In this first post we’ll look at...
jack-vanlightly.com
February 19, 2025 at 2:56 PM
I updated my post "How to Disaggregate a Log Replication Protocol" to include "separating ordering from IO". Basically, I couldn't ignore CORFU as a way of separating responsibilities! So now we have ways A-F of breaking apart the monolith. jack-vanlightly.com/blog/2025/2/...
How to disaggregate a log replication protocol — Jack Vanlightly
This post continues my series looking at log replication protocols, within the context of state-machine replication (SMR) or just when the log itself is the product (such as Kafka). So far I’ve been l...
jack-vanlightly.com
February 18, 2025 at 5:03 PM