Joachim Rosskopf
jrosskopf.bsky.social
Benchmarks Show This:

→ @DuckDB beats @Spark for small queries.
→ Even at 700GB, DuckDB (native files) is competitive.
→ Spark scales dynamically for 1TB+ workloads.

Details: https://buff.ly/47UvlMc

🔍 The lesson? If data fits on one node, go single-node for speed. Scale to MPP only when needed.
DataFrames at Scale Comparison: TPC-H
Hendrik Makait, Sarah Johnson, Matthew Rocklin | 2024-05-14 | 14 min read
We run benchmarks derived from the TPC-H benchmark suite on a variety of scales, hardware architectures, and dataframe projects...
buff.ly
December 10, 2024 at 10:51 AM
Why Are Object Stores So Attractive?

1️⃣ Scalability: Handle massive amounts of data.
2️⃣ Flexibility: Open formats like Iceberg for interoperability.
3️⃣ Advanced Features: Replication, immutability, and consistency.

They have become the backbone of modern distributed systems.
December 8, 2024 at 10:51 AM
What Are "One-Way Door" Risks?

❌ One-way doors = irreversible decisions.
In tech: adopting new tools or models without clear exit paths.
December 8, 2024 at 10:51 AM
Curious where the data comes from?
🔗 Snowset (Snowflake's dataset): https://buff.ly/4eULXoQ
🔗 Redset (Redshift's dataset): https://buff.ly/3CScB4x

Both share real-world query samples, packed with insights into how data warehouses are used. Check them out!
GitHub - resource-disaggregation/snowset: Snowflake dataset containing statistics for 70 million queries over 14 day period
github.com
December 4, 2024 at 10:51 AM