Lightnews — Scholar-powered news

Joachim Rosskopf

@jrosskopf.bsky.social



Joachim Rosskopf

@jrosskopf.bsky.social

MPP vs. Single-Node Engines

Small workloads? Use @DuckDB or @Polars for fast in-memory performance.
Massive datasets? MPP systems like @Spark or @Snowflake scale dynamically.

Experiment: @DuckDB outperformed Spark at <100GB.

💡 Don't go grocery shopping in a tank!
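
As a rough illustration of the rule of thumb in this post, here is a minimal sketch that picks an engine class from dataset size. The function name `choose_engine` and its parameters are hypothetical; the 100 GB default comes from the experiment mentioned above and is an assumption to tune for your own hardware, not a hard limit.

```python
def choose_engine(dataset_gb: float, single_node_limit_gb: float = 100.0) -> str:
    """Suggest an engine class for a dataset of the given size.

    The 100 GB default mirrors the experiment cited in the post; it is an
    assumed cut-off, not a measured property of any particular system.
    """
    if dataset_gb < 0:
        raise ValueError("dataset size cannot be negative")
    # Below the cut-off, a single-node engine (e.g. DuckDB or Polars) is the
    # suggested starting point; at or above it, consider an MPP system.
    return "single-node" if dataset_gb < single_node_limit_gb else "mpp"


print(choose_engine(20))    # single-node
print(choose_engine(5000))  # mpp
```

Keeping the cut-off as a parameter makes it easy to re-run the comparison on your own data before committing to either side.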

December 10, 2024 at 10:51 AM

Joachim Rosskopf

@jrosskopf.bsky.social

The Future of Distributed Systems

Object storage like S3 has become a database: scalable & efficient for transactional & analytical workloads.

Emerging programming models:
1️⃣ Distributed DBs: On files
2️⃣ Serverless: Focus on code
3️⃣ Wasm: Portable execution

Challenge: "one-way-door" innovation

December 8, 2024 at 10:51 AM

Joachim Rosskopf

@jrosskopf.bsky.social

The Iceberg Effect

Modern data is evolving:
→ Iceberg now leads open table formats (Snowflake & Databricks adoption confirms it).
→ Cloud-native storage is a must (legacy systems won’t keep up).
→ AI thrives on scalable, open architectures.

More innovation. Less vendor lock-in.
Ready to shift?

December 6, 2024 at 3:14 PM

Joachim Rosskopf

@jrosskopf.bsky.social

What Do Data Warehouses Really Do?

→ $300K/year on Snowflake, and 90% is spent on queries.
→ Most queries are tiny (median: 100MB, 99.9% <300GB).
→ Most workloads = ingestion + transformation (not analytics).

💡 Small Data > Massive Complexity.
Are we overpaying for simplicity?

The plot breaks down the cost of various query types in Redshift and Snowflake data warehouses:

Ingest: Involves bringing new data into the system and merging it with existing data.
Transformation: Converts raw data into simplified, easy-to-query views, making it more usable for business applications.
Read: Focuses on business intelligence dashboards and data science workloads, extracting insights from the data.
Export: Sends data out of the data warehouse for use in other systems.
Other: Consists mainly of system maintenance functions necessary to keep the data warehouse operational.

Of queries that scan at least 1 MB, the median query scans around 100 MB, while the 99.9th percentile reaches 300 GB. Despite being 'massively parallel processing' systems, databases like Snowflake and Redshift mainly handle queries that could easily fit on a single large node.
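
To check whether the same pattern holds in your own warehouse, a minimal sketch along these lines may help. It assumes you can export per-query scan sizes in MB from your query history; the sample values, the helper names, and the 256 GB node capacity are all made up for illustration.

```python
import math
import statistics


def summarize_scans(scan_sizes_mb, min_scan_mb=1.0, percentile=99.9):
    """Return (median, percentile value) over queries scanning >= min_scan_mb."""
    sizes = sorted(s for s in scan_sizes_mb if s >= min_scan_mb)
    if not sizes:
        raise ValueError("no queries at or above the minimum scan size")
    # Nearest-rank percentile: the smallest value with at least
    # `percentile` percent of the observations at or below it.
    rank = math.ceil(percentile / 100 * len(sizes))
    return statistics.median(sizes), sizes[rank - 1]


def share_fitting_on_node(scan_sizes_mb, node_capacity_mb):
    """Fraction of all queries whose scan size fits within one node's capacity."""
    sizes = list(scan_sizes_mb)
    return sum(1 for s in sizes if s <= node_capacity_mb) / len(sizes)


# Hypothetical scan sizes in MB (not taken from the plot).
sample_mb = [0.2, 5, 40, 100, 100, 250, 2048, 307200]

median_mb, p999_mb = summarize_scans(sample_mb)
share = share_fitting_on_node(sample_mb, node_capacity_mb=256 * 1024)
print(median_mb, p999_mb, share)
```

The 1 MB floor matches the filter described in the image text; the node capacity is the knob to adjust to whatever "a single large node" means in your environment.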

December 4, 2024 at 10:51 AM

Joachim Rosskopf

@jrosskopf.bsky.social

Think Small. Make Big Impact.

More Data ≠ Better Results.
→ Recent data is the most valuable.
→ Smaller AI models deliver bigger impact.
→ Local-first development works.

Stop relying on distributed complexity when single machines get the job done.

The #SmallData Movement is here. Are you in?

December 2, 2024 at 10:51 AM

Joachim Rosskopf

@jrosskopf.bsky.social

BigData isn’t the problem—it never was.

Most enterprises have <100GB in active data but overpay for tools designed for massive scale (#Snowflake, #Databricks, etc.).

Focus on #SmallData:
→ Easier to analyze
→ Cheaper to manage
→ Faster insights

Time to rethink your data strategy. #SmallData

November 30, 2024 at 3:44 PM
