Data Elixir
dataelixir.com
@dataelixir.com
Data Elixir is a weekly newsletter with curated data science picks from around the web. Subscribe at dataelixir.com and follow us here for selections between issues. Covering machine learning, data visualization, analytics, and strategy.
Thinking Machines Lab solved a problem everyone accepted as unsolvable: LLM nondeterminism at temperature 0. Same prompt, same model, 1000 runs → 80 different outputs. With batch-invariant kernels? Bitwise identical every time. Open sourced. www.distributedthoughts.org/will-i-make-...
Will I Make It To The Restaurant Before The Soup Dumplings Get Cold? (And Other Problems In Machine Learning)
I'm chronically late. Not because I want to be rude - I feel terrible about it every single time - but because I'm catastrophically bad at predicting how long it takes to get anywhere. Turns out…
www.distributedthoughts.org
November 8, 2025 at 3:47 AM
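One commonly cited ingredient of this kind of nondeterminism is that floating-point addition is not associative, so the order in which partial sums are combined can change the last bits of a result. The sketch below is only an illustration of that effect in plain Python. It is not the batch-invariant kernels from the post, and `chunked_sum` is a hypothetical helper invented for this example.

```python
import random

# Floating-point addition is not associative: grouping changes the rounding.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False


def chunked_sum(values, chunk_size):
    """Add values chunk by chunk, then add the partial sums together.

    Hypothetical stand-in for a reduction whose split depends on batch size.
    """
    total = 0.0
    for start in range(0, len(values), chunk_size):
        partial = 0.0
        for v in values[start:start + chunk_size]:
            partial += v
        total += partial
    return total


random.seed(0)
values = [random.uniform(-1.0, 1.0) for _ in range(10_000)]

# Same numbers, different reduction splits. The printed results may differ
# in their trailing digits depending on the chunk size.
for size in (1, 8, 64):
    print(size, repr(chunked_sum(values, size)))
```

A batch-invariant reduction, in this framing, would fix the split so the result does not depend on how many other requests happen to share the batch.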
Most marketplaces have SKUs. Etsy has 100M+ unique items with no standard attributes. How do you build filters when one listing is a "porcelain sculpture that looks like a t-shirt" and dimensions live in random photo text? www.etsy.com/codeascraft/...
www.etsy.com
November 6, 2025 at 3:37 AM
GeoUtil converts between GeoJSON, TopoJSON, Shapefile, KML, WKT, and CSV without touching a server. TopoJSON compression alone cuts file sizes 80%+ while preserving topology. All free, all browser-based. geoutil.com
GeoUtil — Free Online Map & Geography Tools
All-in-one online geography toolkit. Measure distance & area, convert GeoJSON, TopoJSON, JSON, merge or minify files, and more — fast, free, and browser-based.
geoutil.com
October 31, 2025 at 2:47 PM
Debugging constraint problems is backwards: remove constraints until something works, then figure out what broke. No stack traces, just an "unsatisfiable." Forces you to think differently about what you're actually asking the system to solve. www.righto.com/2025/10/solv...
Solving the NYTimes Pips puzzle with a constraint solver
The New York Times recently introduced a new daily puzzle called Pips. You place a set of dominoes on a grid, satisfying various condition...
www.righto.com
October 30, 2025 at 3:04 PM
Most interpretable models sacrifice accuracy. Most accurate models are black boxes. TRUST breaks this trade-off by combining parametric and non-parametric approaches and offers full prediction explanations without losing performance. Published at PRICAI 2025.
trust-free
Transparent, Robust & Ultra-Sparse Trees (TRUST™) - Free Version
pypi.org
October 29, 2025 at 2:37 PM
Data products shouldn't live forever by default. Netflix treats outdated metrics like deprecated software and actively sunsets them. The cost of maintaining zombie datasets? Lost trust and accumulated technical debt that blocks innovation.
Data as a Product: Applying a Product Mindset to Data at Netflix
Introduction: What if we treated data with the same care and intentionality as a consumer-facing product? Adopting a “data as a product”…
netflixtechblog.medium.com
October 29, 2025 at 12:13 AM
Reposted by Data Elixir
What will you be using this #30DayMapChallenge? Cadence is offering its users £2,500 worth of prizes this #30daymapchallenge. And every user (new or existing) gets a free Professional upgrade for November! Learn more: cadence.cityscience.com/blog/30-day-...
30 Day Map Challenge
The 30 Day Map Challenge with Cadence – November 2025  This November, Cadence is proud to support the 30 Day Map Challenge – a global celebration of maps, creativity and storytelling. Whether …
cadence.cityscience.com
October 24, 2025 at 12:20 PM
Why does neural network training almost never fail? Pure combinatorics. A 6K parameter network contains 10^1089 possible sparse subnetworks. That's 10^900 solutions per atom in the universe. We're not smart, we're just brute forcing.
Sparse Networks and Lottery Winners
Embedding Space is a blog about machine learning and artificial intelligence.
embedding-space.github.io
October 24, 2025 at 2:47 PM
The best part about #30DayMapChallenge is that it's tool-agnostic. Whether you're using QGIS, Python's geopandas, R's rayshader, or even Blender for 3D visualizations, the focus is on creativity over tech stack. No programming required.
30DayMapChallenge
Daily mapping challenge happening every November!
30daymapchallenge.com
October 23, 2025 at 1:06 PM
Stop fitting separate binomial models to compositional data! Your predictions that 130% of respondents chose option A reveal fundamental model misspecification. Dirichlet regression with Gaussian processes respects constraints.
Compositional modeling of plant communities with Dirichlet regression | GAMbler
Compositional data appears everywhere in scientific research, yet many analysts fall back on problematic approaches that ignore fundamental mathematical constraints. I demonstrate how Dirichlet…
ecogambler.netlify.app
October 23, 2025 at 2:37 AM
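A minimal sketch of the constraint problem, assuming three made-up category scores: independent per-category probabilities have no reason to sum to 1, while a softmax over the same scores always does. This is not the Dirichlet regression from the linked post, only an illustration of why shares need to be modeled jointly.

```python
import math


def sigmoid(x):
    """Logistic function, as used by a separate binomial model per category."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    """Map scores to shares that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical linear-predictor values for three categories (assumed numbers).
scores = [1.2, 0.4, -0.3]

independent = [sigmoid(s) for s in scores]  # each fitted on its own
joint = softmax(scores)                     # fitted as one composition

print("independent total:", sum(independent))  # exceeds 1 for these scores
print("joint total:", sum(joint))
```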
AI systems are now generating, testing, and validating their own hypotheses. DeepMind's Co-Scientist and Stanford's Virtual Lab represent something new: AI as actual scientific collaborator, not just a fancy search engine.
State of AI Report 2025
The State of AI Report analyses the most interesting developments in AI. Read and download here.
www.stateof.ai
October 17, 2025 at 2:47 PM
Black and white, hand-drawn data viz countering visual noise at one of NYC's busiest hubs. Smart choice. Sometimes the most effective data art isn't about adding more complexity but finding clarity in the chaos through intentional restraint.
‘A Data Love Letter to the Subway’
A data-driven animation for Fulton Center commissioned by MTA Arts & Design for its 40th anniversary.
www.pentagram.com
October 17, 2025 at 1:01 AM
The metrics that matter most are the hardest to measure. Bookings take weeks to materialize, but clicks happen instantly. The trap: optimizing for clicks can actually decrease bookings. Correlation isn't causation, especially in A/B tests.
How to estimate correlation between metrics from past A/B tests
Authors: Miha Gazvoda, Christina Katsimerou
booking.ai
October 15, 2025 at 2:37 PM
Psychology's dirty secret: "data available upon request" usually means data not available at all. Researchers found systematic patterns in why data disappears over time. Open-washing is real and it's undermining reproducibility.
LnuOpen | Meta-Psychology
Many journals now require data sharing and require articles to include a Data Availability Statement. However, several studies over the past two decades have shown that promissory notes about data…
open.lnu.se
October 15, 2025 at 12:13 AM
Reposted by Data Elixir
Today my @nytimes.com colleagues and I are launching a new series called Lost Science. We interview US scientists who can no longer discover something new about our world, thanks to this year's cuts. Here is my first interview with a scientist who studied bees and fires. Gift link: nyti.ms/3IWXbiE
nyti.ms
October 8, 2025 at 11:29 PM
Parquet is showing its age. CMU researchers built F3 - a columnar format that embeds WebAssembly decoders directly in files. Universal compatibility without the usual compatibility hell. Smart approach for modern ML workloads.

db.cs.cmu.edu/papers/2025/...
db.cs.cmu.edu
October 13, 2025 at 1:43 PM
Mathematicians feared nuclear winter would freeze Earth, but it turns out CO2 might do it instead. The math behind climate tipping points is fascinating and terrifyingly unpredictable. Sometimes the thing you're not watching is the real threat.
The Math of Climate Change Tipping Points | Quanta Magazine
Tipping points in our climate predictions are both wildly dramatic and wildly uncertain. Can mathematicians make them useful?
www.quantamagazine.org
October 10, 2025 at 3:43 PM
Reposted by Data Elixir
Data dictionary template: osf.io/ynqcu
Project summary template: osf.io/q6g8d
Dataset level README template: osf.io/tk4cb
October 8, 2025 at 2:54 PM
"Silicon samples" - using LLMs to generate fake survey responses instead of recruiting humans. Sounds efficient until you realize small model tweaks completely flip your results. Shortcuts in research usually aren't.
The threat of analytic flexibility in using large language models to simulate human data: A call to attention
Social scientists are now using large language models to create "silicon samples" - synthetic datasets intended to stand in for human respondents, aimed at revolutionising human subjects research.…
arxiv.org
October 9, 2025 at 1:08 PM
Your ggplot2 charts work fine, but are they memorable? Real color engineering: brightness first (strongest differentiator), then hue, finally saturation. Most people get this backwards and wonder why their viz falls flat. www.chartography.net/p/color-engi...
Color Engineering
The tool that snaps mercurial design into mechanical focus.
www.chartography.net
October 8, 2025 at 5:49 PM
Tuesday Picks!...
September 30, 2025 at 1:28 PM
We're going to get back into sharing useful posts for the week. Here's this week's shortlist...
September 25, 2025 at 3:56 PM
Reposted by Data Elixir
Technical writing is hard bcs "writing is thinking" but we often should tell our story not in the order we worked. Solution? I wrote a quick post on how @quarto.org's embed shortcodes can reframe technical writing as reproducible evidence curation

www.emilyriederer.com/post/quarto-...

🧵 (1/n)
How Quarto embed fixes data science storytelling | Emily Riederer
Literate programming excels at capturing our stream of conscience. Our stream of conscience does not excel at explaining the impact of our work. Notebooks enable some of data scientists’ worst tendenc...
www.emilyriederer.com
July 27, 2025 at 1:14 PM
Reposted by Data Elixir
We've released 4 new chapters of Applied Machine Learning for Tabular Data.

Includes: Bayesian optimization, feature selection, model comparisons, classification metrics, calibration, #rstats computing sections, and more

blog.aml4td.org/posts/2025-0...
Part 3 is Finished, Part 4 Started – Applied Predictive Modeling Blog
blog.aml4td.org
July 25, 2025 at 4:53 PM
Reposted by Data Elixir
🧪 If you had your NSF grant terminated, I'd *highly* recommend going to this Friday's webinar.

Friday May 9 at 2-3 pm ET

Will talk about both the appeals process and allowable closeout costs. Share widely.

Register here: us02web.zoom.us/webinar/regi...
May 5, 2025 at 8:29 PM