The largest publicly available corpus sourced exclusively from PDFs, containing about 3 trillion tokens across 475 million documents in 1733 languages.
- Long-context documents
- 3T tokens from high-demand domains like legal and science.
- Substantially improves over SoTA.
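A rough sense of what working with a corpus like this might look like, assuming it is hosted on the Hugging Face Hub and can be streamed with the `datasets` library; the dataset id and field names below are placeholders, not the real ones.

```python
# Hypothetical sketch: stream a few documents from a PDF-derived corpus.
# "example-org/pdf-corpus" and the "lang"/"text" fields are placeholders.
from datasets import load_dataset

ds = load_dataset("example-org/pdf-corpus", split="train", streaming=True)

for i, doc in enumerate(ds):
    # Each record is assumed to carry at least the extracted text and a language tag.
    print(doc.get("lang"), doc.get("text", "")[:200])
    if i == 2:
        break
```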
So I made my own with #rstats, Observable, and #QuartoPub! stats.andrewheiss.com/hack-your-way/
“To escape our two-party trap, we need a better system of electing people to Congress: proportional representation,” write Jesse Wegman and Lee Drutman.
Instead, a pre-trained neural network is: the new TabPFN, as we just published in Nature 🎉
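For readers who want to try it, TabPFN ships a scikit-learn-style interface; a minimal sketch assuming the `tabpfn` package's `TabPFNClassifier` (constructor options vary across versions):

```python
# Minimal sketch: TabPFN on a small tabular task via its scikit-learn-style API.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from tabpfn import TabPFNClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = TabPFNClassifier()      # pre-trained; no task-specific training loop
clf.fit(X_train, y_train)     # fit mainly stores the training data as context
pred = clf.predict(X_test)
print(accuracy_score(y_test, pred))
```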
www.singularity.games/singularity-...
This algorithm will tell you "no" all the time. It has been shown to be up to 95% accurate in situations with a prevalence of 5% and, *what is even better*, even *more accurate* in rarer diseases.
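The arithmetic behind the joke: a model that always answers "no" has accuracy equal to one minus the prevalence, so the rarer the disease, the "better" it looks while catching no one. A tiny sketch:

```python
# Accuracy of a classifier that always answers "no", as a function of prevalence.
# Accuracy = fraction of true negatives = 1 - prevalence; sensitivity is always 0.
for prevalence in (0.05, 0.01, 0.001):
    accuracy = 1 - prevalence
    print(f"prevalence {prevalence:.1%}: accuracy {accuracy:.1%}, sensitivity 0%")
```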
The difference between "no evidence that it works" and "evidence that it doesn't work" is
1. extremely confused linguistically
2. extremely important epistemically
3. surprisingly continuous in practice.
The importance of a null study result depends entirely on the study's statistical power.
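A small simulation illustrating that last point: the same null result is far more informative from a well-powered study than from an under-powered one. This is an illustrative sketch, not from the original post; the sample sizes and effect size are made up.

```python
# Simulate how often a real (small) effect is detected at two sample sizes.
# A null result from the under-powered design says little; from the well-powered
# design it comes much closer to "evidence that it doesn't work".
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_effect = 0.3          # standardized mean difference, assumed for illustration
n_sims = 2000

for n in (20, 200):        # per-group sample sizes (made up)
    rejections = 0
    for _ in range(n_sims):
        control = rng.normal(0.0, 1.0, n)
        treated = rng.normal(true_effect, 1.0, n)
        _, p = stats.ttest_ind(treated, control)
        rejections += p < 0.05
    print(f"n={n} per group: power ≈ {rejections / n_sims:.2f}")
```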
@alanjeffares.bsky.social & I suspected that answers to this are obfuscated by the two being considered very different algorithms 🤔
Instead we show they are more similar than you'd think, making their differences smaller but predictive! 🧵 1/n
For NeurIPS (my final PhD paper!), @alanjeffares.bsky.social & I explored if and how smart linearisation can help us better understand and predict numerous odd deep learning phenomena, and learned a lot… 🧵 1/n
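A toy sketch of the kind of linearisation being referred to: approximating a network by its first-order Taylor expansion in the parameters around a reference point. This is a generic illustration of the idea, not the paper's actual method; the tiny model and parameter values are made up.

```python
# Linearised approximation of a tiny "network" in its parameters:
# f_lin(theta) = f(theta0) + J(theta0) @ (theta - theta0), Jacobian via finite differences.
import numpy as np

def net(theta, x):
    # toy one-hidden-unit model: parameters w1, b1, w2 packed into theta
    w1, b1, w2 = theta
    return w2 * np.tanh(w1 * x + b1)

def jacobian(theta, x, eps=1e-6):
    # numerical gradient of the output w.r.t. each parameter
    grads = np.zeros_like(theta)
    for i in range(len(theta)):
        tp, tm = theta.copy(), theta.copy()
        tp[i] += eps
        tm[i] -= eps
        grads[i] = (net(tp, x) - net(tm, x)) / (2 * eps)
    return grads

theta0 = np.array([0.5, -0.2, 1.5])               # reference (e.g. initial) parameters
theta = theta0 + np.array([0.3, 0.1, -0.4])       # "trained" parameters (made up)
x = 0.8

f_lin = net(theta0, x) + jacobian(theta0, x) @ (theta - theta0)
print("full model:", net(theta, x), "linearised:", f_lin)
```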