Raj Movva
rajmovva.bsky.social
NLP, ML & society, healthcare.
PhD student at Berkeley, previously CS at MIT.
https://rajivmovva.com/
This ability to discover unknown concepts opens many opportunities for applied machine learning: we can design better white-box predictors, better audit high-stakes models for bias, and generate hypotheses for computational social science research. More broadly, SAEs can help bridge the "prediction-explanation" gap.
August 5, 2025 at 4:33 PM
These tasks stand in contrast to probing, where we're trying to predict the presence of a *known* concept, and steering, where we're trying to inject a *known* concept into an LLM's output. SAEs lose to simple baselines on these tasks. (Two good papers on this: "AxBench" and Kantamneni, Engels et al. 2025)
August 5, 2025 at 4:33 PM
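For concreteness, the "simple baseline" that probing papers compare SAEs against is usually just a linear probe fit directly on model activations to detect a known concept. A minimal sketch, with synthetic data and illustrative names (not any paper's actual code):

```python
import numpy as np

# Hedged sketch of a linear probe for a *known* binary concept.
# In the cited papers, X would be LLM activations and y a concept label;
# here everything is synthetic and illustrative.

def fit_linear_probe(X, y):
    """Fit a least-squares linear probe (with a bias term) to binary labels."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append bias column
    w, *_ = np.linalg.lstsq(Xb, y.astype(float), rcond=None)
    return w

def probe_predict(w, X):
    """Predict concept presence by thresholding the probe output at 0.5."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return (Xb @ w > 0.5).astype(int)
```

A probe like this has no dictionary to learn and no features to interpret, which is exactly why it is hard to beat when the concept is already known.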
How do we reconcile our view with recent negative results? Our key distinction is that SAEs are useful when you don't know what you're looking for: how does my text classifier predict which headlines will go viral? How does my LLM perform addition? These are "unknown unknowns".
August 5, 2025 at 4:33 PM
📢New POSITION PAPER: Use Sparse Autoencoders to Discover Unknown Concepts, Not to Act on Known Concepts

Despite recent negative results, SAEs aren't dead! They can still be useful for mech interp, and also much more broadly: across FAccT, computational social science, and ML4H. 🧵
August 5, 2025 at 4:33 PM
It was lots of fun to co-lead this with @kennypeng.bsky.social, with coauthors @nkgarg.bsky.social, Jon Kleinberg, and @emmapierson.bsky.social! Feel free to reach out if we can be helpful. Links:

Draft: arxiv.org/abs/2502.04382
Python package: github.com/rmovva/Hypot...
Demo: hypothesaes.org

9/9
March 18, 2025 at 3:17 PM
@kennypeng.bsky.social also built a website to explore results on Yelp, headlines, & Congress datasets: hypothesaes.org.

You can see every SAE neuron in UMAP space, colored by whether the neuron correlates positively or negatively with the target variable. 8/
March 18, 2025 at 3:17 PM
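The view described above (neurons as points in 2D, colored by correlation sign) can be sketched as follows. The demo uses UMAP; PCA stands in here so the sketch is self-contained, and all names are illustrative assumptions, not the site's actual code:

```python
import numpy as np

# Sketch of the neuron map: project each SAE neuron's decoder vector to 2D
# and attach the sign of its correlation with the target variable.
# PCA is an assumed stand-in for the demo's UMAP projection.

def neuron_map(decoder_vectors, activations, target):
    """Return 2D coordinates per neuron and each neuron's correlation sign
    with the target (+1, -1, or 0 for dead neurons)."""
    # PCA: project mean-centered decoder vectors onto the top-2 singular vectors.
    X = decoder_vectors - decoder_vectors.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    coords = X @ Vt[:2].T
    signs = np.sign([
        np.corrcoef(activations[:, j], target)[0, 1]
        if activations[:, j].std() > 0 else 0.0
        for j in range(activations.shape[1])
    ])
    return coords, signs
```

Plotting `coords` with `signs` as the color channel reproduces the kind of view the site shows.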
These natural language interpretations are our hypotheses, which we validate on held-out data. On well-studied social science datasets, we add to prior work: for example, we find that news headlines about social issues or the environment decrease engagement. 3/
March 18, 2025 at 3:17 PM
Starting from blackbox text embeddings—which are expressive, but uninterpretable—we (1) use SAEs to map the embeddings to an interpretable space; (2) filter for features that predict the target (engagement, etc.); (3) interpret those features using LLMs. 2/
March 18, 2025 at 3:17 PM
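The three steps above might look roughly like this in NumPy. Function names and shapes are illustrative assumptions, not the HypotheSAEs API:

```python
import numpy as np

# Hedged sketch of the three-step pipeline: (1) SAE-encode embeddings,
# (2) filter for target-predictive features, (3) interpret with an LLM.

def sae_encode(embeddings, W_enc, b_enc):
    """Step 1: map dense text embeddings into a sparse, interpretable space
    using a (pre-trained) SAE encoder: z = ReLU(x @ W + b)."""
    return np.maximum(embeddings @ W_enc + b_enc, 0.0)

def select_predictive_features(activations, target, n_features=5):
    """Step 2: keep the SAE features most correlated with the target
    (engagement, rating, etc.) on the training split."""
    corrs = np.zeros(activations.shape[1])
    for j in range(activations.shape[1]):
        col = activations[:, j]
        if col.std() > 0:  # skip dead neurons
            corrs[j] = np.corrcoef(col, target)[0, 1]
    return np.argsort(-np.abs(corrs))[:n_features]

# Step 3 (not shown): for each selected feature, collect the texts that
# activate it most strongly and ask an LLM what they have in common;
# that natural-language description is the candidate hypothesis.
```

Correlation-based filtering is just one simple choice for step 2; a sparse regression over the SAE features would serve the same purpose.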
💡New preprint & Python package: We use sparse autoencoders to generate hypotheses from large text datasets.

Our method, HypotheSAEs, produces interpretable text features that predict a target variable, e.g. features in news headlines that predict engagement. 🧵1/
March 18, 2025 at 3:17 PM
An interesting dynamic on Google reviews is that I tend to trust businesses which write custom replies to negative reviews, even if the reply doesn't really resolve the issue (e.g. see below). I distrust businesses which have stock replies, even if those replies offer to resolve the issue somehow.
March 15, 2025 at 8:57 PM