Lightnews — Scholar-powered news

Michael Saxon

@saxon.me

🆕 from us at #EMNLP: Are LMs better at answering questions about Germany in German than in French? Is national knowledge linguistically contingent?

Interestingly, only for some multilingual models is this true. Aya knows China best in Chinese, but LLaMA's best in English always.

November 5, 2025 at 7:47 PM

Michael Saxon

@saxon.me

November 1, 2025 at 6:01 AM

Michael Saxon

@saxon.me

It's live! Here's an example post: saxon.me/blog/2025/la...

Turning the replies to a Bluesky post into the comment section for a blogpost is a small concrete way to support the ecosystem: future visitors who want to add comments are incentivized to interact on the platform

Also, it's very easy to do:
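A minimal sketch of the idea, not the code used on the blog (which isn't shown in this post): it assumes the public Bluesky AppView endpoint `app.bsky.feed.getPostThread` and its usual response shape (a `thread` object with `post` and nested `replies`). The helper names `thread_url`, `flatten_replies`, and `fetch_comments` are hypothetical, as is the example AT URI.

```python
import json
from typing import Any, Dict, List
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the unauthenticated public AppView host for Bluesky.
PUBLIC_API = "https://public.api.bsky.app/xrpc/app.bsky.feed.getPostThread"


def thread_url(at_uri: str, depth: int = 6) -> str:
    """Build the request URL for the thread rooted at the given post."""
    return f"{PUBLIC_API}?{urlencode({'uri': at_uri, 'depth': depth})}"


def flatten_replies(node: Dict[str, Any], level: int = 0) -> List[Dict[str, Any]]:
    """Walk the nested reply tree depth-first into a flat comment list."""
    comments: List[Dict[str, Any]] = []
    for reply in node.get("replies", []):
        post = reply.get("post")
        if post is None:  # blocked or deleted replies carry no post body
            continue
        comments.append({
            "level": level,  # nesting depth, useful for indentation
            "handle": post["author"]["handle"],
            "text": post["record"].get("text", ""),
        })
        comments.extend(flatten_replies(reply, level + 1))
    return comments


def fetch_comments(at_uri: str) -> List[Dict[str, Any]]:
    """Fetch a post's thread over the network and return its replies."""
    with urlopen(thread_url(at_uri)) as resp:
        return flatten_replies(json.load(resp)["thread"])


# Offline example with a hand-written thread in the assumed shape.
example_thread = {
    "post": {"author": {"handle": "blog.example"}, "record": {"text": "New post!"}},
    "replies": [
        {
            "post": {"author": {"handle": "a.example"}, "record": {"text": "Nice"}},
            "replies": [
                {"post": {"author": {"handle": "b.example"}, "record": {"text": "Agreed"}}}
            ],
        },
        {"blocked": True},
    ],
}
comments = flatten_replies(example_thread)
```

A static blog could call something like `fetch_comments` at build time and render each entry indented by `level`; a client-side version would make the same request from the browser.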

October 27, 2025 at 6:50 PM

Michael Saxon

@saxon.me

Prototyping Bluesky comment integrations for the blog (gonna need to modify a lot more to make it fully work with my template)

Also, I am getting more and more indiewebpilled. Would any other NLPMLAI researcher-bloggers be interested in making a webring?

October 27, 2025 at 7:27 AM

Michael Saxon

@saxon.me

I don't think this was malicious. There are real papers by the same authors.

(Canivez and Youngstrom, 2019) and (Wasserman, 2019) do exist. Problem is they have different titles and are in different journals.

Don't generate your references, folks!

October 18, 2025 at 12:54 AM

Michael Saxon

@saxon.me

The viral "Definition of AGI" paper tells you to read fake references which do not exist!

Proof: different articles present at the specified journal/volume/page number, and their titles exist nowhere on any searchable repository.

Take this as a warning to not use LMs to generate your references!

October 18, 2025 at 12:54 AM

Michael Saxon

@saxon.me

This one rocks

October 16, 2025 at 6:19 AM

Michael Saxon

@saxon.me

I lold

October 15, 2025 at 7:03 PM

Michael Saxon

@saxon.me

He's so Reddit it's unbearable 😭

October 15, 2025 at 5:38 PM

Michael Saxon

@saxon.me

Yeah...idk if I have it in me to listen to this

October 15, 2025 at 4:20 PM

Michael Saxon

@saxon.me

hey don't make fun of the galileo of LLMs

September 19, 2025 at 4:39 AM

Michael Saxon

@saxon.me

On T2I-generated images, it is good at predicting the judgments of human raters from 10 countries of an image’s relevance to their own culture compared to a set of simple baselines.

CAIRe can be used to grade the "stylistic aspects" of a fantasy entity, not just match real stuff 4/5

June 20, 2025 at 11:02 PM

Michael Saxon

@saxon.me

As far as we can tell, CAIRe works quite well. It is very performant at identifying the cultural origins of 𝗿𝗲𝗮𝗹, 𝗿𝗮𝗿𝗲 𝗲𝗻𝘁𝗶𝘁𝗶𝗲𝘀 based on many proxies, including country, region, religion, ethnicity, and even ancient civilizations.

3/5

June 20, 2025 at 11:02 PM

Michael Saxon

@saxon.me

Our metric CAIRe (Cultural Attribution of Images with Retrieval) scores an input image using image retrieval over a multimodal KG and LM likelihood scores over entry data to assign cultural relevance scores to 𝐚𝐧𝐲 set of cultural labels based on 𝐚𝐧𝐲 cultural proxy (not just countries!). 2/5

June 20, 2025 at 11:02 PM

Michael Saxon

@saxon.me

Multicultural text-to-image work requires costly, subjective human evaluation. Some of my projects have stalled because no automated, quantified "visual cultural attribution" metric existed.

BITS undergrads Siddharth and Arnav Yayavaram, @simi97k.bsky.social, @gneubig.bsky.social, and I made one. 1/

June 20, 2025 at 11:02 PM

Michael Saxon

@saxon.me

To be honest, I kinda love grok? (when it isn't being Elonbotomized to be a racism machine)

So many rightoid maniacs query it expecting to see their conspiracist beliefs echoed back at them only to repeatedly get gently corrected with factual information lmao

May 30, 2025 at 4:01 AM

Michael Saxon

@saxon.me

PSA for NAACL peeps from a southwest boi (sadly I won't be there): be sure to find a place to eat New Mexico style stacked enchiladas. You can get it "Christmas style" where it's served with both red and green hatch chile. The hatch chile is integral, do not skip. Not photogenic, but very delicious

April 29, 2025 at 6:56 AM

Michael Saxon

@saxon.me

I wondered if it could really be all that bad from the beginning. After all, users are signing up to publicly interact with each other on a forum. But woof, I don't think I would have signed off on this broad of a "the LM is allowed to impersonate this" policy

April 29, 2025 at 3:36 AM

Michael Saxon

@saxon.me

My stomach dropped when I saw the number of quotes and replies... and a lot of the replies are about as aggressive and facile as I expected. Note to self: don't use the phrase "a n t i - A I" in a post lol

April 26, 2025 at 8:01 PM

Michael Saxon

@saxon.me

Across multiple RMs, Terminator calibrates performance, getting near-optimal performance in significantly fewer tokens.

Most interestingly, our model-predicted deadlines find the OPTIMAL budget, near the plateau where further spend isn't beneficial.

In this way Terminator is a tool any RM can use!

April 21, 2025 at 11:21 PM

Michael Saxon

@saxon.me

Finally, we introduce Thought Terminator, our Schwarzeneggerian method for mitigating overthinking, which is a modified decoder that inserts interrupts every N tokens to tell the model how much compute it has left. Once that budget is spent it uses constrained decoding for budget forcing.
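A toy sketch of the decoding loop described here, not the paper's implementation: `next_token` is a hypothetical stand-in for the model, `force_answer` is a hypothetical stand-in for the constrained-decoding step, and the interrupt strings are made up for illustration.

```python
from typing import Callable, List


def decode_with_deadline(
    next_token: Callable[[List[str]], str],
    budget: int,
    interval: int,
    force_answer: Callable[[List[str]], List[str]],
    stop_token: str = "<eos>",
) -> List[str]:
    """Generate tokens, interrupting every `interval` tokens with the budget left."""
    context: List[str] = []
    spent = 0
    while spent < budget:
        token = next_token(context)
        if token == stop_token:
            return context  # the model finished on its own, under budget
        context.append(token)
        spent += 1
        # Periodic interrupt telling the model how much compute remains.
        if spent % interval == 0 and spent < budget:
            context.append(f"[{budget - spent} tokens remaining]")
    # Budget exhausted: stop free generation and force a final answer.
    context.append("[out of time, answer now]")
    context.extend(force_answer(context))
    return context


# A "model" that would think forever, capped at 4 tokens with interrupts every 2.
overthinker = decode_with_deadline(
    next_token=lambda ctx: "hmm",
    budget=4,
    interval=2,
    force_answer=lambda ctx: ["answer:", "4"],
)

# A "model" that stops after one token never sees an interrupt.
quick = decode_with_deadline(
    next_token=lambda ctx: "ok" if not ctx else "<eos>",
    budget=4,
    interval=2,
    force_answer=lambda ctx: ["unused"],
)
```

In this sketch the interrupt messages do not count against the budget; whether they should is a design choice the post does not specify.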

April 21, 2025 at 11:21 PM

Michael Saxon

@saxon.me

In order to sample a more balanced distribution of questions across the difficulty spectrum, we introduce DUMB500, the Waluigi to MATH500, which consists of stupid easy Qs.

This way we can get a more comprehensive view of overthinking, from the hardest GPQA and ZebraLogic Qs to literally "2+2=?"

April 21, 2025 at 11:21 PM

Michael Saxon

@saxon.me

Our measure of overthinking is stupid simple: what's the delta between the mean/max token spend on each question vs the minimum for successful answers.

There exists a clear trend between question difficulty (measured by success rates) and required spend.
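One way to read that measure, as a hedged sketch: for each question, take several sampled answers, find the cheapest correct one, and compare the mean and max spend against it. The function name, the `(token_count, is_correct)` sample format, and the example numbers are assumptions for illustration, not the paper's exact definition or data.

```python
from statistics import mean
from typing import Optional, Sequence, Tuple


def overthinking_delta(
    samples: Sequence[Tuple[int, bool]],
) -> Optional[Tuple[float, int]]:
    """Return (mean spend - floor, max spend - floor) for one question.

    `samples` holds (token_count, is_correct) pairs for repeated attempts.
    The floor is the fewest tokens used by any correct attempt; if no
    attempt was correct, the measure is undefined and None is returned.
    """
    correct_spend = [tokens for tokens, ok in samples if ok]
    if not correct_spend:
        return None
    floor = min(correct_spend)
    all_spend = [tokens for tokens, _ in samples]
    return mean(all_spend) - floor, max(all_spend) - floor


# Four made-up attempts at one question: three correct, one wrong and very long.
attempts = [(120, True), (400, True), (900, False), (180, True)]
deltas = overthinking_delta(attempts)  # floor is 120 tokens
```

Averaging these per-question deltas over a dataset would give a single score; that aggregation step is also an assumption here.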

April 21, 2025 at 11:21 PM

Michael Saxon

@saxon.me

Check out our new paper on benchmarking and mitigating overthinking in reasoning models!

From a simple observational measure of overthinking, we introduce Thought Terminator, a black-box, training-free decoding technique where RMs set their own deadlines and follow them

arxiv.org/abs/2504.13367

A deepseek whale about to overthink until the Terminator tells it to answer right away.

April 21, 2025 at 11:15 PM

Michael Saxon

@saxon.me

of all the days to have a planned, 15+ hour university-wide power outage

March 28, 2025 at 12:02 AM
