Michael Saxon
@saxon.me
Doctor of NLP/Vision+Language from UCSB
Evals, metrics, multilinguality, multiculturality, multimodality, and (dabbling in) reasoning
https://saxon.me/
🆕 from us at #EMNLP: Are LMs better at answering questions about Germany in German than in French? Is national knowledge linguistically contingent?
Interestingly, only for some multilingual models is this true. Aya knows China best in Chinese, but LLaMA's best in English always.
November 5, 2025 at 7:47 PM
It's live! Here's an example post: saxon.me/blog/2025/la...
Turning the replies to a Bluesky post into the comment section for a blogpost is a small, concrete way to support the ecosystem: future visitors who want to add comments are incentivized to interact on the platform
Also, it's very easy to do:
October 27, 2025 at 6:50 PM
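As a rough illustration of how replies to a post could become blog comments, here is a minimal sketch. It assumes the public `app.bsky.feed.getPostThread` endpoint and the usual `thread` / `post` / `replies` response layout; the helper names `thread_url` and `flatten_replies` are hypothetical, and this is not the integration used on the blog (which is not shown in the post and is more likely client-side).

```python
from urllib.parse import urlencode

# Public, unauthenticated AppView endpoint (assumed to stay available).
API = "https://public.api.bsky.app/xrpc/app.bsky.feed.getPostThread"


def thread_url(at_uri, depth=6):
    """Build the request URL for a post's reply thread."""
    return f"{API}?{urlencode({'uri': at_uri, 'depth': depth})}"


def flatten_replies(node, level=0):
    """Walk a thread node and return (nesting level, handle, text) per reply."""
    out = []
    for reply in node.get("replies", []):
        post = reply.get("post")
        if not post:
            # Blocked or deleted replies come back as placeholders without a post.
            continue
        out.append((level, post["author"]["handle"], post["record"].get("text", "")))
        out.extend(flatten_replies(reply, level + 1))
    return out
```

Fetching `thread_url(...)` and passing the `thread` field of the JSON response to `flatten_replies` gives a list that a template can render as nested comments.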
Prototyping Bluesky comment integrations for the blog (gonna need to modify a lot more to make it fully work with my template)
Also, I am getting more and more indiewebpilled. Would any other NLPMLAI researcher-bloggers be interested in making a webring?
October 27, 2025 at 7:27 AM
I don't think this was malicious. There are real papers by the same authors.
(Canivez and Youngstrom, 2019) and (Wasserman, 2019) do exist. Problem is they have different titles and are in different journals.
Don't generate your references folks!
October 18, 2025 at 12:54 AM
The viral "Definition of AGI" paper tells you to read fake references which do not exist!
Proof: different articles present at the specified journal/volume/page number, and their titles exist nowhere on any searchable repository.
Take this as a warning to not use LMs to generate your references!
October 18, 2025 at 12:54 AM
He's so Reddit it's unbearable 😭
October 15, 2025 at 5:38 PM
Yeah...idk if I have it in me to listen to this
October 15, 2025 at 4:20 PM
hey don't make fun of the galileo of LLMs
September 19, 2025 at 4:39 AM
On T2I-generated images, it is good at predicting the judgments of human raters from 10 countries of an image’s relevance to their own culture compared to a set of simple baselines.
CAIRe can be used to grade the "stylistic aspects" of a fantasy entity, not just match real stuff 4/5
June 20, 2025 at 11:02 PM
As far as we can tell, CAIRe works quite well. It is very performant at identifying the cultural origins of 𝗿𝗲𝗮𝗹, 𝗿𝗮𝗿𝗲 𝗲𝗻𝘁𝗶𝘁𝗶𝗲𝘀 based on many proxies, including country, region, religion, ethnicity, and even ancient civilizations.
3/5
June 20, 2025 at 11:02 PM
Our metric CAIRe (Cultural Attribution of Images with Retrieval) scores an input image using image retrieval over a multimodal KG and LM likelihood scores over entry data to assign cultural relevance scores to 𝐚𝐧𝐲 set of cultural labels based on 𝐚𝐧𝐲 cultural proxy (not just countries!). 2/5
June 20, 2025 at 11:02 PM
Multicultural text-to-image work requires costly, subjective human evaluation. Some of my projects have stalled because no automated, quantified "visual cultural attribution" metric existed.
BITS undergrads Siddharth and Arnav Yayavaram, @simi97k.bsky.social, @gneubig.bsky.social, and I made one. 1/
June 20, 2025 at 11:02 PM
To be honest, I kinda love grok? (when it isn't being Elonbotomized to be a racism machine)
So many rightoid maniacs query it expecting to see their conspiracist beliefs echoed back at them only to repeatedly get gently corrected with factual information lmao
May 30, 2025 at 4:01 AM
PSA for NAACL peeps from a southwest boi (sadly I won't be there): be sure to find a place to eat New Mexico style stacked enchiladas. You can get it "Christmas style" where it's served with both red and green Hatch chile. The Hatch chile is integral, do not skip. Not photogenic, but very delicious
April 29, 2025 at 6:56 AM
I wondered if it could really be all that bad from the beginning, after all users are signing up to publicly interact with each other on a forum but woof, I don't think I would have signed off on this broad of a "the LM is allowed to impersonate this" policy
April 29, 2025 at 3:36 AM
My stomach dropped when I saw the amount of quotes and replies... and a lot of the replies are about as aggressive and facile as I expected. note to self don't use the phrase "a n t i - A I" in a post lol
April 26, 2025 at 8:01 PM
Across multiple RMs, Terminator calibrates performance, getting near-optimal performance in significantly fewer tokens.
Most interestingly, our model-predicted deadlines find the OPTIMAL budget, near the plateau where further spend isn't beneficial
In this way Terminator is a tool any RM can use!
April 21, 2025 at 11:21 PM
Finally, we introduce Thought Terminator, our Schwarzeneggerian method for mitigating overthinking: a modified decoder that inserts interrupts every N tokens to tell the model how much compute it has left. Once that budget is spent, it uses constrained decoding for budget forcing.
April 21, 2025 at 11:21 PM
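The interrupt-and-deadline idea in the post above can be illustrated with a toy loop. This is only a sketch under loud assumptions: `next_token` is a hypothetical stand-in for a real model's decoding step, the interrupt wording is invented, and appending a fixed answer prefix stands in for the constrained decoding the post mentions. It is not the paper's implementation.

```python
def decode_with_deadline(next_token, budget, interval, answer_prefix="Final answer:"):
    """Toy budget-aware decoding loop.

    next_token: hypothetical callable that takes the transcript so far
                (a list of strings) and returns the next token, or None
                when the model stops on its own.
    budget:     maximum number of model-generated tokens allowed.
    interval:   insert a "tokens left" interrupt every `interval` tokens.
    """
    transcript = []
    spent = 0
    while spent < budget:
        tok = next_token(transcript)
        if tok is None:
            return transcript  # model finished within its budget
        transcript.append(tok)
        spent += 1
        # Interrupts are appended to the transcript so the model sees them.
        if spent % interval == 0 and spent < budget:
            transcript.append(f"[{budget - spent} tokens left]")
    # Budget spent: force the model toward an answer (stand-in for
    # constrained decoding).
    transcript.append(answer_prefix)
    return transcript
```

In this sketch the interrupts do not count against the budget; whether they should is a design choice the post does not specify.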
In order to sample a more balanced distribution of questions across the difficulty spectrum, we introduce DUMB500, the Waluigi to MATH500 which consists of stupid easy Qs.
This way we can get a more comprehensive view of overthinking, from the hardest GPQA and ZebraLogic Qs to literally "2+2=?"
April 21, 2025 at 11:21 PM
Our measure of overthinking is stupid simple: what's the delta between the mean/max token spend on each question vs the minimum for successful answers.
There exists a clear trend between question difficulty (measured by success rates) and required spend.
April 21, 2025 at 11:21 PM
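The overthinking measure described in the post above can be sketched in a few lines. This assumes several sampled attempts per question, each recorded as a token count plus a correctness flag, and it assumes the mean and max are taken over all attempts (the post does not say whether failed attempts are included). The function name and data layout are illustrative, not taken from the paper.

```python
from statistics import mean


def overthinking_deltas(attempts):
    """Compute overthinking deltas for ONE question.

    attempts: list of (tokens_spent, is_correct) pairs.
    Returns (mean_delta, max_delta) relative to the cheapest correct
    attempt, or None if no attempt was correct.
    """
    correct_spend = [tokens for tokens, ok in attempts if ok]
    if not correct_spend:
        return None  # no successful answer, so no minimum to compare against
    floor = min(correct_spend)
    spends = [tokens for tokens, _ in attempts]
    return mean(spends) - floor, max(spends) - floor


samples = [(120, True), (480, True), (900, False), (300, True)]
print(overthinking_deltas(samples))
```

A large gap between typical spend and the cheapest successful attempt suggests the model is using far more tokens than the question needs.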
Check out our new paper on benchmarking and mitigating overthinking in reasoning models!
From a simple observational measure of overthinking, we introduce Thought Terminator, a black-box, training-free decoding technique where RMs set their own deadlines and follow them
arxiv.org/abs/2504.13367
April 21, 2025 at 11:15 PM
of all the days to have a planned, 15+ hour university-wide power outage
March 28, 2025 at 12:02 AM