Michael Saxon
@saxon.me
Doctor of NLP/Vision+Language from UCSB
Evals, metrics, multilinguality, multiculturality, multimodality, and (dabbling in) reasoning
https://saxon.me/
Pinned
Michael Saxon
@saxon.me
· 5d
🆕 from us at #EMNLP: Are LMs better at answering questions about Germany in German than in French? Is national knowledge linguistically contingent?
Interestingly, only for some multilingual models is this true. Aya knows China best in Chinese, but LLaMA's best in English always.
Humanity is nothing without its humanity
November 11, 2025 at 10:00 AM
Reposted by Michael Saxon
Thanks to @uwnews.uw.edu for covering my + @aylincaliskan.bsky.social's recent work published at AIES 2025! www.washington.edu/news/2025/11...
People mirror AI systems’ hiring biases, study finds
In a new UW study, 528 participants worked with simulated AI systems to select job candidates. The researchers simulated different levels of racial biases for resumes from white, Black, Hispanic and.....
www.washington.edu
November 10, 2025 at 7:28 PM
Guys I'm really worried about the threat of superintelligent AI, and wouldn't you know it, the best way to stop it is gonna be for you to give me a whole lotta money for my startup
November 10, 2025 at 6:35 AM
More than choosing good project ideas, to me "research taste" means recognizing what the interesting part of a result is and how it connects to a bigger narrative. Almost any nontrivial result can be important within the right lens.
More than anything my PhD taught me this.
November 5, 2025 at 8:25 PM
🆕 from us at #EMNLP: Are LMs better at answering questions about Germany in German than in French? Is national knowledge linguistically contingent?
Interestingly, only for some multilingual models is this true. Aya knows China best in Chinese, but LLaMA's best in English always.
November 5, 2025 at 7:47 PM
Beautiful Blooj tears. Mariners will not have to feel the "should have been us" pain
November 2, 2025 at 4:23 AM
Hate da Dodgers but also Bloojays need to pay for knocking out Seattle. I'll be smug whoever loses.
November 2, 2025 at 4:11 AM
Very pro dislike button. I think the asymmetry that comes from being able to leave drive-by approval (likes) but only high-engagement disapproval (comments) raises the temperature of negative interactions, and incentivizes ragebaiting with less visible shame
Threads on Bluesky should feel more like conversations you’d have IRL.
We’re testing new systems to improve reply quality. See what’s coming: bsky.social/about/blog/1...
Progress Update: Building Healthier Social Media - Bluesky
Over the next few months, we’ll be iterating on the systems that make Bluesky a better place for healthy conversations. Some experiments will stick, others will evolve, and we’ll share what we learn a...
bsky.social
October 31, 2025 at 11:54 PM
I didn't realize arXiv is a postprint server
blog.arxiv.org/2025/10/31/a...
FYI the blog post for the updated policy is out. Our llm future is dire:/
October 31, 2025 at 7:55 PM
It's #NSF #GRFP application season again so it's time to re-up my GRFP application advice post!
Also, check out the cool bsky comment integration I've added to the blog! Engagement with this post will go under the blogpost on my site as comments!
saxon.me/blog/2024/gr...
NSF GRFP Application Tips for NLP, AI, CS
Reflections and advice from my successful NSF GRFP proposal in NLP. Why I think my applications worked well, what I wish I did differently, and links to my actual statements and feedback from the GRFP...
saxon.me
October 30, 2025 at 8:03 PM
"It's country over club as Aaron Judge will lead team USA in the World Baseball Classic next March, assuming this game will be over by then" 💀
October 28, 2025 at 6:00 AM
Reposted by Michael Saxon
fukuyama was right. history ended. we are stuck inside this game forever
October 28, 2025 at 5:28 AM
WAITER? ANOTHER INNING OF NOTHING PLEASE
October 28, 2025 at 5:00 AM
It's live! Here's an example post: saxon.me/blog/2025/la...
Turning the replies to a bluesky post into the comment section for a blogpost is a small concrete way to support the ecosystem: future visitors who want to add comments are incentivized to interact on the platform
Also, it's very easy to do:
October 27, 2025 at 6:50 PM
Stage four: The sign bears no relation to any reality whatsoever; it is its own pure simulacrum
youtu.be/6i2I3dkZ5-M
Bush Step! (JibJab)
YouTube video by pipo
youtu.be
October 27, 2025 at 4:04 PM
Prototyping bluesky comment integrations for the blog (gonna need to modify a lot more to make it fully work with my template)
Also, I am getting more and more indiewebpilled. Would any other NLPMLAI researcher-bloggers be interested in making a webring?
October 27, 2025 at 7:27 AM
A big part of why talking about LMs vs humans is hard is interlocutors often don't agree if they're considering idealized, worst-case, or "average" humans or LMs
Personally, I think idealized human vs average LM is the most germane comparison to use when thinking about capabilities
October 26, 2025 at 7:31 AM
Reposted by Michael Saxon
Federal Judges— or staff, but same-same— used "AI" to summarize & draft rulings, issued them w/o checking the work, leading to basic factual errors, & thus undermining the facticity & validity of the rulings entire.
Gee. Who Could Have Foreseen. *stares directly into the camera like in the office*
Two federal judges say use of AI led to errors in US court rulings
Two federal judges admitted in response to an inquiry by U.S. Senate Judiciary Committee Chairman Chuck Grassley that members of their staff used artificial intelligence to help prepare recent court orders that Grassley called "error-ridden."
www.reuters.com
October 26, 2025 at 4:47 AM
It should feel pointless to care about this as our republic crumbles, but somehow this shit does still make my blood boil
www.404media.co/a16z-backed-...
a16z-Backed Startup Sells Thousands of ‘Synthetic Influencers’ to Manipulate Social Media as a Service
Andreessen Horowitz is funding a company that clearly violates the inauthentic behavior policies of every major social media platform.
www.404media.co
October 25, 2025 at 6:21 AM
nightmare blunt rotation
The list of signatories on the latest "Please Skynet, don't kill us" letter is BONKERS.
October 22, 2025 at 8:31 PM
Reposted by Michael Saxon
🚨New paper: Reward Models (RMs) are used to align LLMs, but can they be steered toward user-specific value/style preferences?
With EVALUESTEER, we find even the best RMs we tested exhibit their own value/style biases, and are unable to align with a user >25% of the time. 🧵
October 14, 2025 at 3:59 PM
Reposted by Michael Saxon
Happy to share that I’m presenting 3 research projects at AIES 2025 🎉
1️⃣Gender bias over-representation in AI bias research 👫
2️⃣Stable Diffusion's skin tone bias 🧑🏻🧑🏽🧑🏿
3️⃣Limitations of human oversight in AI hiring 👤🤖
Let's chat if you’re at AIES or read below/reach out for details!
#AIES25 #AcademicSky
October 21, 2025 at 11:39 AM
we need to finally replace Susan Collins with a Democrat!
*The monkey's paw curls*
October 21, 2025 at 7:37 PM
Reposted by Michael Saxon