🔗 jouisseuse.github.io
Really appreciate @elisakreiss.bsky.social’s kind guidance and encouragement throughout this work 🙏
📄 Paper: arxiv.org/abs/2509.04373
💻 Code: github.com/jouisseuse/B...
This suggests that LLM benchmark behavior may generalize less and less to non-benchmark settings, raising new concerns about ecological validity.