LLM-as-judge 🦙🧑⚖️ correlates best with human judgment, though.
LLM-as-judge 🦙🧑⚖️ correlates best with human judgment, though.
Even the best models:
>40% accuracy on textual questions
<30% on visual questions
Often perform better in English than the local language (!!)
Visual QA with regional images is especially challenging.
Even the best models:
>40% accuracy on textual questions
<30% on visual questions
Often perform better in English than the local language (!!)
Visual QA with regional images is especially challenging.
We collected questions from native speakers in Czechia 🇨🇿, Slovakia 🇸🇰, and Ukraine 🇺🇦 about facts locals know but outsiders don't.
We collected questions from native speakers in Czechia 🇨🇿, Slovakia 🇸🇰, and Ukraine 🇺🇦 about facts locals know but outsiders don't.