SOTA models including Gemini 2.5 Pro, Claude Opus 4, o3-mini and QwQ crash 30-60% of the time and only produce visualizations without error in less than 16% of the cases.
SOTA models including Gemini 2.5 Pro, Claude Opus 4, o3-mini and QwQ crash 30-60% of the time and only produce visualizations without error in less than 16% of the cases.
Processing tasks: we compare key variable values.
Visualizations: we use a VLM judge (well correlated w/ pro astronomers) that compares a visualization’s scientific utility to that of the ground truth.
Processing tasks: we compare key variable values.
Visualizations: we use a VLM judge (well correlated w/ pro astronomers) that compares a visualization’s scientific utility to that of the ground truth.
AstroVisBench tests how well LLMs implement scientific workflows in astronomy and visualize results.
SOTA models like Gemini 2.5 Pro & Claude 4 Opus only match ground truth scientific utility 16% of the time. 🧵
AstroVisBench tests how well LLMs implement scientific workflows in astronomy and visualize results.
SOTA models like Gemini 2.5 Pro & Claude 4 Opus only match ground truth scientific utility 16% of the time. 🧵