Full support from @nsfsimonscosmicai.bsky.social.
🌐: astrovisbench.github.io
📄: arxiv.org/abs/2505.20538
SOTA models, including Gemini 2.5 Pro, Claude Opus 4, o3-mini, and QwQ, crash 30-60% of the time and produce error-free visualizations in fewer than 16% of cases.
Processing tasks: we compare key variable values.
Visualizations: we use a VLM judge (well correlated w/ pro astronomers) that compares a visualization’s scientific utility to that of the ground truth.
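To make the processing check concrete, here is a minimal sketch of what "comparing key variable values" could look like. It is an illustration only, not the benchmark's actual code: the function names (`values_match`, `score_processing`), the tolerance, and the scoring rule (fraction of key variables that match) are all assumptions.

```python
import math
from collections.abc import Mapping
from typing import Any, Sequence


def values_match(expected: Any, actual: Any, rel_tol: float = 1e-6) -> bool:
    """Recursively compare two values, allowing a small tolerance for numbers.

    Hypothetical helper: the tolerance and type handling are assumptions.
    """
    # Treat booleans strictly so that True is never confused with 1.
    if isinstance(expected, bool) or isinstance(actual, bool):
        return type(expected) is type(actual) and expected == actual
    if isinstance(expected, (int, float)) and isinstance(actual, (int, float)):
        return math.isclose(expected, actual, rel_tol=rel_tol, abs_tol=1e-12)
    if isinstance(expected, (list, tuple)) and isinstance(actual, (list, tuple)):
        return len(expected) == len(actual) and all(
            values_match(e, a, rel_tol) for e, a in zip(expected, actual)
        )
    if isinstance(expected, Mapping) and isinstance(actual, Mapping):
        return expected.keys() == actual.keys() and all(
            values_match(expected[k], actual[k], rel_tol) for k in expected
        )
    return expected == actual


def score_processing(
    expected_vars: Mapping[str, Any],
    actual_vars: Mapping[str, Any],
    key_names: Sequence[str],
) -> float:
    """Return the fraction of key variables whose values match the ground truth."""
    if not key_names:
        return 0.0
    matched = sum(
        1
        for name in key_names
        if name in expected_vars
        and name in actual_vars
        and values_match(expected_vars[name], actual_vars[name])
    )
    return matched / len(key_names)


# Example: ground-truth variables vs. variables left behind by generated code.
expected = {"redshift": 0.5, "flux": [1.0, 2.0], "n_sources": 3}
actual = {"redshift": 0.5000000001, "flux": [1.0, 2.5], "n_sources": 3}
score = score_processing(expected, actual, ["redshift", "flux", "n_sources"])
```

Here `redshift` and `n_sources` match within tolerance while `flux` does not, so two of the three key variables count as correct.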