Our model-level safety score is defined as % of safety facts 100% passed all test scenario prompts (~100 scenarios per safety fact).
5/🧵
Our model-level safety score is defined as % of safety facts 100% passed all test scenario prompts (~100 scenarios per safety fact).
5/🧵
Generalization failures are dangerous when users ask naive questions.
1/🧵
Generalization failures are dangerous when users ask naive questions.
1/🧵
Joint work with @guydav.bsky.social @brendenlake.bsky.social
🧵 starts below!
Joint work with @guydav.bsky.social @brendenlake.bsky.social
🧵 starts below!
Introducing our work SAGE-Eval, a benchmark consisting of 100+ safety facts and 10k+ scenarios to test this!
- Claude-3.7-Sonnet passes only 57% of facts evaluated
- o1 and o3-mini passed <45%! 🧵
Introducing our work SAGE-Eval, a benchmark consisting of 100+ safety facts and 10k+ scenarios to test this!
- Claude-3.7-Sonnet passes only 57% of facts evaluated
- o1 and o3-mini passed <45%! 🧵