Nice work, @lafd.bsky.social #LAFD c/o @nbcla.bsky.social 👨🏼‍🚒
“Asking o1 to complete proofs in creative ways is effectively asking it to be a research colleague. The model doesn't have to get proofs right to be useful, it just has to help us be better researchers.”
Good example of utility that evals fail to capture.
They aren’t AGI, but will matter. www.oneusefulthing.org/p/what-just-...
GPT-4 got 37% at the start of 2024. o1 got 78%. o3 is 87.7%
I feel like Twitter, LinkedIn, Instagram, and TikTok have pushed a lot of people out of the habit of doing that, by penalizing shared links in their various "algorithms".
Bluesky doesn't have that misfeature, thankfully!
The evals are perhaps the final nail in the coffin for the scaling wall hypothesis, showing that AI models aren’t hitting a plateau in capabilities.
arcprize.org/blog/oai-o3-...
elevenlabs.io/blog/introdu...