The output was >1M tokens, and it wasn't an easy task.
At least 70% of problems I've seen can be fixed by improving one of these areas.
bsky.app/profile/dan...
x.com/StellaLisy/...
Finally, about the random rewards improving benchmark performance: it's the clipping term.
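My gloss on the mechanism (hedged; this is my reading, not a claim lifted from the linked post): GRPO inherits PPO's clipped surrogate,

$$
\mathcal{L}_{\text{clip}}(\theta)=\mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right],
\qquad
r_t(\theta)=\frac{\pi_\theta(o_t\mid q,\,o_{<t})}{\pi_{\theta_{\text{old}}}(o_t\mid q,\,o_{<t})}.
$$

With random rewards, $\hat{A}_t$ is zero-mean noise, so in expectation the reward carries no information; but the clip is asymmetric about the old policy, so the updates that survive skew toward behaviors the model already assigns high probability, e.g. a base model's habit of reasoning in code, which happens to correlate with correct answers.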
1. Test code-as-reasoning pathways (no code interpreter, interleaved in thinking instead of as a toolcall or output).
2. Measure model aptitude and performance in thinking with code vs. without (a sketch of such an eval follows below).
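A minimal sketch of how I'd run that comparison, assuming nothing beyond a generic completion API (`call_model` is a hypothetical stand-in, not any particular client):

```python
# Compare "code interleaved in thinking" vs. plain NL reasoning,
# with NO interpreter in the loop: the code is never executed.
from typing import Callable

def call_model(prompt: str) -> str:
    """Hypothetical LLM call; swap in your actual client."""
    raise NotImplementedError

CODE_AS_REASONING = (
    "Think step by step. Where it helps, write short Python snippets "
    "inside your reasoning to organize the computation, but do not assume "
    "they are executed; trace them mentally. End with 'ANSWER: <value>'."
)
NL_ONLY = (
    "Think step by step in plain prose, without writing any code. "
    "End with 'ANSWER: <value>'."
)

def extract_answer(completion: str) -> str:
    # Take whatever follows the final 'ANSWER:' marker.
    return completion.rsplit("ANSWER:", 1)[-1].strip()

def accuracy(tasks: list[dict], system: str,
             model: Callable[[str], str] = call_model) -> float:
    # tasks: [{"question": str, "answer": str}, ...]
    hits = sum(
        extract_answer(model(f"{system}\n\n{t['question']}")) == t["answer"]
        for t in tasks
    )
    return hits / len(tasks)

# Usage: same tasks, same model; only the reasoning style differs.
# delta = accuracy(tasks, CODE_AS_REASONING) - accuracy(tasks, NL_ONLY)
```

The interesting quantity is the delta per task category, not the absolute scores.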
Most flows today treat NL as reasoning, code as execution, and structured data as an extraction format. That three-way split may itself be a source of problems.
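To make that split concrete, here's a caricature of the typical pipeline (every function name below is a hypothetical stand-in, not anyone's actual API):

```python
# NL only for reasoning, code only as an executed tool, JSON only at the end.
import json

def call_model(prompt: str) -> str:    # hypothetical LLM client
    raise NotImplementedError

def run_in_sandbox(code: str) -> str:  # hypothetical code executor
    raise NotImplementedError

def typical_flow(question: str) -> dict:
    # 1. Natural language is the only reasoning medium.
    plan = call_model(f"Reason in prose about how to solve: {question}")
    # 2. Code exists only to be executed, never to think with.
    code = call_model(f"Write Python that carries out this plan:\n{plan}")
    result = run_in_sandbox(code)
    # 3. Structure shows up only at the very end, as extraction.
    raw = call_model(
        f"Given the execution result {result!r}, "
        'reply with JSON of the form {"answer": ...}'
    )
    return json.loads(raw)
```

Each hop is a lossy serialization boundary: the reasoning never sees the code's intermediate state, and the structure never sees the reasoning.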
This is heavily relevant to the work we're doing, which involves repeatedly crossing the boundary between NL reasoning and code.