Hrishi
@olickel.com
Previously CTO, Greywing (YC W21). Building something new at the moment.

Writes at https://olickel.com
What do I mean by agentic model? Very simply put, it's the ability to hold macro instructions in view across an increasing number of turns, and to use primary tools (read, write, edit, shell) consistently without getting lost. An added bonus is the ability to learn from mistakes further up the chain!
July 13, 2025 at 6:09 PM
Editing multiple files, reading new context, maintaining agentic state (not forgetting where you were), and not dropping instructions. This is a repo with included prompts, notes, and plans - lots of things to mistake for context and be poisoned by.

The output was >1M tokens, and it wasn't an easy task.
July 13, 2025 at 6:09 PM
Definitely going to make one! I'm looking forward to re-reading the report and making one about what we learned
June 2, 2025 at 2:44 PM
This looks really interesting btw, haven't tried yet

bsky.app/profile/jps...
Justin Schroeder (@jpschroeder.com)
I wrote a library to scratch this itch: https://jsonreader.formkit.com
June 1, 2025 at 4:15 PM
Oh and NL vs Code outputs.

At least 70% of the problems I've seen can be fixed by improving one of these areas.

bsky.app/profile/dan...
June 1, 2025 at 4:15 PM
Full paper and results - also can we normalise releasing preprints in Notion? So much easier to read, annotate and understand!
x.com/StellaLisy/...
May 29, 2025 at 6:27 AM
GRPO clips update size based on token probability: lower-probability tokens can move less than higher-probability ones. This means that even with random rewards (especially so), models push further into what was already in-distribution. For Qwen-Math, that's code - it thinks better in code, and therefore gets better overall.
May 29, 2025 at 6:27 AM
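A minimal numeric sketch of that clipping asymmetry (the epsilon value and helper are illustrative, not from the thread):

```python
# Sketch of a GRPO/PPO-style ratio clip, assuming eps = 0.2.
# The per-token update uses min(r * A, clip(r, 1 - eps, 1 + eps) * A)
# with r = p_new / p_old, so the *relative* change is capped. In
# absolute terms, low-probability tokens can therefore move far less.

def max_unclipped_prob(p_old: float, eps: float = 0.2) -> float:
    """Largest p_new before the ratio p_new / p_old hits the clip bound."""
    return min(1.0, p_old * (1.0 + eps))

for p_old in (0.9, 0.5, 0.05):
    p_new = max_unclipped_prob(p_old)
    print(f"p_old={p_old:.2f} -> max p_new={p_new:.3f} (gain {p_new - p_old:+.3f})")

# p_old=0.90 -> max p_new=1.000 (gain +0.100)
# p_old=0.50 -> max p_new=0.600 (gain +0.100)
# p_old=0.05 -> max p_new=0.060 (gain +0.010)
```

Repeated updates under this cap concentrate probability mass on tokens that were already likely, regardless of whether the reward was meaningful.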
3. Find new ways of measuring reward signals from NL reasoning, instead of switching to code or structured output for measurement (which can corrupt results).

Finally, about the random rewards improving benchmark performance: It's the clipping term.
May 29, 2025 at 6:27 AM
Actionable takeaways for us:
1. Test code-as-reasoning pathways (no code interpreter; interleaved in the thinking itself rather than as a tool call or output) - rough sketch below.
2. Measure model aptitude and performance when thinking with code vs without.
May 29, 2025 at 6:27 AM
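A rough sketch of what testing takeaways 1 and 2 together could look like. The prompt wording and the `complete` callable are hypothetical stand-ins, not an existing API:

```python
# Hypothetical harness for code-as-reasoning: the model is asked to think
# in code inline, but nothing is executed. `complete` stands in for
# whatever LLM client you use.

CODE_REASONING_PROMPT = """Reason through the problem below.
Whenever a step is easier to express as code, write a short Python
snippet inline in your reasoning instead of prose. Do not call tools;
no code will be executed. Finish with: ANSWER: <your answer>.

Problem: {problem}"""

NL_REASONING_PROMPT = """Reason through the problem below step by step
in plain prose. Finish with: ANSWER: <your answer>.

Problem: {problem}"""

def compare_pathways(problem: str, complete) -> dict[str, str]:
    """Run the same problem down both pathways so aptitude can be compared."""
    return {
        "code_as_reasoning": complete(CODE_REASONING_PROMPT.format(problem=problem)),
        "nl_reasoning": complete(NL_REASONING_PROMPT.format(problem=problem)),
    }
```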
Honestly it's relevant to almost all work - most agentic flows have 10-20 transitions (sometimes more) per loop.

Most flows today treat NL as reasoning, code as execution, and structured data as an extraction method. There might be problems with this approach.
May 29, 2025 at 6:27 AM
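For concreteness, a sketch of that common division of labor; `llm`, `sandbox`, and the schema are hypothetical stand-ins:

```python
# Hypothetical agent step illustrating the pattern described above:
# NL for reasoning, code for execution, structured data for extraction.
# Each boundary crossing is a chance to lose reasoning quality.

import json

def agent_step(task: str, llm, sandbox) -> dict:
    # 1. NL as reasoning: the model thinks in prose.
    plan = llm(f"Think step by step about how to solve: {task}")

    # 2. Code as execution: a separate call emits code, run in a sandbox.
    code = llm(f"Write Python that carries out this plan:\n{plan}")
    result = sandbox(code)

    # 3. Structured data as extraction: a final call squeezes the
    #    outcome into JSON.
    extracted = llm(
        f"Given this result:\n{result}\n"
        'Return JSON: {"answer": ..., "confidence": ...}'
    )
    return json.loads(extracted)
```

With 10-20 such transitions per loop, any quality drop at the NL/code boundary compounds quickly.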
Trying different problems on multiple models, I see a distinct difference in answer and reasoning quality between code and NL.

This is heavily relevant to the work we're doing, which involves transitioning between NL reasoning and code boundaries repeatedly.
May 29, 2025 at 6:27 AM
Testing this locally surprised me too. Something is definitely happening here - and it's also apparent when testing Opus vs Sonnet 4. Models reason very, VERY differently when using code vs natural language, displaying very different aptitudes when working through the same problem.
May 29, 2025 at 6:27 AM
Now for the schemas, I agree with this assessment. Opus is the best for describing data - it has a way of being methodical that the other models (or tools) don't really have. They all managed to load the data properly, which is still a big leap.
May 24, 2025 at 6:46 PM
Here are the databases they came up with (Claude Code made this image).
May 24, 2025 at 6:46 PM