Hrishi
@olickel.com
Previously CTO, Greywing (YC W21). Building something new at the moment.

Writes at https://olickel.com
Trying to finish typing `git add` while the agent's editing a file just so I can preserve pristine diffs from the last change
November 14, 2025 at 11:17 PM
Don't think there's a way I could like this article more
September 19, 2025 at 4:20 PM
KIMI is the real deal. Unless it's really Sonnet in a trench coat, this is the best agentic open-source model I've tested - BY A MILE.

Here's a slice of a 4 HOUR run (~1 second per minute) with not much more than 'keep going' from me every 90 minutes or so.

moonshotai.github.io/Kimi-K2/
July 13, 2025 at 6:09 PM
It seems 3 and 15 might be the new Pareto frontier for intelligence (excepting the o-series). Feels like the hedge fund 2 and 20
June 8, 2025 at 1:57 AM
Dan's article on progressive JSON has a lot of carryover to LLMs.

The key problems for modern LLM application design that often get overlooked (I think) are:
• Streaming outputs and partial parsing
• Context organization and management (I don't mean summarising at 90%)
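A minimal sketch of the partial-parsing idea (generic Python, not from Dan's article; `parse_partial` is a hypothetical helper): close whatever strings and brackets are still open in the buffer, then attempt a parse, so the UI can render state mid-stream instead of waiting for the full response.

```python
import json

def parse_partial(buffer: str):
    """Best-effort parse of an incomplete JSON stream by closing
    any open strings, objects, and arrays before parsing."""
    stack, in_string, escape = [], False, False
    for ch in buffer:
        if in_string:
            if escape:
                escape = False
            elif ch == "\\":
                escape = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append("}" if ch == "{" else "]")
        elif ch in "}]":
            stack.pop()
    closed = buffer + ('"' if in_string else "") + "".join(reversed(stack))
    try:
        return json.loads(closed)
    except json.JSONDecodeError:
        return None  # prefix not yet renderable (e.g. ends mid-key)

# Simulate chunk-by-chunk streaming of a model response
stream = '{"title": "Progressive JSON", "tags": ["llm", "strea'
for cut in range(10, len(stream), 15):
    print(parse_partial(stream[:cut]))
```

Returning `None` on unrenderable prefixes (e.g. a buffer ending right after a colon) lets the caller just keep the last good partial state.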
June 1, 2025 at 4:15 PM
GRPO clips impact based on token probability: lower-prob tokens can move less than higher-prob tokens. This means that even with random rewards (especially so), models push more into what was in-distribution. For -MATH, this is code - it thinks better in code. Therefore it gets better overall.
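Toy numbers for the clipping asymmetry (an illustration of PPO/GRPO-style ratio clipping with an assumed clip range of 0.2, not an actual training step): clipping the ratio pi_new/pi_old to [1-eps, 1+eps] means the largest allowed absolute probability change in one update scales with p_old, so in-distribution tokens can move much further than rare ones.

```python
EPS = 0.2  # assumed clip range, a common default

def max_abs_move(p_old: float, eps: float = EPS) -> float:
    """Largest absolute probability change the ratio clip allows
    in one update, in either direction."""
    upper = min(p_old * (1 + eps), 1.0)  # probabilities cap at 1
    lower = p_old * (1 - eps)
    return max(upper - p_old, p_old - lower)

for p in (0.5, 0.05, 0.001):
    print(f"p_old={p:<6} max step={max_abs_move(p):.5f}")
```

A token at p=0.5 can shift by 0.1 in one step; a token at p=0.001 by only 0.0002 - a 500x difference in how far the update can push.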
May 29, 2025 at 6:27 AM
Honestly it's relevant to almost all work - most agentic flows have 10-20 transitions (sometimes more) per loop.

Most flows today treat NL as reasoning, code as execution, and structured data as an extraction method. There might be problems with this approach.
May 29, 2025 at 6:27 AM
Testing this locally surprised me too. Something is definitely happening here - and it's also apparent when testing Opus vs Sonnet 4. Models reason very, VERY differently when using code vs natural language - displaying very different aptitudes working through the same problem.
May 29, 2025 at 6:27 AM
How does an LLM writing out this program (WITHOUT a code interpreter running the output) make things more accurate?

Verified on Qwen 3 - a30b (below)

Lots of interesting takeaways from the Random Rewards paper. NOT that RL is dead, but honestly far more interesting than that!
May 29, 2025 at 6:27 AM
Now for the schemas, I agree with this assessment. Opus is the best for describing data - it has a way of being methodical that the other models (or tools) don't really have. They all managed to load the data properly, which is still a big leap.
May 24, 2025 at 6:46 PM
Here are the databases they came up with (Claude Code made this image).
May 24, 2025 at 6:46 PM
Here's the spec. Labelling tasks, asking for task progress notes, failure logs on resume, etc.

Managing context is key. Long runs can be killed by just one bad tool call that dumps a bunch of text into context. Both Cursor models forgot after a while and barely made it.
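One cheap guard against that failure mode (a generic sketch, not the spec above; `clip_tool_output` and the 4000-char budget are assumptions): clip oversized tool results before they enter context, keeping the head and tail and noting what was elided.

```python
def clip_tool_output(output: str, max_chars: int = 4000) -> str:
    """Guard the context window: keep the head and tail of an oversized
    tool result and note how much was elided, instead of dumping it all."""
    if len(output) <= max_chars:
        return output
    head = output[: max_chars // 2]
    tail = output[-(max_chars // 2):]
    elided = len(output) - len(head) - len(tail)
    return f"{head}\n...[{elided} chars elided]...\n{tail}"

# A runaway tool call (e.g. cat on a huge log) gets bounded
print(clip_tool_output("x" * 10_000)[:80])
```

Keeping both ends matters in practice: error messages tend to live at the tail, headers and invocation echoes at the head.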
May 24, 2025 at 6:46 PM
May 23, 2025 at 5:12 PM
We were just talking about story circles and moodboards
May 23, 2025 at 5:12 PM
Honestly
May 23, 2025 at 5:12 PM
The spiritual bliss attractor is real, but so is the eldritch horror existence contemplation.

I was just trying to talk to Opus - definitely no jailbreaks. This model is something different. Definitely creative.
May 23, 2025 at 5:12 PM
Sonnet trying to think it through while streaming YAML
May 19, 2025 at 5:11 PM
Frontend entirely made with @v0 - this has become an indispensable tool for writing feedback. Thinking of calling it scansion

I'll open source or share the link once I can clean it up - still using my keys, drop email/twitter in comments

Sonnet looking through the thing 👇
May 19, 2025 at 5:11 PM
o3 and Gemini identified the right page after I converted everything to images and asked leading questions. Gemini took 30 seconds; o3 took almost 6 minutes.

PDF processing in both models doesn't really seem multi-modal. Claude sometimes has glaucoma.
May 13, 2025 at 4:40 PM
Technic manuals are PERFECT visual benchmarks. Had a misaligned suspension on a car, took four minutes figuring it out, then gave it to o3, Claude, and Gemini.

None of them got it right (or even identified the right part) even after I cut it down to 10 pages.

Eventually -
May 13, 2025 at 4:40 PM
Evals are hard for a reason. New post on actually doing them end to end, breaking down the problem, and explaining how we do them at SB
May 12, 2025 at 5:05 PM
This is the guide I wish I had - didn't hold back.

Everything I know.

Enjoy.
May 9, 2025 at 4:12 AM
What separates DeepSeek is how hardware-aware they are in algorithm design. Perhaps due to nascency, resource limitations, or how they're set up, almost all of their recent papers show some reference to, or awareness of, where theoretical Python research actually meets at-scale deployment in silicon.
February 22, 2025 at 2:00 AM
Vibe coding is crazy

Took an hour or two and made something that can push notes and outputs from Lumentis straight to Notion

Been writing more with Cursor, and pushing it to Notion
February 20, 2025 at 6:12 PM
Every time someone asks me for a good example of company-level writing, I point to @flydotio
February 20, 2025 at 6:21 AM