Atharva Sehgal
aseg.bsky.social
PhD student at UT Austin working on program synthesis. Visiting student at Caltech.
How it works:
1️⃣ LLM proposes concepts per class
2️⃣ CLIP-style VLM scores them
3️⃣ Escher spots confused classes
4️⃣ Escher stores this in a history bank
5️⃣ LLM proposes better concepts and stores them → repeat
The loop is self-amplifying: better concepts ➡️ better feedback ➡️ an even better concept library.
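The five steps above can be sketched as a loop. This is a minimal illustration, not Escher's actual implementation: `llm_propose_concepts`, `vlm_score`, and `find_confused_pairs` are hypothetical stand-ins for the LLM, the CLIP-style VLM scorer, and the critic that spots confused classes.

```python
from collections import defaultdict

def escher_loop(classes, images, llm_propose_concepts, vlm_score,
                find_confused_pairs, n_rounds=5):
    history = defaultdict(list)  # 4) history bank: class -> past feedback
    # 1) LLM proposes an initial concept list per class
    library = {c: llm_propose_concepts(c, history[c]) for c in classes}
    for _ in range(n_rounds):
        # 2) CLIP-style VLM scores each class's concepts on the images
        scores = {c: vlm_score(library[c], images) for c in classes}
        # 3) spot pairs of classes the current concepts cannot tell apart
        confused = find_confused_pairs(scores)
        if not confused:
            break
        # 4) store the feedback in the history bank
        for a, b in confused:
            history[a].append(b)
            history[b].append(a)
        # 5) LLM proposes better concepts for the confused classes -> repeat
        for c in {c for pair in confused for c in pair}:
            library[c] = llm_propose_concepts(c, history[c])
    return library
```

The self-amplification lives in `history`: each round's feedback conditions the next round's proposals, so later concept libraries are built from strictly more information than earlier ones.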
June 13, 2025 at 2:44 PM
Our hypothesis: the failure arises because program synthesizers treat the vision model as a deterministic function. Reality is messy, and VLM outputs are stochastic. The LLM's assumptions about how the VLM will behave are decoupled from how it actually behaves. We need to overcome this decoupling.
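A toy illustration of the decoupling (not Escher's code): `noisy_vlm_score` is a hypothetical VLM whose per-call scores fluctuate around a true alignment score. A program that branches on a single call can behave unpredictably, while aggregating repeated calls recovers a stable signal.

```python
import random

def noisy_vlm_score(concept, image, rng, true_score=0.6, noise=0.3):
    # hypothetical VLM: the true alignment score corrupted by per-call noise
    return true_score + rng.uniform(-noise, noise)

rng = random.Random(0)

# a synthesizer that assumes determinism branches on one call like this;
# individual verdicts may flip from call to call
single_verdicts = [noisy_vlm_score("striped tail", "img", rng) > 0.5
                   for _ in range(10)]

# the mean over many calls concentrates near the true score (0.6 here)
mean = sum(noisy_vlm_score("striped tail", "img", rng)
           for _ in range(100)) / 100
```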
June 13, 2025 at 2:44 PM
A visual program decomposes complex perceptual reasoning problems into a logical combination of simpler perceptual tasks that can be solved using off-the-shelf vision foundation models. This provides a modular and robust framework, but finding the correct decomposition is still extremely hard.
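For concreteness, here is a toy visual program in that spirit (an illustration, not from the paper): a hard query is decomposed into a logical combination of simpler perceptual calls, where `detect` and `relation` are hypothetical wrappers around off-the-shelf vision foundation models.

```python
def person_walking_dog(image, detect, relation):
    """Answer 'is a person walking a dog?' by composing simple subtasks."""
    people = detect(image, "person")   # simple perceptual subtask 1
    dogs = detect(image, "dog")        # simple perceptual subtask 2
    # logical combination of the subtask results
    return any(relation(image, p, d, "holding leash of")
               for p in people for d in dogs)
```

The decomposition is modular (each subtask can be swapped for a better model), but as the post says, finding the right decomposition is the hard part.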
June 13, 2025 at 2:44 PM
Reasoning about these images is pretty hard. o3 – even with web access – can’t do this for us out of the box. In such a situation, writing programs provides a mechanism for dividing up a complex reasoning task into solvable subtasks. This motivates most of the visual programming literature.
June 13, 2025 at 2:44 PM
In many vision tasks, perceptual reasoning does not come naturally. Experts still have to deeply study an image, deduce relevant concepts, and reason about them in natural language (www.inaturalist.org/observations...). Our goal is to automate this process – with no human oversight.
June 13, 2025 at 2:44 PM
I’m presenting Escher (trishullab.github.io/escher-web) at #cvpr2025 Saturday morning (Poster Session #3). Escher builds a visual concept library with a vision‑language critic (no human labels needed). Swing by if you’d like to chat about program synthesis & multimodal reasoning!
June 13, 2025 at 2:44 PM
Just julia things.
February 13, 2025 at 5:39 AM
Arya and I will be at #NeurIPS presenting LaSR (trishullab.github.io/lasr-web/) on Wednesday morning, 11 AM to 2 PM PST (East Exhibit Hall A-C #4003). Drop by and say hi!
December 10, 2024 at 2:04 AM