Atharva Sehgal
@aseg.bsky.social
PhD student at UT Austin working on program synthesis. Visiting student at Caltech.
Check out the full paper for the mathematical formulation, experiments, and our methodology: arxiv.org/abs/2504.00185
Code and other artifacts are available here: trishullab.github.io/escher-web/
Thank you for following along!
[Link: "Self-Evolving Visual Concept Library using Vision-Language Critics" (arxiv.org)]
June 13, 2025 at 2:44 PM
How it works:
1️⃣ LLM proposes concepts per class
2️⃣ CLIP-style VLM scores them
3️⃣ Escher spots confused classes
4️⃣ Escher stores this in a history bank
5️⃣ LLM proposes better concepts and stores them → repeat
The loop is self-amplifying: better concepts ➡️ better feedback ➡️ an even better concept library.
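To make the loop concrete, here is a minimal Python sketch. Every helper in it (llm_propose_concepts, vlm_score, top_confused_pairs, llm_refine_concepts) is a hypothetical stand-in, not the paper's API; the actual implementation is in the linked repo.

```python
# A minimal sketch of the Escher loop. All helpers are hypothetical
# stand-ins for the components described in the thread above.

def escher_loop(classes, images, labels, n_iters=10):
    # 1) LLM proposes an initial concept list per class
    library = {c: llm_propose_concepts(c) for c in classes}
    history = []  # 4) history bank of past feedback
    for _ in range(n_iters):
        # 2) CLIP-style VLM scores every image against every concept
        scores = vlm_score(images, library)
        # 3) find the class pairs the current library confuses most
        confused = top_confused_pairs(scores, labels)
        history.append(confused)
        # 5) LLM proposes better concepts for the confused classes
        for a, b in confused:
            library[a] = llm_refine_concepts(a, b, library[a], history)
            library[b] = llm_refine_concepts(b, a, library[b], history)
    return library
```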
June 13, 2025 at 2:44 PM
Escher solves this problem by using feedback from a vision-language model to improve its reasoning, specifically for fine-grained image classification.
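For the scoring side, here is a minimal sketch using Hugging Face's CLIP. Scoring a class as the mean image-text similarity over its concept descriptions is one common choice (in the classification-by-description style), and is an assumption here rather than the paper's exact aggregation rule.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def class_scores(image, library):
    """Score a PIL image against each class's concept list; a class's
    score is the mean similarity over its concept descriptions."""
    results = {}
    for cls, concepts in library.items():
        inputs = processor(text=concepts, images=image,
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            sims = model(**inputs).logits_per_image  # shape (1, n_concepts)
        results[cls] = sims.mean().item()
    return results
```

The predicted label is then max(scores, key=scores.get); the feedback signal comes from where these predictions disagree with the ground-truth labels.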
June 13, 2025 at 2:44 PM
Our hypothesis: the failure arises from program synthesizers treating the vision model as a deterministic function. Reality is messy and VLM outputs are stochastic: the LLM's assumptions about how the VLM will behave are decoupled from how it actually behaves. We need to overcome this decoupling.
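One way to bridge the decoupling is to measure VLM behavior rather than assume it. A sketch of that idea, where score_fn(image, text) is any CLIP-style similarity function (an assumed interface, not the paper's):

```python
# Treat the VLM as a stochastic scorer, not a deterministic oracle.
# Instead of trusting the LLM's belief that a concept is discriminative,
# measure its empirical margin on real images and report that back.

def concept_report(concept, pos_images, neg_images, score_fn):
    pos = [score_fn(img, concept) for img in pos_images]  # images of the class
    neg = [score_fn(img, concept) for img in neg_images]  # images of confused classes
    margin = sum(pos) / len(pos) - sum(neg) / len(neg)
    # A concept the LLM assumed was discriminative may show ~zero margin
    # in practice; surfacing that gap is what closes the feedback loop.
    return {"concept": concept, "observed_margin": margin}
```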
June 13, 2025 at 2:44 PM
A visual program decomposes complex perceptual reasoning problems into a logical combination of simpler perceptual tasks that can be solved using off-the-shelf vision foundation models. This provides a modular and robust framework, but finding the correct decomposition is still extremely hard.
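As a toy example of such a decomposition, here is a fine-grained bird query split into localization and attribute checks. detect and has_attribute are hypothetical wrappers around, say, an open-vocabulary detector and a CLIP-style scorer; they are illustrations, not the paper's primitives.

```python
def is_red_winged_blackbird(image):
    # Subtask 1: localize the bird with an open-vocabulary detector.
    bird = detect(image, "bird")
    if bird is None:
        return False
    # Subtask 2: check fine-grained attributes on the crop, combined
    # with ordinary boolean logic.
    return (has_attribute(bird, "glossy black body") and
            has_attribute(bird, "red and yellow shoulder patch"))
```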
June 13, 2025 at 2:44 PM
Reasoning about these images is pretty hard. o3 – even with web access – can’t do this for us out of the box. In such a situation, writing programs provides a mechanism for dividing up a complex reasoning task into solvable subtasks. This motivates most of the visual programming literature.
June 13, 2025 at 2:44 PM
In many vision tasks, perceptual reasoning does not come naturally. Experts still have to deeply study an image, deduce relevant concepts, and reason about them in natural language (www.inaturalist.org/observations...). Our goal is to automate this process – with no human oversight.
June 13, 2025 at 2:44 PM
Massive thanks to my co-authors Patrick Yuan, Ziniu Hu, @yisongyue.bsky.social, Jennifer J. Sun & @swarat.bsky.social for making this possible!
June 13, 2025 at 2:44 PM
Check out the full paper for the mathematical formulation, LLM scaling law experiments, and our methodology: arxiv.org/abs/2409.09359

More context here: x.com/atharva_sehg...

Thank you to all my coauthors: Arya, Omar, @milescranmer.bsky.social, and @swarat.bsky.social!
December 10, 2024 at 2:10 AM