Shikhar Murty
shikharmurty.bsky.social
Final year PhD Student in Computer Science @Stanford

Work on:
- Compositionality, syntax (language structure)
- Web Agents: Synthetic data, tree search, exploration (language interpretation)
“causal intervention” as defined in \citep{}…
February 14, 2025 at 11:41 PM
controlling a browser / computer!
but requires a bit more tooling to set it up.
February 6, 2025 at 7:00 PM
Please check out our paper for more details: arxiv.org/pdf/2410.02907

And our code if you want a NNetNav-ed model for your own domain:
github.com/MurtyShikhar...

Done with collaborators: @zhuhao.me, Dzmitry Bahdanau and @chrmanning.bsky.social
February 6, 2025 at 5:43 PM
We find that cross-website robustness is limited, and performance almost always improves when we incorporate in-domain NNetNav data. This makes it even more important to work on unsupervised learning for agents - how are you going to collect human data for *any* website? [6/n]
February 6, 2025 at 5:43 PM
We use this data for SFT-ing Llama-3.1-8B. Our best models outperform zero-shot GPT-4 on both WebArena and WebVoyager, and reach SoTA performance among unsupervised methods for both datasets. [5/n]
February 6, 2025 at 5:43 PM
We use NNetNav to collect around 10k workflows across 20 websites (15 live, 5 self-hosted).

Data is available on 🤗: huggingface.co/datasets/sta...
[4/n]
February 6, 2025 at 5:43 PM
Main ideas behind NNetNav exploration:
1. Complex goals have intermediate subgoals, so complex trajectories must have meaningful sub-trajectories.
2. Use an LM instruction relabeler + judge to test if the trajectory-so-far is meaningful. If yes, continue exploring; otherwise, prune. [3/n]
February 6, 2025 at 5:43 PM
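The prune-as-you-go idea above can be sketched in a few lines. This is a toy, self-contained sketch, not the paper's implementation: `sample_action`, `relabel`, and `judge` are hypothetical stand-ins for the browser environment, the LM instruction relabeler, and the LM judge.

```python
import random

# Toy stand-ins (hypothetical names): the real system samples browser
# actions and uses an LM to relabel/judge trajectory prefixes.
def sample_action(trajectory):
    return trajectory + [random.choice(["click", "type", "scroll"])]

def relabel(trajectory):
    # Real version: an LM writes an instruction the trajectory could satisfy.
    return "do " + " then ".join(trajectory) if trajectory else ""

def judge(trajectory, instruction):
    # Real version: an LM scores whether the prefix is meaningful progress.
    # Toy rule here: prefixes longer than 3 steps are "not meaningful".
    return len(trajectory) <= 3

def nnetnav_explore(max_steps=10, seed=0):
    """Grow a trajectory step by step; prune as soon as the
    relabeled prefix stops looking meaningful to the judge."""
    random.seed(seed)
    trajectory = []
    for _ in range(max_steps):
        candidate = sample_action(trajectory)
        if judge(candidate, relabel(candidate)):
            trajectory = candidate  # keep exploring this prefix
        else:
            break                   # prune this branch of the search
    # Retroactively label the surviving trajectory with an instruction.
    return trajectory, relabel(trajectory)

traj, instr = nnetnav_explore()
```

The pruning check runs on every prefix, so unpromising branches are cut early instead of being fully rolled out and discarded.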
NNetNav uses a structured exploration method to efficiently search and collect traces on live websites. These traces are retroactively labeled into instructions, yielding a strikingly diverse set of workflows for any website (e.g., this plot). [2/n]
February 6, 2025 at 5:43 PM
Now, reviewers are upset if we only finetune sub-10B-parameter models!
November 26, 2024 at 10:28 PM
for more context: we are training the probe on sentences from PTB / BLiMP
November 25, 2024 at 5:52 AM
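For readers unfamiliar with the setup in the replies above: a probe is typically a small classifier trained on a frozen model's hidden states. A minimal sketch with synthetic data (in the posts' setting, the sentences would come from PTB / BLiMP and the features from a pretrained LM; everything here is a toy stand-in):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "frozen hidden states" and a linearly separable toy label
# standing in for a syntactic property of each sentence.
n, d = 200, 16
hidden_states = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
labels = (hidden_states @ w_true > 0).astype(float)

# Train a logistic-regression probe with plain gradient descent,
# leaving the "hidden states" themselves untouched.
w = np.zeros(d)
for _ in range(500):
    p = 1 / (1 + np.exp(-(hidden_states @ w)))
    w -= 0.5 * (hidden_states.T @ (p - labels)) / n

acc = ((hidden_states @ w > 0) == labels).mean()
```

Because the probe is linear and the representations are frozen, its accuracy is read as evidence about what the representations encode, which is exactly why the choice of probing task (SRL vs. something syntax-agnostic) matters in the thread above.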
thx for sharing, though semantic parsing almost certainly benefits from modeling syntax :)
November 25, 2024 at 3:49 AM
SRL probe still rewards hidden states that model dependency relations, no? would like a probe that's agnostic to how well the underlying network models syntax
November 24, 2024 at 10:38 PM
could i get added? thx for making this!!
November 24, 2024 at 5:25 AM