Anirudh Khatry
anirudhkhatry.bsky.social
CS PhD @utaustin.bsky.social
Love this!
June 2, 2025 at 3:25 PM
Congratulations Kanishka!
June 2, 2025 at 3:24 PM
Models often fail to:
1. Respect ownership rules
2. Infer type information
3. Follow idiomatic Rust interfaces
4. Preserve correct lifetimes
In the paper, we provide a taxonomy of common LLM mistakes.
🧵[5/6]
April 23, 2025 at 5:00 PM
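To make the four failure modes above concrete, here is a minimal sketch of what a safe, idiomatic translation has to get right. The C prototypes in the comments and all Rust names are hypothetical illustrations, not taken from CRUST-Bench or the paper.

```rust
// Hypothetical C originals (illustrative only, not from CRUST-Bench):
//   const char *longest(const char *a, const char *b);
//   int sum(const int *xs, size_t n);
//   int *append(int *buf, size_t *n, int v);   /* may realloc buf */

/// Lifetimes: the result borrows from the inputs, so it must not outlive them.
/// Returns the longer slice; ties go to `a`.
pub fn longest<'a>(a: &'a str, b: &'a str) -> &'a str {
    if b.len() > a.len() { b } else { a }
}

/// Idiomatic interface and types: a slice replaces the (pointer, length) pair,
/// and the element type is stated once in the signature.
pub fn sum(xs: &[i32]) -> i32 {
    xs.iter().sum()
}

/// Ownership: the caller hands the buffer over and gets it back,
/// so no stale pointer to a reallocated buffer can survive the call.
pub fn append(mut buf: Vec<i32>, v: i32) -> Vec<i32> {
    buf.push(v);
    buf
}
```

A translation that keeps raw pointers inside `unsafe` blocks might still compile, but it would give up exactly the guarantees these signatures encode.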
We evaluate state-of-the-art closed-source LLMs (like o1, Claude-3.7, and Gemini-1.5-Pro), open-source models like QwQ-32B and virtuoso-32B, and the SWE-Agent on CRUST-Bench.
Even the best model—OpenAI's o1—passes only 15/100 tasks in a single-shot setting.
🧵[4/6]
April 23, 2025 at 5:00 PM
Our benchmark is the first to provide:
1. Rust tests
2. Rust interfaces, which are necessary for the transpiled code to work with the tests
3. A sizable number of real-scale transpilation problems.
🧵[3/6]
April 23, 2025 at 5:00 PM
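As a rough illustration of how a Rust interface and a Rust test fit together, consider the sketch below. The `Stack` type, its methods, and the test are hypothetical and not drawn from the benchmark; they assume a C source that implemented an integer stack with manual allocation.

```rust
/// Hypothetical interface: the signatures are what the tests depend on,
/// and a transpiled implementation has to fill in the bodies.
#[derive(Default)]
pub struct Stack {
    items: Vec<i32>,
}

impl Stack {
    /// Creates an empty stack.
    pub fn new() -> Self {
        Stack { items: Vec::new() }
    }

    /// Pushes a value on top of the stack.
    pub fn push(&mut self, value: i32) {
        self.items.push(value);
    }

    /// Removes and returns the top value, or `None` if the stack is empty
    /// (where a C version might return a sentinel or an error code).
    pub fn pop(&mut self) -> Option<i32> {
        self.items.pop()
    }

    /// Number of stored values.
    pub fn len(&self) -> usize {
        self.items.len()
    }

    /// True when no values are stored.
    pub fn is_empty(&self) -> bool {
        self.items.is_empty()
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    // The test only touches the public interface, so any implementation
    // that honors the signatures can be checked against it.
    #[test]
    fn push_then_pop_is_lifo() {
        let mut s = Stack::new();
        s.push(1);
        s.push(2);
        assert_eq!(s.pop(), Some(2));
        assert_eq!(s.pop(), Some(1));
        assert_eq!(s.pop(), None);
    }
}
```

With the interface fixed up front, running `cargo test` against a candidate translation checks both that it compiles against the expected signatures and that it behaves as the tests require.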
Transpiling C to Rust helps modernize legacy code with memory safety guarantees. CRUST-Bench evaluates whether transpilation methods yield safe, idiomatic Rust, using handcrafted interfaces and tests to ensure safety and validate correctness.
🧵[2/6]
April 23, 2025 at 5:00 PM