“we introduce SAFE, a novel framework that overcomes the lack of human-written proof […] [SAFE achieves] a 70.50% accuracy rate in a benchmark crafted by human experts, [vs] GPT-4o's performance of 24.46%”
arxiv.org/abs/2410.15756
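For a sense of what SAFE's generated proofs look like: Verus checks Rust code against specifications, and the loop invariants are the part a prover (human or model) must supply. A minimal sketch, assuming the Verus toolchain; the function, bounds, and invariants here are illustrative, not taken from the paper:

```rust
use vstd::prelude::*;

verus! {

// Mathematical spec: 0 + 1 + ... + n.
spec fn triangle(n: nat) -> nat
    decreases n,
{
    if n == 0 { 0 } else { n + triangle((n - 1) as nat) }
}

// The `ensures` clause is the specification; the `invariant` lines are
// the proof annotations a system like SAFE must synthesize for Verus
// to accept the function.
fn sum_up_to(n: u32) -> (s: u64)
    requires
        n <= 1000,
    ensures
        s == triangle(n as nat),
{
    let mut s: u64 = 0;
    let mut i: u32 = 0;
    while i < n
        invariant
            i <= n,
            n <= 1000,
            s == triangle(i as nat),
            s <= 1000 * i,  // linear bound that rules out u64 overflow
        decreases n - i,
    {
        i = i + 1;
        s = s + i as u64;
    }
    s
}

} // verus!
```

Delete the `invariant` lines and Verus rejects the function; that gap is exactly what SAFE's generated proofs are meant to fill.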
“… we explore the ability of LLMs to produce verified code in three verification languages (Dafny, Nagini, and Verus) […] we use manually curated datasets derived from the state-of-the-art Python benchmark, HumanEval”
arxiv.org/abs/2503.14183
“Zac Hatfield-Dodds [argues] that relying solely on verification methods may not provide real AI safety”
youtu.be/bs5snugP1VA?...
“We introduce the Formally Verified Automated Programming Progress Standards, or FVAPPS, a benchmark of 4715 samples […] including 1083 curated and quality controlled samples”
arxiv.org/abs/2502.05714
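The samples pair a Lean 4 definition stub with property theorems left as `sorry` for the model (or a human) to fill in and prove. A hypothetical sample shape, with illustrative names rather than actual benchmark content:

```lean
-- Function stub plus property theorems, all sorried out;
-- the task is to implement the function and discharge the proofs.
def maxRun (xs : List Nat) : Nat := sorry

theorem maxRun_le_length (xs : List Nat) :
    maxRun xs ≤ xs.length := sorry

theorem maxRun_nil : maxRun [] = 0 := sorry
```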
Exponentiating Mathematics (expMath) aims to accelerate the rate of progress in pure math through the development of an AI collaborator and new professional-level math benchmarks.
sam.gov/opp/4def3c13...
"[We combine] LLMs with static analysis to perform whole-repository reasoning for security vulnerability detection. [...] IRIS leverages LLMs to infer taint specifications and perform contextual analysis"
arxiv.org/abs/2405.17238
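A "taint specification" labels library methods as sources of untrusted data, sinks that must not receive it, or sanitizers that neutralize it. A toy sketch of the idea in Rust; the method names and the flat call-trace check are illustrative only (IRIS itself infers specs for CodeQL's dataflow analysis over Java repositories):

```rust
// Roles a taint specification assigns to library methods.
enum Role { Source, Sink, Sanitizer }

// One inferred specification entry (illustrative format, not IRIS's).
struct TaintSpec {
    method: &'static str,
    role: Role,
}

fn main() {
    // Specs an LLM might infer for a Java web/database API surface.
    let specs = vec![
        TaintSpec { method: "HttpServletRequest.getParameter", role: Role::Source },
        TaintSpec { method: "Statement.executeQuery", role: Role::Sink },
        TaintSpec { method: "Encoder.encodeForSQL", role: Role::Sanitizer },
    ];

    // Toy linear call trace; the real analysis tracks dataflow across
    // a whole repository, not a single path.
    let trace = ["HttpServletRequest.getParameter", "Statement.executeQuery"];

    let mut tainted = false;
    for call in trace {
        match specs.iter().find(|s| s.method == call) {
            Some(TaintSpec { role: Role::Source, .. }) => tainted = true,
            Some(TaintSpec { role: Role::Sanitizer, .. }) => tainted = false,
            Some(TaintSpec { role: Role::Sink, .. }) if tainted => {
                println!("potential injection: tainted data reaches {call}");
            }
            _ => {}
        }
    }
}
```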
"[We combine] LLMs with static analysis to perform whole-repository reasoning for security vulnerability detection. [...] IRIS leverages LLMs to infer taint specifications and perform contextual analysis"
arxiv.org/abs/2405.17238
"AlphaVerus [is] a self-improving framework that bootstraps formally verified code generation by iteratively translating programs from a higher-resource language
arxiv.org/abs/2412.06176
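The shape of that loop, sketched in Rust: translate candidates, keep only what the verifier accepts, and feed the survivors back as few-shot exemplars for the next round. `translate_with_llm` and `verus_verify` are hypothetical stand-ins for a model call and a verifier invocation, not the paper's implementation:

```rust
// Hypothetical stand-in: prompt a model with the source program plus
// previously verified exemplars as few-shot context.
fn translate_with_llm(source: &str, exemplars: &[String]) -> String {
    format!("// translated from {} with {} exemplars", source, exemplars.len())
}

// Hypothetical stand-in: run the Verus verifier on the candidate and
// check its exit status. Stubbed to always fail here.
fn verus_verify(_candidate: &str) -> bool {
    false
}

// Self-improvement loop: only verifier-accepted translations enter the
// exemplar pool, so later rounds condition on better examples, with no
// human-written proofs anywhere in the loop.
fn bootstrap(corpus: &[&str], rounds: usize) -> Vec<String> {
    let mut exemplars: Vec<String> = Vec::new();
    for _ in 0..rounds {
        for source in corpus {
            let candidate = translate_with_llm(source, &exemplars);
            if verus_verify(&candidate) {
                exemplars.push(candidate);
            }
        }
    }
    exemplars
}
```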
"AlphaVerus [is] a self-improving framework that bootstraps formally verified code generation by iteratively translating programs from a higher-resource language
arxiv.org/abs/2412.06176
"This workshop explores the intersection of scale-driven generative artificial intelligence (AI) and the correctness-focused principles of verification."
verifai-workshop.github.io
"This workshop explores the intersection of scale-driven generative artificial intelligence (AI) and the correctness-focused principles of verification."
verifai-workshop.github.io
"...we propose Laurel, a tool that uses LLMs to automatically generate helper assertions for Dafny [...] Laurel is able to generate over 50% of the required helper assertions given only a few attempts"
arxiv.org/abs/2405.16792
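Laurel targets Dafny; to keep this roundup's examples in one language, here is the same idea in Verus. A helper assertion hands the prover an intermediate fact it cannot find on its own: below, the `assert ... by (nonlinear_arith)` line discharges a nonlinear fact in an isolated solver query, without which default (linear) reasoning fails the overflow check and the postcondition. Illustrative sketch, assuming the Verus toolchain:

```rust
use vstd::prelude::*;

verus! {

fn square(x: i64) -> (r: i64)
    requires
        -1000 <= x && x <= 1000,
    ensures
        r == x * x,
{
    // Helper assertion of the kind Laurel synthesizes (for Dafny):
    // it proves the nonlinear bound separately so the surrounding
    // linear reasoning can use it.
    assert(0 <= x * x && x * x <= 1000000) by (nonlinear_arith)
        requires -1000 <= x && x <= 1000;
    x * x
}

} // verus!
```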
"...we propose Laurel, a tool that uses LLMs to automatically generate helper assertions for Dafny [...] Laurel is able to generate over 50% of the required helper assertions given only a few attempts"
arxiv.org/abs/2405.16792
arxiv.org/abs/2210.12283
aws.amazon.com/blogs/aws/pr...
drive.google.com/file/d/1ybQx...
arxiv.org/abs/2405.01787