Kush Jain
kjain14.bsky.social
Kush Jain
@kjain14.bsky.social
SE PhD Student at Carnegie Mellon University interested in NLP for software engineering, program analysis and software testing. Former intern at Facebook AI Research.
(5/6) Sampling does not solve this problem either. For test completion pass@k tends to plateau at 90%, and for test suite generation even with extensive sampling, coverage values remain low!
December 19, 2024 at 8:59 PM
(4/6) We analyze errors from top models, finding that even current state-of-the-art models struggle with hallucination and reasoning about execution.
December 19, 2024 at 8:59 PM
(3/6) Models also struggle with test completion, with top models only achieving 63.5% pass@5 for our first test completion setting (coverage improvement is also low at 26.9%).
December 19, 2024 at 8:59 PM
(2/6) Current state-of-the-art models struggle with test suite generation. Even the best model, GPT-4o, only gets 35.2% coverage on TestGenEval.
December 19, 2024 at 8:59 PM
(1/6) TestGenEval is sourced from large scale Python repositories and targets real-world usecases: test authoring simulates a developer writing a test suite from scratch, while test completion mimics a developer aiming to improve the coverage of an existing test suite.
December 19, 2024 at 8:59 PM