Lightnews — Scholar-powered news

Kush Jain

@kjain14.bsky.social

SE PhD Student at Carnegie Mellon University interested in NLP for software engineering, program analysis and software testing. Former intern at Facebook AI Research.


(6/6) Check out our preprint for more details: arxiv.org/abs/2410.00752 (w/ Gabriel Synnaeve and Baptiste Rozière)

Homepage: testgeneval.github.io
Sample Explorer: testgeneval.github.io/demo.html
Dataset: huggingface.co/datasets/kja...
Code: github.com/facebookrese...

TestGenEval: A Real World Unit Test Generation and Test Completion Benchmark

Code generation models can help improve many common software tasks ranging from code completion to defect prediction. Most of the existing benchmarks for code generation LLMs focus on code authoring o...

arxiv.org

December 19, 2024 at 8:59 PM

(5/6) Sampling does not solve this problem either. For test completion, pass@k tends to plateau at 90%, and for test suite generation, coverage values remain low even with extensive sampling!

December 19, 2024 at 8:59 PM

(4/6) We analyze errors from top models, finding that even current state-of-the-art models struggle with hallucination and reasoning about execution.

December 19, 2024 at 8:59 PM

(3/6) Models also struggle with test completion, with top models only achieving 63.5% pass@5 for our first test completion setting (coverage improvement is also low at 26.9%).
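
For readers unfamiliar with pass@k, here is a minimal sketch of the commonly used unbiased estimator: with n generations per problem, of which c pass, it estimates the chance that at least one of k randomly drawn generations passes. This assumes the standard estimator from the code generation literature; the posts do not say exactly how TestGenEval computes the metric, and the function name `pass_at_k` is illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of generations sampled for the problem
    c: number of those generations that pass
    k: sample budget being evaluated (k <= n)

    Assumption: this is the standard unbiased estimator,
    1 - C(n - c, k) / C(n, k); it is not taken from the TestGenEval code.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k generations fail, every draw of k contains a pass.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

A benchmark-level figure such as pass@5 would then be the mean of the per-problem estimates, for example `mean_pass_at_k([(10, 3), (10, 0)], k=5)` for two hypothetical problems.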

December 19, 2024 at 8:59 PM

(2/6) Current state-of-the-art models struggle with test suite generation. Even the best model, GPT-4o, only gets 35.2% coverage on TestGenEval.

December 19, 2024 at 8:59 PM

(1/6) TestGenEval is sourced from large-scale Python repositories and targets real-world use cases: test authoring simulates a developer writing a test suite from scratch, while test completion mimics a developer aiming to improve the coverage of an existing test suite.

December 19, 2024 at 8:59 PM
