Making AI-Powered Mutation Testing Reliable and Fair | HackerNoon
Briefly

This article examines threats to the validity of results in mutation testing with Large Language Models (LLMs). The authors adopt protocols to mitigate risks such as data leakage and sensitivity to experiment settings. By using widely studied models and established datasets such as Defects4J and ConDefects, the research strengthens confidence in its findings. Evaluation metrics and baseline approaches are laid out to support a comprehensive analysis, and the error types behind non-compilable mutations are examined, adding to the understanding of LLM capabilities in this field.
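For context, the evaluation in such studies typically reduces to a few simple ratios over the generated mutants. The sketch below is illustrative only, not the authors' tooling; the data structure and field names are assumptions. It shows how a compilation rate and a mutation score might be computed from per-mutant outcomes.

```python
# A minimal sketch (not the study's actual pipeline) of two metrics commonly
# reported in LLM-based mutation testing: the fraction of generated mutants
# that compile, and the fraction of compilable mutants killed by the tests.
from dataclasses import dataclass

@dataclass
class MutantOutcome:
    compiles: bool            # did the mutated program compile?
    killed: bool              # did at least one test fail on the mutant?
    equivalent: bool = False  # manually judged semantically equivalent

def summarize(outcomes: list[MutantOutcome]) -> dict[str, float]:
    total = len(outcomes)
    compilable = [o for o in outcomes if o.compiles]
    # Equivalent mutants are conventionally excluded from the mutation score.
    scored = [o for o in compilable if not o.equivalent]
    killed = sum(o.killed for o in scored)
    return {
        "compilation_rate": len(compilable) / total if total else 0.0,
        "mutation_score": killed / len(scored) if scored else 0.0,
    }

if __name__ == "__main__":
    sample = [
        MutantOutcome(compiles=True, killed=True),
        MutantOutcome(compiles=True, killed=False),
        MutantOutcome(compiles=False, killed=False),
    ]
    print(summarize(sample))  # {'compilation_rate': 0.666..., 'mutation_score': 0.5}
```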
We adopt the most widely studied models, popular programming languages, and established datasets in our research to mitigate threats to the validity of our findings.
Data leakage, particularly from Defects4J, poses a significant threat to the validity of our results, prompting the incorporation of the ConDefects dataset, which was built from recent programming-contest tasks that are less likely to appear in model training data.
Our findings highlight the sensitivity of results to the chosen experiment settings, underscoring the need for robust methodologies in mutation testing and LLM evaluations.
We hypothesize that some tools may have been tuned toward specific faults and therefore produce mutants that exactly match them, which could influence the outcomes of our study.
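One way to probe that hypothesis is to measure how often generated mutants reproduce known real faults verbatim. The sketch below is a hypothetical check, not the study's protocol; the helper names and sample data are made up for illustration.

```python
# A minimal sketch of an exact-match check: flag generated mutants whose
# mutated line is textually identical to the buggy line of a known real
# fault (e.g., from Defects4J). Data and names here are illustrative.
def normalize(line: str) -> str:
    # Collapse whitespace so formatting differences don't hide exact matches.
    return " ".join(line.split())

def exact_match_rate(mutated_lines: list[str], known_faulty_lines: list[str]) -> float:
    """Fraction of generated mutants that reproduce a known faulty line verbatim."""
    faults = {normalize(l) for l in known_faulty_lines}
    if not mutated_lines:
        return 0.0
    hits = sum(normalize(m) in faults for m in mutated_lines)
    return hits / len(mutated_lines)

# A noticeably higher rate on Defects4J than on a post-cutoff dataset such as
# ConDefects would hint at leakage or fault-specific tuning.
print(exact_match_rate(
    ["if (x >= 0) {", "return a - b;"],
    ["if  (x >= 0)  {", "int y = z + 1;"],
))  # 0.5
```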
Read at Hackernoon