How Prompt Complexity Affects GPT-3.5 Mutation Generation Accuracy | HackerNoon
Briefly

The article investigates how effectively large language models (LLMs) generate code mutations for bug detection, comparing GPT-3.5 and CodeLlama-13b against the conventional mutation tool Major. GPT-3.5 delivered the best results overall, detecting a remarkable 96.7% of Defects4J bugs and 86.7% of ConDefects bugs. The study also analyzes the models' coupling rates and the implications of different experimental settings, showing that prompt selection significantly affects mutation generation performance, while acknowledging validity concerns around dataset biases and representation limitations.
Across the compared models, GPT-3.5 achieved the highest bug detection rates on both Defects4J and ConDefects, demonstrating strong mutation generation capability.
Using novel mutation generation techniques, the study assesses several LLMs and reveals marked differences in performance and usability that expose each model's strengths and limitations.
Our findings argue for careful selection of prompts and experimental settings, since both substantially influence the effectiveness of LLM-driven mutation generation; a hedged sketch of such a prompt follows below.
Despite these promising results, we acknowledge threats to validity, including limits in dataset representativeness and potential biases introduced during mutation generation.
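To make the setup concrete, here is a minimal sketch of LLM-driven mutation generation through the OpenAI chat API. The prompt wording, the example `max` method, and the temperature value are illustrative assumptions for this summary, not the study's actual configuration.

```python
# Minimal sketch: asking GPT-3.5 to generate a single mutant of a Java method.
# The prompt and the example method are illustrative, not the study's own setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical code under test.
original = """\
public static int max(int a, int b) {
    return a > b ? a : b;
}"""

prompt = (
    "You are a mutation testing tool. Apply one small syntactic change to "
    "the Java method below so that it may behave incorrectly, and return "
    "only the mutated code.\n\n" + original
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=1.0,  # higher temperature tends to yield more diverse mutants
)

print(response.choices[0].message.content)
```

Each returned mutant would then be compiled and run against the project's test suite; a mutant that some test kills stands in for a real bug, which is how detection and coupling rates like those above are measured.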
Read at HackerNoon