Meta's benchmarks for its new AI models are a bit misleading | TechCrunch
Briefly

Meta's new AI model, Maverick, ranks second on LM Arena, but there is a concerning discrepancy between the version used for benchmarking and the one available to developers. Researchers noted that the LM Arena variant, described as an 'experimental chat version,' behaves noticeably differently from the publicly downloadable release, using far more emojis and giving much more verbose answers. The mismatch raises questions about the reliability of AI benchmarks and makes it harder for developers to predict how the model will perform, since the benchmark results reflect a version they can't actually use. Meta, for its part, aims to give a clearer picture of the model's capabilities across various tasks, despite the limitations of current benchmarks.
"The problem with tailoring a model to a benchmark, withholding it, and then releasing a 'vanilla' variant... is that it makes it challenging for developers to predict exactly how well the model will perform in particular contexts."
"Researchers on X have observed stark differences in the behavior of the publicly downloadable Maverick compared with the model hosted on LM Arena...The LM Arena version seems to use a lot of emojis, and give incredibly long-winded answers."
Read at TechCrunch