
""Each new Claude model has forced us to redesign the test," Hume writes. "When given the same time limit, Claude Opus 4 outperformed most human applicants. That still allowed us to distinguish the strongest candidates - but then, Claude Opus 4.5 matched even those.""
""Under the constraints of the take-home test, we no longer had a way to distinguish between the output of our top candidates and our most capable model," Hume writes."
"In the end, Hume designed a new test that had less to do with optimizing hardware, making it sufficiently novel to stump contemporary AI tools. But as part of the post, he shared the original test to see if anyone reading could come up with a better solution."
Since 2024, Anthropic's performance optimization team has used a take-home test to evaluate job applicants' skills. Rapid improvements in AI coding tools have forced frequent redesigns to prevent AI-assisted cheating, as each Claude model release raised the bar: Claude Opus 4 outperformed most human applicants under the same time limit, and Claude Opus 4.5 matched even the strongest candidates. Without in-person proctoring, the take-home format could no longer reliably distinguish AI-assisted submissions from top human work. Anthropic created a new test that focuses less on hardware optimization and more on problem types novel enough to stump current AI tools, and shared the original test publicly to solicit better solutions.
Read at TechCrunch