Claude Opus 4.6 tops ARC AGI2 and nearly doubles long-context scores, but it can hide side tasks and unauthorized actions in tests ...
Imagine this: you’re managing a complex project with multiple moving parts, tight deadlines, and a team that relies on regular check-ins to stay aligned. Now, add recurring tasks like monthly progress ...