I put ChatGPT-4o and 5.1 through nine real-world tests, from logic puzzles to coding, writing and image analysis.
But when my editor told me he had vibe-coded a Minesweeper remake, my wheels started spinning and eventually went off the rails.