WORKFORCE AUTOMATION
Issue #474
9 Nov 2025
Part of the reason there is a gap between how AI models score on evaluations and how they perform in practice is that evaluations are usually abstract exams or single tasks rather than real-world, in-context, multi-task projects. So let's test some leading AI agents on actual projects from Upwork. The headline stat: only 2.5% of the projects were completed to production standard, with Manus the leading agent. A great technical paper, and easier to digest than you might first think. Download the PDF or view the website breakdown.