A new benchmark measures how much freelance work AI can actually automate
Scale AI and the Center for AI Safety have released the Remote Labor Index, a new benchmark that evaluates AI models on real freelance assignments. It shows that even the best systems complete fewer than 3% of tasks to professional human standards.
The Specifics:
The benchmark gathered 240 completed projects, including their final deliverables, from vetted Upwork professionals across 23 work categories.
Six AI systems were tested on those same projects, and their outputs were judged against the professional standard set by the original Upwork submissions.
Nearly 97% of outputs failed to meet basic client standards: Manus topped the list with a 2.5% automation rate, followed by Grok 4 and Claude Sonnet 4.5 at 2.1% (see the sketch after this list).
Common failures included poor quality, incomplete deliverables, and broken files; AI succeeded only on narrow tasks such as chart creation, audio mixing, and logo design.
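
To make the headline numbers concrete, here is a minimal sketch of how a per-model automation rate falls out of binary pass/fail grading. Only the 240-task total and the published percentages come from the article; the per-model pass counts below are back-calculated assumptions for illustration, not figures from the benchmark.

```python
# Hypothetical illustration of the Remote Labor Index's headline metric:
# the share of tasks whose AI deliverable met the human professional bar.
# The 240-task total is from the article; the pass counts are assumed.

TOTAL_TASKS = 240

def automation_rate(passed: int, total: int = TOTAL_TASKS) -> float:
    """Fraction of graded tasks where the AI output passed client standards."""
    return passed / total

# 6 of 240 passing deliverables reproduces the reported 2.5% for Manus;
# 5 of 240 reproduces the 2.1% reported for Grok 4 and Claude Sonnet 4.5.
assumed_passes = {"Manus": 6, "Grok 4": 5, "Claude Sonnet 4.5": 5}

for model, passed in assumed_passes.items():
    print(f"{model}: {automation_rate(passed):.1%}")
```

Running this prints 2.5% and 2.1%, which matches the reported rates and underscores how few deliverables (single digits out of 240) actually cleared the bar.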
The disparity between real-world automation and benchmark hype now has a concrete measurement. These findings show that, despite rising reasoning scores, producing complex, client-ready deliverables end to end is still beyond current AI. A human in the loop remains essential, even as agents take over smaller subtasks (at least for now).