Startups · The Decoder · July 3, 2026

UK's AI Security Institute finds standard benchmarks systematically underestimate what AI agents can actually do

In a study covering seven benchmarks, the UK's AI Security Institute shows that standard AI evaluations systematically underestimate agent capabilities by capping the compute budget. On software engineering tasks, success rates jumped about 25 percent when the token budget was increased tenfold. Newer models benefit the most. Depending on the token budget, actual progress at the frontier is about…

This is a summary curated by AIFuture. The complete article is available at the original source, The Decoder.

More AI News

AI Research · MarkTechPost

How to Build an End-to-End OCR Pipeline with Baidu’s Unlimited-OCR for High-Resolution Images and Multi-Page PDF Parsing

In this tutorial, we build a complete workflow for running Baidu’s Unlimited-OCR model on document images and multi-page PDFs. From configuring the GPU environment to comparing high-detail tiled Gundam inference and faster Base modes, you'll learn how to process dense layouts, tables, and cross-page content in a reproducible, end-to-end pipeline.

Jul 24, 2026

AI Research · TechCrunch

How AI guardrails are impeding the work of offensive cybersecurity researchers

We spoke with several cybersecurity researchers, who look for unknown vulnerabilities and develop tools to exploit them, about how OpenAI’s and Anthropic’s guardrails affect their work.

Jul 24, 2026

Product · The Verge

Alexa Plus is getting an AI update to handle more complicated instructions

Amazon is launching an update to its Alexa Plus assistant that will allow it to connect to smart home devices in new ways. With the update, which is currently in preview, Alexa Plus can link up with tech from Bosch, Delta, Ecovacs, iRobot, Yale Home, Whirlpool, Tapo, Eufy, and others, while automatically routing requests to the correct device. In an example shared by Amazon, a person with a…

Jul 23, 2026