Databricks' OfficeQA uncovers disconnect: AI agents ace abstract tests but stall at 45% on enterprise docs
Sean Michael Kerner, December 9, 2025

There is no shortage of AI benchmarks in the market today, with popular options like Humanity's Last Exam (HLE), ARC-AGI-2 and GDPval, among numerous others. AI agents excel at solving the abstract math problems and passing the PhD-level exams that most benchmarks are based on, but Databricks has a question for the enterprise: can they actually handle the document-heavy work most enterprises need them to do?

The answer, according to new research from the data and AI platform company, is sobering. Even the best-performing AI agents achieve less than 45% accuracy on tasks that mirror real enterprise workloads, exposing a critical gap between academic benchmarks and business reality.

"If we focus our research efforts on getting better at [existing benchmarks], then we're probably not solving the right problems to make Databricks a better platform," Erich Elsen, principal research scientist at Databricks, explained to VentureBeat. "So that's why we were looking around. How do we create a benchmark that, if we get better at it, we're actually getting better at solving the problems that our customers have?"

The result is OfficeQA, a benchmark designed to test AI agents on grounded reasoning: answering questions based on complex proprietary datasets containing unstructured documents and tabular data. Unlike existing benchmarks that focus on abstract capabilities, OfficeQA proxies for the economically valuable tasks enterprises actually perform.

Why academic benchmarks miss the enterprise mark

There are numerous shortcomings of popular AI benchmarks from an enterprise perspective, according to Elsen.