OpenAI GeneBench-Pro: novel benchmark for AI agent judgment in messy biological research workflows
GeneBench-Pro and the Benchmark That Actually Matters

Most AI benchmarks are designed to be solved. That’s the problem. You format the question cleanly, the model retrieves the right token sequence, the number goes up, the press release goes out. Meanwhile, anyone who has spent time in actual computational biology is quietly losing their mind, because…
