LLM benchmarks can be misleading. So much so that Anthropic and OpenAI are investing millions to try to address this challenge. The natural instinct is to pick the model with the highest eval % and call it a day, right? Not exactly.

1. Public benchmarks use datasets that are not reflective of common usage by consumers or enterprise users - check out MMLU and HellaSwag for yourself.
2. A recent study from Surge AI found that a third of these datasets contain typos and "nonsensical" writing.
3. There's no way to tell if the LLM is actually reasoning, or merely regurgitating an answer it saw during training - a problem known as contamination.

The bottom line: no public benchmark will be reflective of YOUR data. To trust AI models on your data and tasks, you'll have to create your own evaluation datasets.

The value of benchmarks increases the more specific they are. For example, Anthropic just announced an initiative to fund the development of new types of benchmarks (cyber attacks, manipulation, deception, etc.).

In Refuel's case, we've developed use-case-specific and industry-specific benchmarks, such as in financial services and retail (pictured below). We've worked with our customers to carefully construct these benchmarks, with significant involvement from domain experts, so they track real-world performance as closely as possible.

Is your business using the right evals?
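To make "create your own evaluation datasets" concrete, here is a minimal sketch of what a custom eval harness can look like. Everything here is illustrative: the dataset examples are made up, and `toy_model` is a hypothetical stand-in for whatever LLM client you actually call.

```python
def evaluate(model, dataset):
    """Return exact-match accuracy of `model` over a labeled dataset.

    `dataset` is a list of dicts with "input" and "expected" keys,
    drawn from your own domain rather than a public benchmark.
    """
    correct = 0
    for example in dataset:
        prediction = model(example["input"])
        # Normalize before comparing so trivial formatting differences
        # don't count as errors.
        if prediction.strip().lower() == example["expected"].strip().lower():
            correct += 1
    return correct / len(dataset)

# Illustrative examples only; in practice these come from domain experts.
dataset = [
    {"input": "Is 'ACH transfer' a payment method? (yes/no)", "expected": "yes"},
    {"input": "Is 'SKU-1042' a payment method? (yes/no)", "expected": "no"},
]

# Trivial rule-based stand-in for an LLM call, just to show the shape.
def toy_model(prompt):
    return "yes" if "ACH" in prompt else "no"

print(evaluate(toy_model, dataset))  # 1.0
```

Swap `toy_model` for a real model call and grow the dataset with expert-labeled examples, and the resulting accuracy number reflects your data, not a public benchmark's.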