Cohere outperforms competitors in agentic enterprise tasks with Invisible evaluations

Featuring Wojciech Galuba, Director of Data & Evaluations at Cohere

Overview

Invisible's human evaluations show that Cohere's Command A matches or outperforms its competitors across agentic enterprise tasks.

The results

51.7% average win rate
91.5% on the IFEval academic benchmark
92% on the RepoQA coding benchmark
We appreciated Invisible’s ability to stand up a workforce quickly, pivot when needed, and deliver high-quality data that consistently improved model performance.

At a glance

Client profile
Cohere is a leading provider of foundation models, developing AI solutions tailored for enterprise-scale applications. Its latest model, Command A, is a state-of-the-art, highly efficient generative model for enterprises.
Headquarters
Toronto, Ontario, Canada
Industry
Technology
Key Solutions
Invisible training and evaluations

The challenge

We needed to evaluate Command A to see if it delivers the right outcomes in specialized, real-world scenarios. Off-the-shelf benchmarks can signal that general model performance is good while, in reality, a model fails on niche use cases. We needed PhD-level experts across a range of specialisms, including STEM, math, and SQL, as well as subject matter experts in HR, retail, and aviation, for blind annotation.

The outcome

With our expert annotators, Cohere expanded into 10 languages and fine-tuned on rare programming languages to tackle specialized use cases, delivering transformative improvements in model performance.

Client Interview

Interview transcript

Q: Wojciech, tell us more about the specific problem you faced.

To optimize Command A, we needed to understand how well it performed in enterprise scenarios, such as customer service or HR queries. We weren't looking for just accuracy in responses, but nuance: did the model understand tone, context, ambiguity? We needed smart, consistent, scalable human evaluations to tell us that.

At the highest levels of leadership, Invisible has become synonymous with quality.

Q: What made you think Invisible could make a difference?

We had previously partnered with Invisible to train our Command R model for hallucination reduction, and we consider them a valued partner that helps us win in the marketplace. The Invisible team is so passionate when it comes to improving our models. We've trusted them with critical challenges, and their devotion to quality ensures we can develop a best-in-class model.

Q: How did quality improve with Invisible? 

They maintain a really high bar for talent, with continuous observability that ensures we can trust the data. And they’re not afraid to challenge us, posing really complex questions that push us to create better data. 

The result was that Command A is as good as, and in some cases much better than, its competitors at consistently answering in the requested language. For example, take Arabic dialects: its ADI2 score (a human evaluation metric) achieved a 9-point lead over GPT-4o and DeepSeek-V3.

Q: What was the commercial impact? 

In head-to-head human evaluation across business, STEM, and coding tasks, Command A matches or outperforms its larger competitors, while offering superior throughput and increased efficiency. Command A excels on business-critical agentic and multilingual tasks, while being deployable on just two GPUs, compared to other models that typically require as many as 32. Human evaluations drove this success because they test on real-world enterprise data and situations.

Head-to-head human evaluation win rates on enterprise tasks. All examples are blind-annotated by Invisible human annotators, assessing enterprise-focused accuracy, instruction following, and style. Throughput comparisons are between Command A on the Cohere platform, GPT-4o, and DeepSeek-V3 (TogetherAI), as reported by Artificial Analysis.
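For context on how a head-to-head win rate such as the 51.7% average above is typically derived: each blind pairwise judgment records whether the evaluated model wins, ties, or loses against a competitor on the same prompt, and ties are commonly split evenly between the two models, so anything above 50% means the model is preferred on balance. The Python sketch below is purely illustrative; the function, the tie-splitting convention, and the sample counts are our assumptions, not a description of Cohere's or Invisible's actual evaluation pipeline.

```python
from collections import Counter

def win_rate(judgments):
    """Aggregate blind pairwise judgments into a head-to-head win rate.

    Each judgment is "win", "tie", or "loss" for the evaluated model
    against a competitor on the same prompt. Ties are split evenly
    between the two models (a common convention), so a rate above
    50% means the model is preferred on balance.
    """
    counts = Counter(judgments)
    total = sum(counts.values())
    return (counts["win"] + 0.5 * counts["tie"]) / total

# Hypothetical annotations: 9 wins, 4 ties, 7 losses across 20 prompts.
judgments = ["win"] * 9 + ["tie"] * 4 + ["loss"] * 7
print(f"win rate: {win_rate(judgments):.1%}")  # -> win rate: 55.0%
```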

The deep partnership with Invisible stood out—they felt like part of our team and consistently went beyond what we asked for.