Cohere outperforms competitors in agentic enterprise tasks with Invisible evaluations

Featuring Wojciech Galuba, Director of Data & Evaluations at Cohere

Overview

Invisible's human evaluations show that Cohere's Command A matches or outperforms its competitors across agentic enterprise tasks.

The results

51.7% average win rate
91.5% on the IFEval academic benchmark
92% on the RepoQA coding benchmark
We appreciated Invisible’s ability to stand up a workforce quickly, pivot when needed, and deliver high-quality data that consistently improved model performance.

At a glance

Client profile
Cohere is a leading provider of foundation models, developing AI solutions tailored for enterprise-scale applications. Its latest model, Command A, is a state-of-the-art, highly efficient generative model for enterprises.
Headquarters
Toronto, Ontario, Canada
Industry
Technology
Key Solutions
Invisible training and evaluations

The challenge

We needed to evaluate Command A to see if it delivers the right outcomes in specialized, real-world scenarios. Off-the-shelf benchmarks can signal that general model performance is good while, in reality, a model fails on niche use cases. We needed PhD-level experts across a range of specialisms, including STEM, math, and SQL, as well as subject matter experts in HR, retail, and aviation, for blind annotation.

The outcome

With our expert annotators, Cohere expanded into 10 languages and fine-tuned on rare programming languages to tackle specialized use cases, delivering transformative improvements in model performance.

Client Interview

Interview transcript

Q: Wojciech, tell us more about the specific problem you faced.

To optimize Command A, we needed to understand how well it performed in enterprise scenarios, such as customer service or HR queries. We weren't looking for just accuracy in responses, but nuance: did the model understand tone, context, ambiguity? We needed smart, consistent, scalable human evaluations to tell us that.

At the highest levels of leadership, Invisible has become synonymous with quality.

Q: What made you think Invisible could make a difference?

We had previously partnered with Invisible to train our Command R model for hallucination reduction, and we consider them a valued partner that helps us win in the marketplace. The Invisible team is so passionate when it comes to improving our models. We've trusted them with critical challenges, and their devotion to quality ensures we can develop a best-in-class model.

Q: How did quality improve with Invisible? 

They maintain a really high bar for talent, with continuous observability that ensures we can trust the data. And they’re not afraid to challenge us, posing really complex questions that push us to create better data. 

The result was that Command A is as good as, and in some cases much better than, its competitors at consistently answering in the requested language. For example, take Arabic dialects: its ADI2 score (a human evaluation metric) achieved a 9-point lead over GPT-4o and DeepSeek-V3.

Q: What was the commercial impact? 

In head-to-head human evaluation across business, STEM, and coding tasks, Command A matches or outperforms its larger competitors, while offering superior throughput and increased efficiency. Command A excels on business-critical agentic and multilingual tasks, while being deployable on just two GPUs, compared to other models that typically require as many as 32. Human evaluations drove this success because they test on real-world enterprise data and situations.

Head-to-head human evaluation win rates on enterprise tasks. All examples are blind-annotated by Invisible human annotators, assessing enterprise-focused accuracy, instruction following, and style. Throughput comparisons are between Command A on the Cohere platform, GPT-4o, and DeepSeek-V3 (TogetherAI), as reported by Artificial Analysis.
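For context on how a head-to-head win rate such as the 51.7% average above is typically derived: each blind pairwise judgment records whether the evaluated model wins, ties, or loses against a competitor on the same prompt, and ties are commonly split evenly between the two models, so anything above 50% means the model is preferred on balance. The Python sketch below is purely illustrative; the function, the tie-splitting convention, and the sample counts are our assumptions, not a description of Cohere's or Invisible's actual evaluation pipeline.

```python
from collections import Counter

def win_rate(judgments):
    """Aggregate blind pairwise judgments into a head-to-head win rate.

    Each judgment is "win", "tie", or "loss" for the evaluated model
    against a competitor on the same prompt. Ties are split evenly
    between the two models (a common convention), so a rate above
    50% means the model is preferred on balance.
    """
    counts = Counter(judgments)
    total = sum(counts.values())
    return (counts["win"] + 0.5 * counts["tie"]) / total

# Hypothetical annotations: 9 wins, 4 ties, 7 losses across 20 prompts.
judgments = ["win"] * 9 + ["tie"] * 4 + ["loss"] * 7
print(f"win rate: {win_rate(judgments):.1%}")  # -> win rate: 55.0%
```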

The deep partnership with Invisible stood out—they felt like part of our team and consistently went beyond what we asked for.