Cohere's Command A matches or outperforms its competitors across agentic enterprise tasks, as demonstrated by Invisible's human evaluations.
We appreciated Invisible’s ability to stand up a workforce quickly, pivot when needed, and deliver high-quality data that consistently improved model performance.
To optimize Command A, we needed to understand how well it performed in enterprise scenarios, such as customer service or HR queries. We weren't looking just for accuracy in responses, but for nuance: did the model understand tone, context, and ambiguity? We needed smart, consistent, scalable human evaluations to tell us that.
At the highest levels of leadership, Invisible has become synonymous with quality.
We had previously partnered with Invisible to train our Command R model for hallucination reduction, and we consider them a valued partner that helps us win in the marketplace. The Invisible team is passionate about improving our models. We've trusted them with critical challenges, and their devotion to quality ensures we can develop a best-in-class model.
They maintain a really high bar for talent, with continuous observability that ensures we can trust the data. And they’re not afraid to challenge us, posing really complex questions that push us to create better data.
The result was that Command A is as good as, and in some cases much better than, its competitors at consistently answering in the requested language. Take Arabic dialects, for example: its ADI2 score (a human evaluation metric) achieved a 9-point lead over GPT-4o and DeepSeek-V3.
In head-to-head human evaluation across business, STEM, and coding tasks, Command A matches or outperforms its larger competitors, while offering superior throughput and increased efficiency. Command A excels on business-critical agentic and multilingual tasks, while being deployable on just two GPUs, compared to other models that typically require as many as 32. Human evaluations drove this success because they test on real-world enterprise data and situations.
Head-to-head human evaluation win rates on enterprise tasks. All examples are blind-annotated by Invisible human annotators, assessing enterprise-focused accuracy, instruction following, and style. Throughput comparisons are between Command A on the Cohere platform, GPT-4o, and DeepSeek-V3 (TogetherAI), as reported by Artificial Analysis.
The deep partnership with Invisible stood out—they felt like part of our team and consistently went beyond what we asked for.