We run the same source text through every MT and LLM engine, score the output with one calibrated metric stack, and rank the results against your own human reference. You get a recommendation you can defend: best quality, best value, and the evidence behind both.
Anyone can print a leaderboard. The hard part is knowing whether the numbers can be trusted for your language pair and your content. That is what we calibrate.
Reference-based neural scoring (XCOMET class) leads, reference-free quality estimation (QE) adds coverage, an LLM judge explains errors in MQM terms, and a lexical floor keeps morphology honest. One composite, weighted by what actually agrees with humans.
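One way to read "weighted by what actually agrees with humans" is a weighted mean in which each metric's weight is its measured agreement with human reviewers. The sketch below assumes that reading. The function name and the metric keys are hypothetical, and every score is assumed to be normalized to a 0 to 1 scale before it is combined.

```python
def composite_score(scores: dict[str, float], agreement: dict[str, float]) -> float:
    """Weighted mean of metric scores.

    `scores` maps a metric name to its normalized score for one engine.
    `agreement` maps a metric name to its agreement with human reviewers
    on this language pair (hypothetical values, e.g. a rank correlation).
    """
    # A metric with negative or unknown agreement contributes no weight.
    weights = {name: max(agreement.get(name, 0.0), 0.0) for name in scores}
    total = sum(weights.values())
    if total == 0:
        raise ValueError("no metric has positive agreement with humans")
    return sum(scores[name] * weights[name] for name in scores) / total


# Illustrative call with made-up numbers.
example = composite_score(
    {"reference_based": 0.90, "reference_free": 0.80, "lexical": 0.60},
    {"reference_based": 0.50, "reference_free": 0.30, "lexical": 0.10},
)
```

With this shape, a metric that tracks human judgment poorly on a given pair fades out of the composite on its own, rather than being removed by hand.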
Every metric is graded by its agreement with human MQM reviewers on your language pair. High agreement earns a confidence badge; low agreement routes the pair to human LQA instead of pretending.
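As a minimal sketch of that grading step, agreement can be measured as a rank correlation between a metric's segment scores and the human MQM scores for the same segments. Kendall's tau is used here as one common choice, not necessarily the statistic used in the product. The 0.4 threshold and the two route labels are hypothetical placeholders.

```python
from itertools import combinations


def kendall_tau(metric: list[float], human: list[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / total pairs."""
    if len(metric) != len(human) or len(metric) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(metric)), 2):
        direction = (metric[i] - metric[j]) * (human[i] - human[j])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
        # Ties count toward the total but toward neither side.
    total_pairs = len(metric) * (len(metric) - 1) // 2
    return (concordant - discordant) / total_pairs


def route_language_pair(tau: float, threshold: float = 0.4) -> str:
    """Return a hypothetical routing label based on agreement.

    The threshold is an assumed value for illustration only.
    """
    return "confidence_badge" if tau >= threshold else "human_lqa"
```

Both inputs must be oriented the same way (higher means better) before they are compared, so MQM penalty totals would need to be negated first.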
An engine that drops a placeholder or breaks an inline tag is blocked from the recommendation, no matter how fluent it sounds. Broken variables are a production incident, not a style choice.
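A gate like this can be approximated by extracting placeholders and inline tags from the source and the translation and requiring the two multisets to match. The sketch below is an assumption about how such a check might look. The regular expression covers only a few common conventions (`{name}`, `{{name}}`, `%s`, `%1$s`, and XML-style tags), and the function names are hypothetical.

```python
import re
from collections import Counter

# Matches {{var}}, {var} or {0}, printf-style %s / %d / %1$s, and inline tags.
PROTECTED_TOKEN = re.compile(
    r"\{\{\w+\}\}"          # double-brace variables
    r"|\{\w*\}"             # single-brace variables
    r"|%(?:\d+\$)?[sd]"     # printf-style placeholders
    r"|</?[A-Za-z][^<>]*>"  # opening, closing, or self-closing tags
)


def protected_tokens(text: str) -> Counter:
    """Count every placeholder and tag found in the text."""
    return Counter(PROTECTED_TOKEN.findall(text))


def passes_gate(source: str, translation: str) -> bool:
    """True only if the translation keeps exactly the same tokens."""
    return protected_tokens(source) == protected_tokens(translation)
```

Comparing counts rather than positions lets the translation reorder variables, which many target languages require, while still catching a dropped placeholder or an unclosed tag.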
A recommended engine with the rationale, a leaderboard with cost and latency, a quality-versus-cost value map, and segment-level comparisons with MQM error spans highlighted. Re-rank by priority: balanced, top quality, or best value.
Quality per dollar and latency sit next to the score, so the best-value pick is explicit instead of a gut feeling.
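To make the value pick concrete, here is a minimal sketch that computes quality per dollar and re-ranks engines by priority. The `EngineResult` fields, the cost unit, and the priority names are assumptions for illustration. Only two priorities are shown, and latency is left out of the ranking.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EngineResult:
    name: str
    quality: float      # composite score, higher is better
    cost_usd: float     # assumed: cost to translate the test corpus


def quality_per_dollar(result: EngineResult) -> float:
    """Quality divided by cost; a free engine is treated as infinite value."""
    if result.cost_usd <= 0:
        return float("inf")
    return result.quality / result.cost_usd


def rank(results: list[EngineResult], priority: str = "top_quality") -> list[str]:
    """Return engine names, best first, for the chosen priority."""
    if priority == "top_quality":
        key = lambda r: r.quality
    elif priority == "best_value":
        key = quality_per_dollar
    else:
        raise ValueError(f"unknown priority: {priority}")
    return [r.name for r in sorted(results, key=key, reverse=True)]
```

The two orderings often disagree, which is why the value pick and the quality pick are worth reporting separately.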
Engines drift with every release. Re-run the same corpus quarterly and see the ranking move before your customers do.
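Ranking movement between two runs of the same corpus can be summarized as a position change per engine. This is a sketch under the assumption that each run yields an ordered list of engine names, best first. The function name is hypothetical.

```python
def rank_changes(previous: list[str], current: list[str]) -> dict[str, int]:
    """Positions gained (positive) or lost (negative) since the last run.

    Engines that appear in only one of the two runs are skipped, as are
    engines whose position did not change.
    """
    old_position = {name: index for index, name in enumerate(previous)}
    changes = {}
    for new_index, name in enumerate(current):
        if name not in old_position:
            continue
        delta = old_position[name] - new_index
        if delta != 0:
            changes[name] = delta
    return changes
```

An empty result means the ranking held steady. Any non-empty result is a prompt to look at the segment-level comparisons for the engines that moved.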
Tell us your stack and your language pairs. We will set you up with a workspace and a first verdict in under a minute.