Reading time: 10 min | Stephen Tyler | May 26, 2023
Translation quality scoring is the process of evaluating a translation and assigning it a numerical or qualitative score. It measures the degree to which a translated text meets specific criteria or standards.
Scoring translation quality is desirable for several reasons:
The mechanisms for assessing translation quality vary depending on whether you are assessing a linguist or language service provider (LSP), an NMT model, or a specific translation produced by an NMT model. The goal in all cases is to assess translation quality along a set of dimensions including accuracy, fluency, cultural appropriateness, style, and terminology.
The most widely accepted and used method of translation quality assessment is the Multidimensional Quality Metrics (MQM) model. MQM defines a taxonomy of error types. Linguist reviewers annotate translations, marking each error with the relevant type and a severity level of Minor, Major, or Critical. The annotations can then be reduced to a quality score from 0 to 100: starting from a perfect score of 100, points are deducted for each error according to weighting factors for its type and severity. Organizations can adjust the weights so that the score reflects the aspects of quality they care about most. For instance, some organizations may focus mostly on accuracy and fluency with less concern for tone and style, while others may care more about style or terminology.
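The deduction scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not an official MQM implementation: the weight values, the error-type names, and the `mqm_style_score` function are assumptions chosen for the example, and the sketch does not normalize by text length, which practical scoring schemes often do.

```python
from dataclasses import dataclass

# Hypothetical weights, for illustration only. Each organization tunes its own.
SEVERITY_WEIGHTS = {"minor": 1.0, "major": 5.0, "critical": 25.0}
TYPE_WEIGHTS = {"accuracy": 1.0, "fluency": 1.0, "terminology": 1.0, "style": 0.5}


@dataclass(frozen=True)
class ErrorAnnotation:
    """One error marked up by a linguist reviewer."""
    error_type: str  # a key of TYPE_WEIGHTS
    severity: str    # a key of SEVERITY_WEIGHTS


def mqm_style_score(errors, type_weights=TYPE_WEIGHTS,
                    severity_weights=SEVERITY_WEIGHTS):
    """Start from a perfect 100 and deduct a weighted penalty per error."""
    penalty = sum(
        type_weights[e.error_type] * severity_weights[e.severity]
        for e in errors
    )
    # Clamp at 0 so the score stays within the 0-100 range.
    return max(0.0, 100.0 - penalty)


annotations = [
    ErrorAnnotation("accuracy", "major"),
    ErrorAnnotation("style", "minor"),
]
print(mqm_style_score(annotations))
```

An organization that cares less about style would lower the `style` entry in the type weights, and the same annotations would then yield a higher score.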
Linguists can also use this form of scoring to review the output of Neural Machine Translation (NMT) models. While NMT models are typically evaluated with the other methods described below, it is useful to benchmark those methods against a linguist quality review of a representative sample. This confirms that the automated metrics correlate with human evaluation and can therefore be trusted.
Machine Quality Evaluation and Quality Estimation are similar terms that mean quite different things.
Quality Evaluation is the process of evaluating a newly trained NMT model. A small portion of the training data is held out: it is not used to train the model, only to evaluate it. This held-out portion constitutes a set of known-good, or "reference", translations. Once the model is trained on all the other data, it is asked to produce translations for all the sentences in the evaluation set. These translations are then programmatically compared with the reference translations. BLEU is a commonly used metric that measures how close the model-produced translations are to the reference translations. TER, F-measure, and COMET are other commonly used metrics that differ in the details of how the scoring is done but serve the same purpose.
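To make the reference comparison concrete, here is a minimal sketch of a BLEU-style score: clipped n-gram precision for n = 1 to 4, combined by geometric mean and multiplied by a brevity penalty. It assumes whitespace tokenization, a single reference per sentence, and no smoothing. The function name `corpus_bleu_sketch` and the sample sentences are illustrative, and a real evaluation would normally rely on an established implementation rather than this one.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams in a token list (empty if the list is shorter than n)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu_sketch(hypotheses, references, max_n=4):
    """BLEU-style score (0-100) of model outputs against reference translations."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp_tokens, n)
            ref_counts = ngram_counts(ref_tokens, n)
            # Clip each n-gram's credit to its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    # Without smoothing, a zero match count at any order gives a score of 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalize outputs that are shorter than the references overall.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)


# Held-out reference translations and hypothetical model outputs.
references = ["the cat sat on the mat", "there is a book on the table"]
hypotheses = ["the cat sat on the mat today", "there is a book on the desk"]
print(corpus_bleu_sketch(hypotheses, references))
```

The score is 100 only when every output matches its reference exactly, and it falls as fewer n-grams overlap.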
Quality Estimation (QE) is different. It is closer in purpose to traditional human quality scoring: it assesses the quality of an individual translation as objectively as possible, but without known-good reference translations to compare against.
This is not a trivial problem to solve. It is incredibly hard for any algorithm or AI classifier to accurately assess all the dimensions of translation quality described earlier and be 100% right 100% of the time so that the machine evaluation correlates perfectly with human review.
In our labs at MotionPoint we have been testing various AI-based QE models. These vary in approach, from semantic similarity assessments that use vector embeddings produced by domain-adapted embedding models, to Large Language Models that perform MQM-style evaluations.
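As a rough illustration of the semantic similarity approach, the sketch below compares a source sentence and its translation by the cosine similarity of their embedding vectors. It assumes a multilingual embedding model has already mapped both texts into the same vector space. The vectors shown are made-up toy values, not real model output, and this is not a description of MotionPoint's actual models.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


# Toy embeddings standing in for the output of a hypothetical embedding model.
source_embedding = [0.12, 0.85, 0.31, 0.40]
translation_embedding = [0.10, 0.80, 0.35, 0.42]
print(cosine_similarity(source_embedding, translation_embedding))
```

A value close to 1 suggests the translation preserves the meaning of the source, while a low value flags a possible accuracy problem. Similarity alone says little about fluency, style, or terminology.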
We are finding these models perform best at the extremes. They identify very bad translations well and rarely incorrectly tag translations as such. Similarly, they identify very good ones well and rarely incorrectly tag translations as excellent. They are less discerning in the middle ground, but the ability to perform well at the extremes nonetheless makes them very useful in a couple of ways, despite being far from perfect.
Say you are generally risk-tolerant on translation quality and just need translations to be basically accurate and intelligible. If you have a QE model that can identify the really bad translations (i.e., inaccurate or unintelligible) with high accuracy, you can handle these by exception, for example by having a linguist post-edit them. This is a more cost-effective way to manage the risk of bad translations than sending everything for post-edit "just in case."
On the other end of the spectrum, say you have a high quality bar and machine translation post-editing (MTPE) is the default workflow for everything. To the extent that QE models can accurately identify at least some of the machine translations that are good enough to use as-is, post-editing can be skipped for those translations, with significant cost savings and a low risk of sacrificing quality.
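The two workflows can be sketched as simple routing rules on a QE score. This sketch assumes the QE model returns a score between 0 and 1, where higher means better. The threshold values and function names are hypothetical and would need to be calibrated against human review.

```python
from enum import Enum


class Action(Enum):
    PUBLISH = "publish as-is"
    POST_EDIT = "send to linguist post-edit"


def route_risk_tolerant(qe_score, bad_below=0.2):
    """Default is to publish; only confidently bad translations are post-edited."""
    return Action.POST_EDIT if qe_score < bad_below else Action.PUBLISH


def route_high_bar(qe_score, good_above=0.9):
    """Default is MTPE; only confidently good translations skip post-editing."""
    return Action.PUBLISH if qe_score >= good_above else Action.POST_EDIT


for score in (0.1, 0.5, 0.95):
    print(score, route_risk_tolerant(score).value, "|", route_high_bar(score).value)
```

Both rules act only at an extreme of the score range, where the QE model is most reliable. Translations in the uncertain middle ground simply follow each workflow's default.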
In our lab testing, we are finding that in datasets where human evaluation indicated that about 40% of translations did not need post-editing, the QE models could reliably and accurately identify about 15% of the translations as not needing post-edit.
This is valuable if it can save 15% of post-editing costs in MTPE workflows. As the technology improves and the 15% gets closer to 30% or more, the cost savings will become extremely compelling.
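The arithmetic behind these figures can be made explicit with a small sketch. It assumes the 15% is measured against all translations, that post-editing cost is the same for every segment, and that every translation the QE model flags as good really is good. The function names, segment count, and per-segment cost are hypothetical.

```python
def share_of_skippable_found(skip_rate, truly_skippable_rate):
    """Fraction of the translations that need no post-edit which QE identifies."""
    return skip_rate / truly_skippable_rate


def post_edit_savings(total_segments, cost_per_segment, skip_rate):
    """Post-editing cost avoided when a fraction of segments skips post-edit."""
    return total_segments * skip_rate * cost_per_segment


# Figures from the lab results above: 15% identified out of 40% skippable.
print(share_of_skippable_found(0.15, 0.40))
# Hypothetical workload: 10,000 segments at 0.50 per segment to post-edit.
print(post_edit_savings(10_000, 0.50, 0.15))
print(post_edit_savings(10_000, 0.50, 0.30))
```

Under these assumptions the QE models capture 15% / 40% = 37.5% of the translations that could skip post-editing, so more than half of the potential savings remains to be captured as the models improve.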
At MotionPoint we have 20 years of history providing top-quality translations to our customers, and we are investing in the technology platform necessary to deliver superior business outcomes by intelligently balancing the cost versus risk tradeoff in translation.
Our platform investment focuses on delivering the best-quality machine translation engines, along with smart workflow options that use automated quality estimation tools to convert that quality into lower cost and lower risk for our customers.