Evaluation of machine translation: why not self-evaluation?

Evaluation of machine translation is usually done with external tools and metrics (to cite some instances: ARPA, BLEU, METEOR, LEPOR, …). But let us investigate the idea of self-evaluation, for it seems that the software itself is capable of forming an accurate idea of its own possible errors.

In the above example, human evaluation yields a score of 1 – 5/88 = 94.31%. Contrast this with self-evaluation, which sums its possible errors (unknown words and disambiguation errors) and thus yields a score of 92.05%, based on 7 hypothesized errors. In this case, self-evaluation computes the maximum error rate. But even here there are some false positives: ‘apellation’ is left untranslated because it is unrecognized; indeed, the correct spelling is ‘appellation’. To sum up: the software identifies an unknown word, leaves it untranslated, and counts it as a possible error.
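To make the arithmetic explicit, here is a minimal sketch of the score computation, assuming the word count (88) and the error counts (5 for the human judge, 7 for self-evaluation) from the example above; the function name is purely illustrative.

```python
def evaluation_score(error_count: int, word_count: int) -> float:
    """Score as a percentage: 1 - errors / words (illustrative helper)."""
    return (1 - error_count / word_count) * 100

# Figures taken from the example above (88 words in the source text).
human_score = evaluation_score(5, 88)  # 5 errors found by the human judge, ~94.3 %
self_score = evaluation_score(7, 88)   # 7 hypothesized errors, ~92.05 %: the maximum error rate

print(f"human evaluation: {human_score:.2f} %")
print(f"self-evaluation:  {self_score:.2f} %")
```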

Let us sketch what could be the pros and cons of MT self-evaluation. To begin with, the pros:

  • it could provide a detailed taxonomy of possible errors: unknown words, unresolved grammatical disambiguation, unresolved semantic disambiguation, … (see the sketch after this list)
  • it could identify precisely the suspected errors
  • evaluation would be very fast and inexpensive
  • self-evaluation would work with any text or corpus
  • self-evaluation could pave the way to further self-improvement and self-correction of errors
  • it could prove reasonably reliable
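As a purely hypothetical illustration of the first two points, a self-evaluation report might be structured as follows; the class names, the error categories and the token position are assumptions made for the sketch, not taken from any existing MT system.

```python
from dataclasses import dataclass, field
from enum import Enum

class SuspectedErrorKind(Enum):
    # Illustrative taxonomy mirroring the bullet list above.
    UNKNOWN_WORD = "unknown word"
    GRAMMATICAL_AMBIGUITY = "unresolved grammatical disambiguation"
    SEMANTIC_AMBIGUITY = "unresolved semantic disambiguation"

@dataclass
class SuspectedError:
    kind: SuspectedErrorKind
    token_index: int     # position of the suspect token in the source text
    source_token: str

@dataclass
class SelfEvaluationReport:
    word_count: int
    suspected_errors: list[SuspectedError] = field(default_factory=list)

    def score(self) -> float:
        """Maximum-error-rate score, as in the worked example above."""
        return (1 - len(self.suspected_errors) / self.word_count) * 100

# The 'apellation' case: an unknown (in fact misspelled) word is left
# untranslated and counted as a possible error (a false positive).
report = SelfEvaluationReport(word_count=88)
report.suspected_errors.append(
    SuspectedError(SuspectedErrorKind.UNKNOWN_WORD, token_index=12,
                   source_token="apellation")
)
print(f"{len(report.suspected_errors)} suspected error(s), score {report.score():.2f} %")
```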

And the cons:

  • MT may be unaware of some types of errors, e.g. errors related to fixed expressions and idioms
  • MT self-evaluation could be especially blind to grammatical errors
  • it would sometimes count foreign words that should remain untranslated as unknown words
  • MT would be unaware of erroneous disambiguations