
How rule-based and statistical machine translation can help each other

Here are a few suggestions on how rule-based and statistical machine translation can help each other:

Rule-based and statistical machine translation are often contrasted and compared, but it would be an oversimplification to conclude that one is better than the other. From a more objective standpoint, each method has its strengths and weaknesses. Let us investigate how the two could be made to collaborate so as to combine their respective strengths.

In the case of an endangered language, the lack of good-quality corpora has often been pointed out. One way for rule-based and statistical machine translation to collaborate would be to use rule-based translation to build a better-quality corpus for statistical machine translation.

Suppose we begin with statistical machine translation software that scores 50% on average for French-to-Corsican translation.

Let us sketch the process of creating these better corpora, taking the example of the French-Corsican diglossic pair (Corsican being classified by UNESCO as a definitely endangered language). At present we lack a quality French-Corsican corpus or, to put it more accurately, the corpus at our disposal is a low-quality one. The idea would be to use rule-based machine translation to create a much better corpus for use with statistical machine translation.

Let us now sketch the different steps of this collaborative process:
(i) Create a French-Corsican corpus with the help of rule-based machine translation: if the software performs at 90% on average, then the corpus would be on average 90% reliable. With appropriate training, statistical MT should then perform at, say, 80% on average (to be compared with the previous 50%).
(ii) From this French-Corsican corpus, other corpus pairs can be created, such as Italian-Corsican, English-Corsican, etc., since French-Italian, English-Italian, etc. corpora of excellent quality already exist. The performance gain should then extend to these other language pairs.
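
Steps (i) and (ii) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: `build_synthetic_corpus`, `pivot_corpus` and `toy_rule_based_translate` are hypothetical helpers, and the toy translator merely tags its input instead of producing real Corsican. The pivot step also assumes that the same French sentences appear in both corpora, which a real pipeline would have to arrange through sentence alignment.

```python
from typing import Callable, List, Tuple

SentencePair = Tuple[str, str]


def build_synthetic_corpus(
    french_sentences: List[str],
    rule_based_translate: Callable[[str], str],
) -> List[SentencePair]:
    """Step (i): pair each French sentence with its rule-based translation."""
    return [(fr, rule_based_translate(fr)) for fr in french_sentences]


def pivot_corpus(
    fr_co: List[SentencePair],
    fr_other: List[SentencePair],
) -> List[SentencePair]:
    """Step (ii): join two corpora on their shared French side.

    fr_co holds (French, Corsican) pairs and fr_other holds
    (French, other language) pairs; the result holds
    (other language, Corsican) pairs.
    """
    corsican_by_french = dict(fr_co)
    return [
        (other, corsican_by_french[fr])
        for fr, other in fr_other
        if fr in corsican_by_french
    ]


def toy_rule_based_translate(sentence: str) -> str:
    """Hypothetical stand-in for a rule-based engine (NOT real Corsican)."""
    return "<co> " + sentence


# Step (i): synthetic French-Corsican corpus.
fr_co = build_synthetic_corpus(["Bonjour", "Merci"], toy_rule_based_translate)

# Step (ii): reuse an existing French-Italian corpus to get Italian-Corsican.
fr_it = [("Bonjour", "Buongiorno"), ("Au revoir", "Arrivederci")]
it_co = pivot_corpus(fr_co, fr_it)
print(it_co)
```

Only sentences present on the French side of both corpora survive the pivot, so the size of the derived corpus depends on how much the two French sides overlap.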

With the help of this process, we are finally in a position to combine the strengths of the two complementary approaches to MT: on the one hand, rule-based MT is able to translate with good accuracy even in the absence of corpora; on the other hand, statistical machine translation is able to handle a great many language pairs successfully and quickly. To sum up, as the Corsican proverb says: una mani lava l’altra (one hand washes the other).

Why rule-based translation is (presently) best suited to endangered languages

Here are some arguments in favor of choosing rule-based translation for the machine translation of endangered languages (this relates to the philosophy of language policy):

  • at present, no reliable corpus exists between the given endangered language and other languages
  • endangered languages are often polynomic, i.e. several main variants of the language coexist: it is important to preserve these variants, since (i) they are a feature of diversity and (ii) they are an inherent feature of the given endangered language, and to distinguish between them. In addition, a translation should never contain a mixture of these variants. This also complicates the process of building a proper corpus, since the scarce existing corpus is made up of different variants of the language.
  • in the absence of an adequate corpus, statistical machine translation is not able to provide quality translation of the given endangered language (whereas it succeeds with widely used languages for which excellent corpora are available): arguably, providing low-quality translation (although the attempt is commendable) could harm these endangered languages, which are by definition vulnerable, since people could use and spread the resulting low-quality translations. On those grounds, given this vulnerability, it could be argued that a translation quality of at least 80% is needed for any pair involving an endangered language.
  • in addition, it should be pointed out that endangered languages are usually in a ‘diglossic’ relationship with another language: what is needed as a matter of priority is translation between the two languages of this pair
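
Two of the points above, keeping variants apart and requiring a minimum quality, can be sketched in Python. This is only an illustration under stated assumptions: the variant labels `variant_a` and `variant_b` are placeholders, the helper names are hypothetical, and the quality score is assumed to be a number between 0 and 1, since the way quality is measured is not specified here.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (source sentence, target sentence, variant label of the target)
TaggedPair = Tuple[str, str, str]


def split_by_variant(
    tagged_pairs: List[TaggedPair],
) -> Dict[str, List[Tuple[str, str]]]:
    """Build one corpus per variant so that training data is never mixed."""
    corpora = defaultdict(list)
    for source, target, variant in tagged_pairs:
        corpora[variant].append((source, target))
    return dict(corpora)


def meets_minimum_quality(score: float, minimum: float = 0.80) -> bool:
    """Check a quality score against the 80% floor argued for above.

    The score is assumed to lie in [0, 1]; how it is computed is left open.
    """
    return score >= minimum


# Placeholder data: the sentences and labels are illustrative only.
tagged = [
    ("s1", "t1", "variant_a"),
    ("s2", "t2", "variant_b"),
    ("s3", "t3", "variant_a"),
]
per_variant = split_by_variant(tagged)
print(sorted(per_variant))
print(meets_minimum_quality(0.9), meets_minimum_quality(0.5))
```

Splitting the corpus before any training means that each variant gets its own, smaller corpus, which is one more reason why scarce mixed-variant data is hard to use directly.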

(to be continued)