Tag Archives: rule-base translation

Why it’s worth it to engage in rule-based translation

Rule-based translation is difficult to implement. The main difficulty encountered is taking into account the groups of words, so as to be on a par with statistics-based translation. The main problems in this regard are (i) polymorphic disambiguation; and (ii) building a fair typology of grammatical types. But once these steps begin to be mastered, there are many advantages. What seems essential here is that with the same piece of software, both machine translation and text analysis can be carried out. Among the modules that are easy to implement are the following:

  • lemmatizer
  • part-of-speech tagger
  • singularizer
  • pluralizer
  • grammar checker
  • type extractor: a module that allows you to extract words from a text according to their grammatical category

For the implementation of rule-based translation provides the machine with some inherent understanding of the text, in the same way that a human being does. To put it in a nutshell, it is better artificial intelligence.

Finally, other modules, more advanced, seem possible (to be confirmed).

Word sense disambiguation: a hard case

Let us consider a hard case for word sense disambiguation, in the context of French to Corsican MT. But the same goes for French to English MT. It relates to French words such as: ‘accomplit’, ‘affaiblit’, ‘affranchit’, ‘alourdit’, ‘amortit’. The corresponding verbs ‘accomplir’ (to fulfill, to accomplish), ‘affaiblir’ (to weaken), ‘affranchir’ (to free), ‘alourdir’ (to burden), ‘amortir’ (to damp) have the same word for simple present and simple past at the third person singular: respectively ‘accomplit’, ‘affaiblit’, ‘affranchit’, ‘alourdit’, ‘amortit’. The upshot is that a single sentence such as: ‘Il affaiblit sa position.’ can be translated either into he weakens his position or into he weakened his position. If the context is unambiguous with regard to the sence of the discourse, the correct tense can be adequately chosen. But in the lack of informative context, it would be opportune to let the ambiguity prevail.

It should be pointed out that any such verbs are not rare. A more complete list includes: accomplit, affaiblit, affranchit, alourdit, amortit, anéantit, anoblit, aplatit, arrondit, assombrit, bannit, bâtit, blanchit, blondit, démolit, éblouit, emplit, enfouit, enhardit, enlaidit, ennoblit, envahit, épaissit, étourdit, exclut, franchit, glapit, investit, jaunit, jouit, munit, noircit, obéit, obscurcit, occit, périt, réagit, régit, réjouit, remplit, répartit, resplendit, rétrécit, rit, rougit, rouvrit, saisit, sévit, surgit.


Why rule-based translation is (presently) best suited to endangered languages

Here are some arguments in favor of the choice of rule-based translation concerning machine translation of endangered languages (it relates to the philosophy of language policy):

  • there does not exist at present time a reliable corpus between the given endangered language and other languages
  • endangered languages are often polynomic, i.e. there exist some main variants of the language that coexist: it is important to preserve them since (i) it is a feature of diversity and (ii) it is an inherent feature of the given endangered language, and to distinguish between these variants. In addition, any translation should not contain a mix up of these variants. This also complicates the process of building a proper corpus, since the scarce existing corpus is made up of different variants of the language.
  • in the lack of an adequate corpus, statistical machine translation is not able to provide quality translation of the given endangered language (while on the other hand it succeeds with common languages where excellent corpora are available): arguably, providing low quality translation (although the attempt is meritable) could harm these endangered languages that are by definition vulnerable, since people could use and diffuse the resulting low quality translation. On those grounds, given this vulnerability, it could be argued that a minimum 80% quality translation is needed for a given pair involving an endangered language.
  • in addition, it should be pointed out that endangered languages are usually in a ‘diglossic’ relationship with another language: what is needed as a matter of priority is to provide translation between the two languages of this pair

(to be continued)