What is required from Artificial General Intelligence with regard to Machine Translation?

In this series of posts, we will try to define what is required of an AGI (Artificial General Intelligence) in order to reach the level of superintelligence in MT (machine translation). (All this is highly speculative, but we shall give it a try.)
One of the difficulties that arise in machine translation relates to the translation of expressions. This points to one of the skills required of a superintelligence: the ability to identify an expression within a text in a given language and then to translate it into another language. Expressions are of different types: verbal, nominal, adjectival, adverbial, and so on. To fix ideas, we can focus here on verbal expressions. For example, the French expression ‘couper les cheveux en quatre’ (literally, to cut the hairs in four, i.e. to split hairs) translates into Corsican as either castrà i falchetti (literally, to chastise the hawks) or castrà i cucchi (literally, to chastise the cuckoos). In order to translate such an expression properly, a superintelligence must be able to:

  • identify ‘couper les cheveux en quatre’ as a verbal expression in a French corpus
  • identify castrà i falchetti as a verbal expression within a Corsican corpus
  • associate the two expressions as the proper translation of each other

It appears that such an aptitude falls within the scope of AGI (Artificial General Intelligence).
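
The three steps above can be sketched, in a very reduced form, as a lookup in a phrase table. This is only an illustration: the table `EXPRESSIONS` and the function `find_expressions` are hypothetical names, the table holds just the one example given above, and matching is done on the exact infinitive form, which is a strong simplifying assumption.

```python
# Hypothetical phrase table: French verbal expression -> Corsican equivalents.
# The single entry is the example from the text; the structure is an assumption.
EXPRESSIONS = {
    "couper les cheveux en quatre": ["castrà i falchetti", "castrà i cucchi"],
}


def find_expressions(text, table=EXPRESSIONS):
    """Return (expression, translations) pairs found verbatim in the text."""
    lowered = text.lower()
    return [(expr, table[expr]) for expr in table if expr in lowered]


matches = find_expressions("Il ne faut pas couper les cheveux en quatre.")
for expression, translations in matches:
    print(expression, "->", " / ".join(translations))
```

A conjugated form such as ‘il coupe les cheveux en quatre’ would not be found by this sketch. Recognising it would require lemmatisation, and discovering expressions that are not yet in the table is precisely the harder skill discussed above.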

Superintelligent machine translation (updated)

Let us consider superintelligence with regard to machine translation. To fix ideas, we can propose a rough definition: it consists of a machine with the ability to translate with 99% (or above) accuracy from any one of the 8000 human languages to any other. It seems relevant here to mention the 8000 human languages that exist at present, including some 4000 or 5000 languages which are at risk of extinction before the end of the 21st century. The definition could also relevantly include some extinct languages that are reasonably well described and meet the conditions for building rule-based translation. But arguably, this definition needs some additional criteria. The most important appears to be the ability of the machine to improve its own performance. In practice, this could be done by reading or hearing texts. The superintelligent translation machine should be able to acquire new vocabulary from what it reads or hears: not only words, but also locutions (noun locutions, adjective locutions, adverbial locutions, verbal locutions, etc.). It should also be able to acquire new sentence structures from its reading and enrich its database of grammatical sentence structures. It should likewise be able to grow its database of word meanings for ambiguous words and instantly build the associated disambiguation rules. In addition, it should be capable of detecting and implementing specific grammatical structures.
It seems that superintelligence will be reached when the translation machine is able to perform all of this without any human help.

Also relevant to this discussion is the fact, previously argued, that rule-based translation is better suited to the translation of endangered languages than statistical translation. Why? Because large-scale corpora do not exist for endangered languages. From the above definition of superintelligent machine translation (SMT), it follows that rule-based translation is also best suited to SMT, since SMT massively includes endangered languages (though arguably, statistical MT could still be used for translating major languages into one another).

Let us now speculate on how superintelligent translation might be reached. We can mention here:

  • a quantitative scenario: (i) first, acquire the ability to translate, say, 100 languages very accurately; (ii) second, develop the ability to self-improve; (iii) third, extend the translation ability to the whole set of 8000 human languages.
  • alternatively, a qualitative scenario: (i) first, acquire the ability to translate the 8000 languages somewhat accurately (the accuracy could vary from language to language, especially for rare endangered languages); (ii) second, suggest improvements to vocabulary, locutions, sentence structures, disambiguation rules, etc. that are verified and validated by humans; (iii) third, acquire the ability to self-improve by reading texts or hearing conversations.
  • a third alternative, a hybrid scenario, i.e. a mix of quantitative and qualitative improvements. This is our preferred scenario.

But we should provide more details on how these steps could be achieved. To fix ideas, let us focus on the word self-improvement module, which allows the superintelligent machine translation system to extend its vocabulary in any language. This could be accomplished by reading or hearing new texts in any language. When facing a new word, the superintelligent machine translation system (SMT, for short) should be able to translate it instantly into all the other languages and add it to its vocabulary database.
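
As a minimal sketch of the first half of such a module (spotting words that are not yet known and storing them), one could write something like the following. The names `vocabulary`, `unknown_words` and `learn` are hypothetical, and the toy vocabulary is an assumption made for the example. Translating the new word into the other languages, which is the hard part, is not attempted here.

```python
import re

# Toy vocabulary database: language code -> set of known words (illustrative only).
vocabulary = {"fr": {"ce", "fait", "est", "unique"}}


def unknown_words(text, lang, vocab=vocabulary):
    """Return the words of `text` not yet known for `lang`, in order of appearance.

    Creates an empty entry for `lang` if the language is not in the database yet.
    """
    known = vocab.setdefault(lang, set())
    found = []
    # [^\W\d_]+ matches runs of letters, including accented ones.
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        if word not in known and word not in found:
            found.append(word)
    return found


def learn(text, lang, vocab=vocabulary):
    """Add the unknown words of `text` to the vocabulary and return them."""
    new = unknown_words(text, lang, vocab)
    vocab[lang].update(new)
    return new


new_words = learn("Ce fait est rare", "fr")
print(new_words)  # ['rare']
```

Calling `learn` a second time on the same sentence returns an empty list, since the word has been added to the database in the meantime.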

To give another example, the locution self-improvement module would allow the superintelligent machine translation system to extend its knowledge of locutions in any language.

Also relevant to this topic is the following question: could SMT be achieved without AGI (general AI)? We shall address this question later.


Disambiguating ‘nombre de’

Let us consider here the disambiguation of ‘nombre de’, which, depending on the case, can be:

  • a singular masculine noun followed by a preposition: in this case, ‘nombre de’ translates to numaru di (number of)
  • an indefinite pronoun: in this case, French ‘nombre de’ translates into Corsican as bon parechji (many, a great many)
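
To make the two cases concrete, here is a minimal rule-based sketch. The rule itself is an assumption made for illustration and is not stated above: ‘nombre de’ is read as a noun when a determiner immediately precedes it, and as an indefinite otherwise. The list `DETERMINERS` and the function `translate_nombre_de` are hypothetical names.

```python
import re

# Assumed determiners signalling the noun reading (illustrative, not exhaustive).
DETERMINERS = {"le", "un", "ce", "du", "au", "son", "leur", "notre", "votre", "quel"}


def translate_nombre_de(sentence):
    """Return the Corsican rendering chosen for each 'nombre de' in the sentence."""
    words = re.findall(r"[^\W\d_]+", sentence.lower())
    results = []
    for i in range(len(words) - 1):
        if words[i] == "nombre" and words[i + 1] == "de":
            previous = words[i - 1] if i > 0 else None
            # Assumed rule: determiner before 'nombre' -> noun, else indefinite.
            results.append("numaru di" if previous in DETERMINERS else "bon parechji")
    return results


print(translate_nombre_de("Le nombre de visiteurs augmente."))  # ['numaru di']
print(translate_nombre_de("Nombre de visiteurs sont venus."))   # ['bon parechji']
```

This sketch is deliberately naive: an adjective between the determiner and the noun (‘un grand nombre de’) or the elided form ‘nombre d’’ would need additional rules.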

Semantic disambiguation of French ‘femme’: in the mud, gold is still shining

Depending on the context, the French word ‘femme’ can be translated into Corsican:

  • either into donna (woman)
  • or into moglia (wife)

The above sample still contains many vocabulary and grammatical disambiguation errors (easy to medium difficulty), but it successfully handles the semantic disambiguation (hard) of ‘femme’, two instances of which are properly translated into moglia (wife). As the Corsican proverb says, in a cianga l’oru luci sempri (in the mud, gold is still shining).

French samples are from the French corpora of the University of Leipzig.

Four consecutive ambiguous words

Translating the sentence ‘ce fait est unique’ is not as easy as it might seem at first glance. Indeed, it is made up of four consecutive ambiguous words:

  • ‘ce’: ‘ssu (demonstrative pronoun, this) or ciò (it, relative pronoun)
  • ‘fait’: fattu (masculine singular noun, fact), fattu (past participle, done) or faci (does, third person singular of the verb to do in the present tense)
  • ‘est’: estu (masculine singular noun, east) or hè (is, third person singular of the verb to be in the present tense)
  • ‘unique’: unicu (masculine singular adjective, unique in English) or unica (feminine singular adjective, unique in English)
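
To show how quickly the ambiguity multiplies, the candidates listed above can be enumerated and filtered. This is only a sketch under stated assumptions: the part-of-speech tags are a simplification introduced for the example, hè is taken as the Corsican form for ‘is’, and the single pattern used for filtering (determiner, noun, verb, masculine adjective) stands in for what would really be a much larger set of grammatical rules. The names `CANDIDATES`, `all_readings` and `plausible` are hypothetical.

```python
from itertools import product

# Candidate readings for each French word: (Corsican form, simplified tag).
# Forms are taken from the list above; tags are an assumption for this sketch.
CANDIDATES = [
    ("ce", [("‘ssu", "DET"), ("ciò", "PRON")]),
    ("fait", [("fattu", "NOUN"), ("fattu", "PART"), ("faci", "VERB")]),
    ("est", [("estu", "NOUN"), ("hè", "VERB")]),
    ("unique", [("unicu", "ADJ-M"), ("unica", "ADJ-F")]),
]


def all_readings(candidates=CANDIDATES):
    """Return every combination of one reading per word."""
    return list(product(*(options for _, options in candidates)))


def plausible(reading):
    """Keep only the determiner + noun + verb + masculine adjective pattern.

    The masculine adjective is required because the noun fattu is masculine.
    """
    tags = tuple(tag for _, tag in reading)
    return tags == ("DET", "NOUN", "VERB", "ADJ-M")


readings = all_readings()
kept = [" ".join(form for form, _ in r) for r in readings if plausible(r)]
print(len(readings))  # 24
print(kept)
```

Four short words already give 2 × 3 × 2 × 2 = 24 combinations, of which this toy pattern retains a single one.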