In a series of posts, we shall try to define what is required of an AGI (Artificial General Intelligence) in order to reach the level of superintelligence in MT (machine translation). (All this is highly speculative, but we shall give it a try.) One of the difficulties that arise in machine translation relates to the translation of expressions. This leads us to mention one of the required skills of a superintelligence: the ability to identify an expression within a text in a given language and then to translate it into another language. Let us mention that expressions are of different types: verbal, nominal, adjectival, adverbial, … To fix ideas, we can focus here on verbal expressions. For example, the French expression ‘couper les cheveux en quatre’ (literally, to cut hairs in four, i.e. to split hairs) translates into Corsican as either castrà i falchetti (literally, to castrate the hawks) or castrà i cucchi (literally, to castrate the cuckoos). In order to properly translate such an expression, a superintelligence must be able to:
identify ‘couper les cheveux en quatre’ as a verbal expression in a French corpus
identify castrà i falchetti as a verbal expression within a Corsican corpus
associate the two expressions as the proper translation of each other
It appears here that such an aptitude falls within the scope of AGI (Artificial General Intelligence).
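The three steps above can be sketched minimally in code. The sketch below assumes a hand-built table of expression pairs (a hypothetical data structure, not an actual system); a real module would also need morphological matching to recognise inflected forms such as ‘coupait les cheveux en quatre’.

```python
# Hand-built table pairing a French verbal expression with its
# Corsican equivalents (hypothetical illustration, not a real system).
EXPRESSION_PAIRS = {
    "couper les cheveux en quatre": ["castrà i falchetti", "castrà i cucchi"],
}

def find_expression(sentence):
    """Return the Corsican equivalents if a known French expression occurs."""
    for expr, translations in EXPRESSION_PAIRS.items():
        if expr in sentence:
            return translations
    return None

print(find_expression("Il ne faut pas couper les cheveux en quatre."))
# ['castrà i falchetti', 'castrà i cucchi']
```

Even this toy version shows why the task is hard: simple substring matching fails as soon as the expression is inflected or interrupted by other words.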
Here is a short follow-up on the ‘issue of pair reversal’ regarding language pairs. It seems some 90% accuracy could be achieved in this reversal process. What is lacking here is an adequate handling of disambiguation. Let us focus on one example. It is patent in the above example, where Italian ‘venti’ is ambiguous between a masculine plural noun (venti, winds) and a numeral (twenty, Corsican vinti). But this specific ambiguity between grammatical types does not exist in French. The upshot is that disambiguation between grammatical types is specific to a given source language, at least in part. If this difficulty could be overcome, a rough 95% of the automatic process would finally be achieved.
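A toy rule can illustrate what such a disambiguation module does. The trigger words and the noun test below are hypothetical simplifications chosen for illustration, not actual Corsican grammar rules.

```python
# Toy disambiguation of Italian 'venti' when translating into Corsican.
# Hypothetical rule: after an article ('i', 'dei') read it as the
# plural noun 'winds' (Corsican 'venti'); when it precedes a noun,
# read it as the numeral 'twenty' (Corsican 'vinti').
def disambiguate_venti(prev_word, next_is_noun):
    if prev_word in ("i", "dei"):
        return "venti"   # plural noun reading: winds
    if next_is_noun:
        return "vinti"   # numeral reading: twenty
    return "venti"       # fallback reading

print(disambiguate_venti("i", False))   # venti  ('i venti' = the winds)
print(disambiguate_venti(None, True))   # vinti  ('venti libri' = twenty books)
```

The point of the sketch is structural: since French never merges these two grammatical types into one form, such a rule belongs to the Italian-source module specifically.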
(Obviously the current translation is not of an acceptable quality for publication: some 90% at least is in order…)
Anyway, successfully handling disambiguation in many languages appears to be the crux of the matter here. If AI could successfully build such disambiguation modules, it seems rule-based translation, as a fast-growing ecosystem, would be feasible.
Let us consider superintelligence with regard to machine translation. To fix ideas, we can propose a rough definition: it consists of a machine with the ability to translate with 99% (or above) accuracy from any one of the 8000 human languages to another. It seems relevant here to mention the present 8000 human languages, including some 4000 or 5000 languages which are at risk of extinction before the end of the twenty-first century. It could also relevantly include some extinct languages which are reasonably well described and meet the conditions for building rule-based translation. But arguably, this definition needs some additional criteria. What appears to be the most important is the ability to self-improve its performance. In practice, this could be done by reading or hearing texts. The superintelligent translation machine should be able to acquire new vocabulary from its readings or hearings: not only words, but also locutions (noun locutions, adjective locutions, adverbial locutions, verbal locutions, etc.). It should also be able to acquire new sentence structures from its readings and enrich its database of grammatical sentence structures. It should also be able to grow its database of word meanings for ambiguous words and instantly build the associated disambiguation rules. In addition, it should be capable of detecting and implementing specific grammatical structures. It seems superintelligence will be reached when the superintelligent translation machine is able to perform all that without any human help.
Also relevant to this discussion is the fact, previously argued, that rule-based translation is better suited to the translation of endangered languages than statistical translation. Why? Because large-scale corpora do not exist for endangered languages. From the above definition of SMT, it follows that rule-based translation is also best suited to SMT, since the latter massively includes endangered languages (but arguably, statistical MT could still be used for translating major languages into one another).
Let us now speculate on how this path to superintelligent translation could be followed. We can mention here:
a quantitative scenario: (i) acquire, first, the ability to translate very accurately, say, 100 languages; (ii) develop, second, the ability to self-improve; (iii) extend, third, the translation ability to the whole set of 8000 human languages.
alternatively, there could be a qualitative scenario: (i) acquire, first, the ability to translate the 8000 languages somewhat accurately (the accuracy could vary from language to language, especially for rare endangered languages); (ii) suggest, second, improvements to vocabulary, locutions, sentence structures, disambiguation rules, etc., to be verified and validated by humans; (iii) acquire, third, the ability to self-improve by reading texts or hearing conversations.
it is worth mentioning a third alternative, which would consist of a hybrid scenario, i.e. a mix of quantitative and qualitative improvements. This will be our preferred scenario.
But we should provide more details on how these steps could be achieved. To fix ideas, let us focus on the word self-improvement module: it allows the superintelligent translation machine to extend its vocabulary in any language. This could be accomplished by reading or hearing new texts in any language. When facing a new word, the superintelligent machine translation (SMT, for short) system should be able to translate it instantly into the 8000 other languages and add it to its vocabulary database.
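The loop just described can be sketched as follows. All names here are hypothetical, and the external lookup is a stub standing in for a real search over outside sources.

```python
# Sketch of the word self-improvement module (hypothetical names).
# When a source word lacks a translation, the system queries an
# external lookup and records the answer in its vocabulary database.
vocabulary = {"fr": {"bonjour": {"co": "bonghjornu"}}}

def translate_word(word, src, tgt, external_lookup):
    entry = vocabulary.setdefault(src, {}).get(word, {})
    if tgt in entry:
        return entry[tgt]
    found = external_lookup(word, src, tgt)  # e.g. a web search
    if found is not None:
        vocabulary[src].setdefault(word, {})[tgt] = found  # self-improve
    return found

# A stub external source standing in for a real one:
stub = lambda word, src, tgt: {"merci": "grazie"}.get(word)
print(translate_word("merci", "fr", "co", stub))  # grazie
print(vocabulary["fr"]["merci"])                  # {'co': 'grazie'}
```

The essential feature is the last step: the looked-up word is written back into the database, so the same gap never has to be filled twice.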
To give another example, another module would be the locution self-improvement module: it allows the superintelligent translation machine to extend its knowledge of locutions in any language.
Also relevant to this topic is the following question: could SMT be achieved without AGI (general AI)? We shall address this question later.
Now performing a new open test, with the first 100 (more or less) words of the ‘article of the day’ from the French Wikipedia: we get 94.02% = 1 – (8/134). Several errors (5) result from lack of vocabulary. There are also some grammatical errors (da instead of par or pà, in diducendu instead of diducendu ni) and lastly, the disambiguation of polaccu (Polish), which is erroneous. The disambiguation of ‘partie’ is correct, since it can be translated into parti (part) or into partita (gone, party).
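The score quoted in these open tests is simply one minus the error ratio; a one-line helper makes the computation explicit (the 94.02% above is this value truncated to two decimals).

```python
# Accuracy measure used in the open tests: 1 - errors/words.
def accuracy(errors, words):
    return 1 - errors / words

print(f"{accuracy(8, 134):.4%}")  # 94.0299%
```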
Iterated open tests show that an average of 50% of errors result from lack of vocabulary. This type of error should be easy to tackle, inasmuch as it does not concern rare words. Reasonably, a target of 96% or 97% should be attainable on this basis.
It is kind of a minor breakthrough. The translation of French ‘en même temps que’ (at the same time as) is somewhat hard, in that it can take two different forms: either à tempu à or à tempu ch’è, depending on the context. The above examples tackle this sort of difficulty (although not exhaustively).
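The choice between the two forms can be pictured with a toy rule. The clause/noun-phrase criterion below is a hypothetical simplification of the actual Corsican grammar, used only to show the shape of such a rule.

```python
# Hypothetical context rule for French 'en même temps que':
# before a clause, use "à tempu ch'è"; before a plain noun phrase,
# use "à tempu à". A real system would need parsing to decide this.
def translate_en_meme_temps_que(complement_is_clause):
    return "à tempu ch'è" if complement_is_clause else "à tempu à"

print(translate_en_meme_temps_que(True))   # à tempu ch'è
print(translate_en_meme_temps_que(False))  # à tempu à
```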
There exist priority translation pairs, from the standpoint of endangered languages. For a given endangered language, a priority pair is the most useful pair for the current users of that language. For example, French-Corsican is a priority pair, with respect to other pairs such as Gallurese-Corsican, English-Corsican or Spanish-Corsican. In this context, any endangered language has its own priority pair. For example, a priority pair for Sardinian Gallurese is Italian-Gallurese. In the same way, a priority pair for Sardinian Sassarese is Italian-Sassarese. Analogously, a priority pair for the Sicilian language is Italian-Sicilian.
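The notion can be recorded as a simple mapping, built from the examples just listed (illustrative data only).

```python
# Priority pairs listed above: endangered language -> most useful
# source language for its current users (illustrative data).
PRIORITY_SOURCE = {
    "Corsican": "French",
    "Gallurese": "Italian",
    "Sassarese": "Italian",
    "Sicilian": "Italian",
}

def priority_pair(endangered):
    return f"{PRIORITY_SOURCE[endangered]}-{endangered}"

print(priority_pair("Gallurese"))  # Italian-Gallurese
```

A development effort aimed at endangered languages could then be ordered by working through such a table pair by pair.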
There is an ongoing debate on whether AI software should go open source or not (see for example Bostrom’s paper ‘Strategic Implications of Openness in AI Development’). Our current concern is whether MT software should go open source or not. Prima facie, for safety reasons, it would be better to make MT code public, thus allowing anyone to check the code and find possible errors. Such openness would notably be a defense against the AI control problem, in short, the risk that a superintelligence could harm humans. From this standpoint, it seems that public code is much better than private code. Regarding rule-based translation (the distinction between statistical and rule-based MT is not as clear-cut as one could think at first glance, since some rules could be applied on a statistical basis), openness would allow people to check the resulting translation step by step. Better transparency should be attained accordingly.
Another advantage of publishing the code would be to allow anyone to improve it and extend its capabilities, notably by adding new modules targeted at new languages (the count of human languages being around 7000).
To begin with, let us state the 1% problem for machine translation: it seems some 99% accuracy in machine translation could be attainable, but the remaining 1% (1% is just a given number, somewhat arbitrarily chosen, but useful to fix ideas) may be hard or even very hard to reach. Now a question arises: is progress on the remaining 1% problem attainable without general-purpose AI? Prima facie, the answer is no. For it seems that progress on the remaining 1% problem requires some abilities such as being able to find the translation of a given word in external databases. For it will sometimes occur that the 1% left untranslated is due to the presence of a new word, for instance one very recently created, and thus lacking from the MT system’s internal dictionary. In order to find the relevant translated word, the machine should be able to search for and find it in external databases (say, the web), just as a human would do. So, solving the remaining 1% problem requires – among other capabilities – such abilities, which are part of a general-purpose AI.
Artificial general intelligence (AGI) is prima facie a somewhat abstract notion that needs to be refined and made more explicit. Problems encountered in implementing machine translation systems can help make this notion more accurate and concrete. The ability to find the translation of a given word in external databases is just one of the abilities needed to solve the remaining 1% problem. We shall mention some other abilities of the same type later.