Currently transferring the whole thing to Python. The main goal is to build a module for versatile grammatical disambiguation, i.e. one disambiguation module per grammatical type, suitable for many languages without major or complicated code changes. The flexibility offered by Python’s dictionaries and its various list types seems better suited to this project.
Disambiguation is an essential process in machine translation. Sometimes, however, it seems more rational and logical to leave an ambiguity in the translation. This is the case when (i) there is an ambiguous word in the sentence to be translated; and (ii) the context does not provide an objective reason to choose one of the two readings. It seems that in this case, the best translation is the one that leaves the ambiguity intact.
Let’s take an example. Consider the following French sentence: ‘Son palais était en feu.’ The French word ‘palais’ is ambiguous, because it corresponds to two different words in English and in Corsican: palace (palazzu) and palate (palatu).
Thus, we have 3 possibilities of translation:
- His palate was on fire
- His palace was on fire
- His palace/palate was on fire
The third translation, in my opinion, is better, because it points out that the context is insufficient to choose one of the two alternatives.
Consider now, on the one hand, the following sentences: ‘Il avait mangé du piment fort. Son palais était en feu.’ Now the context provides an objective reason to choose one of the two readings. This yields the following translation: He had eaten some hot pepper. His palate was on fire.
On the other hand, consider the following sentences: ‘Les ennemis du prince avaient lancé des engins incendiaires. Son palais était en feu.’ Here we also have an objective reason, this time to choose the other alternative. The translation is then: The prince’s enemies had thrown incendiary devices. His palace was on fire.
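The three cases above can be sketched in a few lines of Python. This is only a minimal illustration, not the actual engine: the `SENSES` table, its cue words and the helper names `tokenize` and `translate_ambiguous` are all assumptions made for the example. The idea is simply that a sense is chosen only when the context supports exactly one reading, and that otherwise the ambiguity is kept.

```python
# Hypothetical sense inventory with context cue words (illustration only).
SENSES = {
    "palais": {
        "palate": {"mangé", "piment", "goût", "bouche"},
        "palace": {"prince", "roi", "ennemis", "incendiaires"},
    }
}


def tokenize(text):
    """Very rough tokenizer: strip full stops and apostrophes, split on spaces."""
    return text.replace(".", " ").replace("’", " ").split()


def translate_ambiguous(word, context_tokens):
    """Return one sense if the context supports exactly one, else all senses."""
    cues = SENSES[word]
    context = {token.lower() for token in context_tokens}
    supported = [sense for sense, cue_words in cues.items() if cue_words & context]
    if len(supported) == 1:
        return supported[0]
    # No objective reason to choose (or conflicting cues): keep the ambiguity.
    return "/".join(sorted(cues))


print(translate_ambiguous("palais", tokenize("Son palais était en feu.")))
```

With no cue in the context, the function returns both senses joined by a slash, which mirrors the third translation in the list above.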
The main challenge is to generalize grammatical word disambiguation to several languages. Creating a separate grammatical disambiguation module for each language appears to be a long and arduous task, and this seems to be the main difficulty. But if a module written for a given language can be generalized to several other languages, this could be an important advance in the field of rule-based machine translation (for which ‘machine translation that simulates human reasoning’ seems to me a more appropriate term).
We can describe the problem more precisely. We have about 100 grammatical categories for a given language. We also have, to fix ideas, about 300 ambiguous grammatical types, e.g. adverb or preposition, singular masculine noun or singular masculine adjective, etc. The problem is to describe an algorithm that removes the ambiguity and determines the correct grammatical type according to the context.
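One possible way to organize this in Python is a registry that maps each ambiguous type (a set of candidate grammatical types) to its own rule. The sketch below is hypothetical: the names `RULES`, `rule_for` and `disambiguate`, the token format and the toy heuristic are all assumptions made for illustration, not the module’s real design.

```python
# Registry: ambiguity class (frozenset of candidate types) -> rule function.
RULES = {}


def rule_for(*types):
    """Register a disambiguation rule for one ambiguity class."""
    def decorator(func):
        RULES[frozenset(types)] = func
        return func
    return decorator


@rule_for("adverb", "preposition")
def adverb_or_preposition(tokens, i):
    # Toy heuristic (an assumption): a preposition needs a following word.
    return "preposition" if i + 1 < len(tokens) else "adverb"


def disambiguate(tokens):
    """Resolve each ambiguous token with the rule registered for its class."""
    resolved = []
    for i, token in enumerate(tokens):
        candidates = frozenset(token["types"])
        if len(candidates) == 1:
            resolved.append(next(iter(candidates)))
        elif candidates in RULES:
            resolved.append(RULES[candidates](tokens, i))
        else:
            # No rule registered for this class: leave the ambiguity visible.
            resolved.append("|".join(sorted(candidates)))
    return resolved


sentence = [
    {"form": "il", "types": {"pronoun"}},
    {"form": "passe", "types": {"verb"}},
    {"form": "devant", "types": {"adverb", "preposition"}},
]
print(disambiguate(sentence))
```

With this layout, supporting another language would mostly mean supplying another set of rules for the same keys, while the dispatch code stays unchanged. Whether real rules can be shared across languages is exactly the open question.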
Now rewriting the complete module of disambiguation by grammatical type, so that it can be used and adapted to other languages (Italian in the first place). It remains to be seen if this can be done.
The question of choosing the best system to solve the problems posed by word disambiguation in translation seems to be linked to the AGI control problem (how to prevent an AGI from eventually turning out to be harmful to its creators). It seems that when we have the choice between several methods of developing an AI, it is wiser to choose the one that allows better control of the AGI. As far as machine translation is concerned, we should in this regard prefer the method that emulates human reasoning and produces a response that can be broken down, step by step, into the reasoning that leads to it. This makes it possible not only to determine accurately the cause of an error, but also to remedy it. The problem is not limited to machine translation: grammatical disambiguation also matters for natural language understanding and for disambiguation according to context, even when no translation is involved.
Grammatical disambiguation, i.e. determining whether ‘maintenant’ is an adverb (now) or the gerundive (maintaining) of the verb ‘maintenir’, seems to be the crucial issue in the choice between the rule-based model and the statistical model for machine translation. This problem is widespread and seems to concern all languages. For French, it concerns about 1 word out of 7. Effective grammatical disambiguation is difficult to implement. The advantage of the statistical method is that the same method can be generalized and used for several languages. In the rule-based model, the grammatical disambiguation module must be rewritten for each language, which generates considerable complexity and requires very significant development time. Therefore, a rule-based method for grammatical disambiguation that can easily be applied to several languages would be of great interest. This seems to be the main difficulty that rule-based machine translation has to overcome.
But if we want an artificial intelligence that does not merely provide a (mostly accurate) answer without really being able to explain its reasoning, but is truly able to emulate human reasoning and to justify and describe, step by step, the reasoning that leads to its answer, then it is worth the effort.
Let us comment on the remaining errors encountered in the above open test:
- French ‘carrière’ remains undisambiguated: either carriera (career) or cava (quarry): two occurrences
- ‘de’: French ‘de’ is perhaps the most difficult word to translate into another language, due to its general polymorphism
- ‘national-socialiste’: missing vocabulary
- l’ within ‘l’empeche’: pronoun error
- it should be pointed out that ‘Etats-Unis’ remains untranslated because it is misspelled in the source text, with an initial E instead of É
The result is 1 - (5/169) = 97.04%. Note that the ambiguous French word ‘partie’ (‘durant la première partie’, during the first part) is correctly disambiguated into parti (part), instead of partita (game, match).
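For the record, the score can be recomputed as follows. The tally of the five errors is an assumption about how they are counted (two for ‘carrière’, one each for ‘de’, ‘national-socialiste’ and l’, with ‘Etats-Unis’ treated as a source typo and left out), and 169 is taken to be the number of scored units in the test.

```python
# Assumed tally of the five errors listed above (illustrative reading).
errors = {"carrière": 2, "de": 1, "national-socialiste": 1, "l’": 1}
total_units = 169  # assumed: number of scored units in the open test


def accuracy(error_count, total):
    """Percentage of units translated without error."""
    return 100 * (1 - error_count / total)


score = accuracy(sum(errors.values()), total_units)
print(f"{score:.2f}%")  # 97.04%
```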
It seems that an average result of 95% is currently being consolidated, and that an average result of 96% is a target that should be achievable within a year.
Let us consider a hard case for word sense disambiguation in the context of French to Corsican MT; the same goes for French to English MT. It concerns French words such as ‘accomplit’, ‘affaiblit’, ‘affranchit’, ‘alourdit’, ‘amortit’. The corresponding verbs ‘accomplir’ (to fulfill, to accomplish), ‘affaiblir’ (to weaken), ‘affranchir’ (to free), ‘alourdir’ (to burden), ‘amortir’ (to damp) have the same form in the simple present and the simple past in the third person singular: respectively ‘accomplit’, ‘affaiblit’, ‘affranchit’, ‘alourdit’, ‘amortit’. The upshot is that a sentence such as ‘Il affaiblit sa position.’ can be translated either as he weakens his position or as he weakened his position. If the context is unambiguous with regard to the tense of the discourse, the correct tense can be chosen. But in the absence of informative context, it would be appropriate to let the ambiguity stand.
It should be pointed out that such verbs are not rare. A more complete list includes: accomplit, affaiblit, affranchit, alourdit, amortit, anéantit, anoblit, aplatit, arrondit, assombrit, bannit, bâtit, blanchit, blondit, démolit, éblouit, emplit, enfouit, enhardit, enlaidit, ennoblit, envahit, épaissit, étourdit, exclut, franchit, glapit, investit, jaunit, jouit, munit, noircit, obéit, obscurcit, occit, périt, réagit, régit, réjouit, remplit, répartit, resplendit, rétrécit, rit, rougit, saisit, sévit, surgit.
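A minimal sketch of how this could be handled, under stated assumptions: the set below is only a subset of the forms listed above, the `ENGLISH` table is hypothetical, and the tense hint is assumed to come from unambiguous neighbouring verbs (one possible source of context among others).

```python
# Forms shared by the present and the simple past (subset of the list above).
PRESENT_OR_PAST = {"accomplit", "affaiblit", "affranchit", "alourdit", "amortit"}

# Hypothetical English renderings: (present, simple past).
ENGLISH = {"affaiblit": ("weakens", "weakened")}


def is_tense_ambiguous(form):
    return form in PRESENT_OR_PAST


def infer_tense(neighbour_tenses):
    """Return the single tense used by unambiguous neighbouring verbs, else None."""
    distinct = set(neighbour_tenses)
    return distinct.pop() if len(distinct) == 1 else None


def render_verb(form, tense_hint=None):
    """Pick the tense given by the context, or keep both when there is no hint."""
    present, past = ENGLISH[form]
    if tense_hint == "present":
        return present
    if tense_hint == "past":
        return past
    return f"{present}/{past}"


print("he", render_verb("affaiblit", infer_tense([])), "his position")
```

When the surrounding verbs disagree or there are none, `infer_tense` returns `None` and both tenses are kept in the output.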
Let us focus on grammatical type disambiguation, which is a subproblem of word disambiguation. General grammatical types are: verbs, nouns, adjectives, adverbs, prepositions, gerundives, etc. But for the purposes of grammatical type disambiguation, more precision is in order. Instances of grammatical types are then: masculine singular noun, feminine singular noun, masculine plural noun, feminine plural noun, masculine singular adjective, feminine singular adjective, masculine plural adjective, feminine plural adjective, adverb, preposition, gerundive, etc. An ambiguity can occur between two different grammatical types in this finer sense, for example between adverb and gerundive. In French, this is notably the case for ‘devant’ and ‘maintenant’: ‘devant’ can be either an adverb (in front) or a gerundive (from the verb ‘devoir’, to have to). Similarly, ‘maintenant’ can be either an adverb (now) or a gerundive (from the verb ‘maintenir’, to maintain). Both words are therefore ambiguous with regard to their grammatical type. In English, depending on the relevant grammatical type, ‘devant’ is rendered either as having to or as in front, and ‘maintenant’ either as now or as maintaining.
In order to disambiguate the French words ‘devant’ and ‘maintenant’, rule-based MT needs a disambiguation module that is able to determine whether each of them is an adverb or a gerundive. (For the sake of clarity, we leave aside the fact that ‘devant’ can also be a preposition.)
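To give an idea of what such a rule might look like, here is a deliberately naive sketch. The cues used (a preceding ‘en’, a following determiner for ‘maintenant’, a following infinitive for ‘devant’) are illustrative assumptions, not the rules of the actual module, and the function name `classify` is hypothetical.

```python
DETERMINERS = {"le", "la", "les", "un", "une", "des", "son", "sa", "ses"}
INFINITIVE_ENDINGS = ("er", "ir", "re")


def classify(tokens, i):
    """Return 'gerundive' or 'adverb' for tokens[i] ('maintenant' or 'devant')."""
    word = tokens[i].lower()
    previous = tokens[i - 1].lower() if i > 0 else ""
    following = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
    if previous == "en":
        return "gerundive"  # e.g. 'en maintenant ...'
    if word == "maintenant" and following in DETERMINERS:
        return "gerundive"  # e.g. 'maintenant la pression'
    if word == "devant" and following.endswith(INFINITIVE_ENDINGS):
        return "gerundive"  # e.g. 'devant partir'
    return "adverb"


print(classify("Il part maintenant".split(), 2))
```

Such surface cues fail quickly: in ‘Maintenant la pluie tombe’, the determiner cue would wrongly yield a gerundive. This is part of why effective grammatical disambiguation is difficult to implement.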
The issue of pair reversal goes as follows: suppose you have a given translation pair A>B that translates language A into language B; how hard is it to build the reverse pair B>A? The current instance of this problem is: given the French>Italian pair, how hard is it to build an Italian>French pair? To state it more explicitly: could AI help build a reverse pair in a very short time? Arguably, if AI could build such a reverse pair quickly, it would be some kind of breakthrough. We do not expect 100% efficiency and accuracy in this reversal process, but if some 98% or 99% were possible, it would do the job. For AI within MT is not only targeted at translating; it is also targeted at constructing translation engines.
Just tested pair reversal from French-Italian to Italian-French. Some 70% can be done automatically, but a big issue remains: the disambiguation of Italian words. The disambiguation engine seems to be the crux of the matter here. The upshot is that the entire disambiguation module needs to be rewritten, in order (if possible) to be language-related. The new module must be more AI-focused. If successful, it could open the path to the (somewhat) fast construction of a multi-language ecosystem with a rule-based MT architecture.
In a series of posts, we will try to define what is required of an AGI (Artificial General Intelligence) in order to reach the level of superintelligence in MT (machine translation). (All this is highly speculative, but we shall give it a try.)
One of the difficulties that arise in machine translation relates to the translation of expressions. This leads us to one of the required skills of a superintelligence: the ability to identify an expression within a text in a given language and then to translate it into another language. Expressions are of different types: verbal, nominal, adjectival, adverbial, etc. To fix ideas, we can focus here on verbal expressions. Take, for example, the French expression ‘couper les cheveux en quatre’ (literally, to cut hairs in four, i.e. to split hairs), which translates into Corsican as either castrà i falchetti (literally, to chastise the hawks) or castrà i cucchi (literally, to chastise the cuckoos). In order to properly translate such an expression, a superintelligence must be able to:
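The reason disambiguation becomes the bottleneck can be seen on a toy lexicon. The sketch below uses a handful of French>Italian entries chosen for illustration (they are not the project’s data, and the names `FR_IT` and `reverse_lexicon` are hypothetical). Inverting the dictionary is mechanical, but Italian words that several French entries pointed to, such as ‘piano’, come out ambiguous and need rules of their own.

```python
from collections import defaultdict

# Toy French>Italian lexicon (illustrative entries only).
FR_IT = {
    "maison": ["casa"],
    "étage": ["piano"],
    "plan": ["piano"],
    "doucement": ["piano"],
    "palais": ["palazzo", "palato"],
}


def reverse_lexicon(forward):
    """Invert a source>target lexicon into target>source."""
    backward = defaultdict(list)
    for source, targets in forward.items():
        for target in targets:
            backward[target].append(source)
    return dict(backward)


IT_FR = reverse_lexicon(FR_IT)
# Entries with more than one French candidate cannot be reversed automatically.
needs_disambiguation = sorted(w for w, sources in IT_FR.items() if len(sources) > 1)
automatic_share = 1 - len(needs_disambiguation) / len(IT_FR)

print(needs_disambiguation, automatic_share)
```

Note that the French ambiguity of ‘palais’ disappears in the reverse direction, while a new Italian ambiguity appears: the sets of ambiguous words in the two directions do not coincide, so the disambiguation rules cannot simply be carried over.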
- identify ‘couper les cheveux en quatre’ as a verbal expression in a French corpus
- identify castrà i falchetti as a verbal expression within a Corsican corpus
- associate the two expressions as the proper translation of each other
It appears here that such an aptitude falls under the scope of AGI (Artificial general intelligence).
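Once the two expressions have been associated, applying the pair is the easy part. The following sketch assumes a hypothetical expression table (`EXPRESSIONS`) keyed by token sequences and performs a longest-match lookup; it does not handle inflected forms such as ‘il coupe les cheveux en quatre’, and it says nothing about the hard part, which is discovering and pairing the expressions in the first place.

```python
# Hypothetical French>Corsican expression table; keys are token tuples.
EXPRESSIONS = {
    ("couper", "les", "cheveux", "en", "quatre"): "castrà i falchetti",
}
MAX_LEN = max(len(key) for key in EXPRESSIONS)


def find_expressions(tokens):
    """Return (start, end, translation) for each expression found, longest match first."""
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, down to two tokens.
        for length in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + length])
            if candidate in EXPRESSIONS:
                matches.append((i, i + length, EXPRESSIONS[candidate]))
                i += length - 1  # skip past the matched expression
                break
        i += 1
    return matches


print(find_expressions("Il ne faut pas couper les cheveux en quatre".split()))
```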