I am currently porting the whole project to Python. The main goal is to build a versatile grammatical disambiguation module, i.e. one disambiguation module per grammatical type, suitable for many languages without major or complicated code changes. The flexibility offered by Python's dictionaries and its various list types seems better suited to this project.
Disambiguation is an essential process in machine translation. Sometimes, however, it seems more rational and logical to leave an ambiguity in the translation. This is the case when (i) there is an ambiguous word in the sentence to be translated; and (ii) the context does not provide an objective reason to choose one of the two senses. It seems that in this case, the best translation is the one that leaves the ambiguity intact.
Let’s take an example. Consider the following French sentence: ‘Son palais était en feu.’ The French word ‘palais’ is ambiguous, because it corresponds to two different words in both English and Corsican: palace (palazzu) and palate (palatu).
Thus, we have 3 possibilities of translation:
- His palate was on fire
- His palace was on fire
- His palace/palate was on fire
The third translation, in my opinion, is better, because it points out that the context is insufficient to choose one of the two alternatives.
Consider now, on the one hand, the following sentence: ‘Il avait mangé du piment fort. Son palais était en feu.’ Now the context provides an objective motivation to choose one of the two senses. This yields the following translation: He had eaten some hot pepper. His palate was on fire.
On the other hand, consider the following sentence: ‘Les ennemis du prince avaient lancé des engins incendiaires. Son palais était en feu.’ Here too we have an objective reason, this time to choose the other alternative. The translation is then: The prince’s enemies had thrown incendiary devices. His palace was on fire.
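The behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the cue-word sets and the function name `translate_palais` are hypothetical, and a real module would need far richer context analysis than a keyword lookup.

```python
import re

# Hypothetical cue words for each sense; purely illustrative.
SENSE_CUES = {
    "palace": {"prince", "roi", "ennemis", "incendiaires"},
    "palate": {"mangé", "piment", "goût", "bouche"},
}

def translate_palais(text):
    """Return 'palace', 'palate', or 'palace/palate' for a French text."""
    words = set(re.findall(r"\w+", text.lower()))
    matches = [sense for sense, cues in SENSE_CUES.items() if words & cues]
    if len(matches) == 1:
        return matches[0]
    # No cue, or cues for both senses: leave the ambiguity intact.
    return "palace/palate"

print(translate_palais("Son palais était en feu."))
print(translate_palais("Il avait mangé du piment fort. Son palais était en feu."))
```

The point of the design is the last line of the function: when the context gives no objective reason to choose, the ambiguous rendering is returned rather than an arbitrary guess.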
The main difficulty here seems to lie in the adaptation of the grammatical disambiguation module. Indeed, for the French language, such a module performs disambiguation with respect to about 100 categories. The number of pairs (or 3-tuples, 4-tuples, etc.) of disambiguation, for French, is about 250. The question is: when we change languages, how many categories of n-tuples of disambiguation does this result in? In particular, when one switches from French to Italian, does this result in a big change in the categories to be disambiguated?
Let’s take an example, with a particular category of words to disambiguate. One such category is for example AQfs/Vsing3present (feminine singular adjective or verb in the 3rd person singular present tense). A word in Italian that belongs to this type is ‘stanca’. So we have both uses:
- ‘è stanca’ (she is tired): AQfs
- ‘stanca il cavallo’ (it tires the horse): Vsing3present
In French, we don’t have this kind of disambiguation category directly because the category concerned is broader than that: it includes at least the 1st person singular of the present. Thus we have the word ‘sèche’, which belongs to this type of disambiguation category:
- ‘la feuille est sèche’ (the leaf is dry): AQfs
- ‘je sèche mes cheveux’ (I dry my hair): Vsing1present
- ‘il sèche sa chemise’ (he dries his shirt): Vsing3present
Of course, the code that allows the disambiguation of AQfs/Vsing1present/Vsing3present should also allow the derivation of the disambiguation of AQfs/Vsing3present. But this gives an idea of the kind of problems that arise and the adaptation needed.
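One way to picture this derivation is to treat each disambiguation category as a set of tags and to write rules that only narrow that set. The sketch below rests on assumptions: the trigger-word sets and the function `disambiguate` are hypothetical and are not the module's actual rules. Because the rules filter whatever candidates they receive, the same code serves the three-way French category and the two-way Italian one.

```python
# Ambiguity categories represented as sets of tags (tag names from the text).
FR_CLASS = frozenset({"AQfs", "Vsing1present", "Vsing3present"})
IT_CLASS = frozenset({"AQfs", "Vsing3present"})

# Hypothetical trigger words, for illustration only.
FIRST_PERSON = {"je", "io"}
THIRD_PERSON = {"il", "elle", "lui", "lei"}
COPULAS = {"est", "è"}

def disambiguate(prev_word, candidates):
    """Narrow a set of candidate tags using the previous word (toy rules)."""
    remaining = set(candidates)
    if prev_word in FIRST_PERSON:
        remaining &= {"Vsing1present"}
    elif prev_word in THIRD_PERSON:
        remaining &= {"Vsing3present"}
    elif prev_word in COPULAS:
        remaining &= {"AQfs"}
    # If a rule eliminated every candidate, it did not apply to this category.
    return remaining or set(candidates)

print(disambiguate("je", FR_CLASS))   # 'je sèche mes cheveux'
print(disambiguate("è", IT_CLASS))    # 'è stanca'
print(IT_CLASS <= FR_CLASS)           # the Italian category is a subset
```

For a sentence-initial ‘stanca il cavallo’ there is no previous word, so this toy version returns both tags unchanged; a usable rule set would also have to look at the following words.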
If the types of disambiguation are very different from one language to another, it will be necessary to have a disambiguation module which is capable of adapting to many new types of disambiguation and which is therefore very flexible. This appears to be a considerable difficulty for the creation of an ecosystem. It seems that Apertium, faced with this difficulty, has chosen a statistical module as a solution for its ecosystem. However, whether such a flexible module, adaptable without difficulty from one language to another, is feasible in the context of rule-based MT remains an open question.
The challenge is above all that of generalizing grammatical word disambiguation to several languages. Creating a grammatical word-disambiguation module for each language appears to be a long and arduous task. This seems to be the main difficulty. But if a module specific to a given language can be generalized to several other languages, this could be an important advance in the field of rule-based machine translation (‘MT that simulates human reasoning’ seems to me a more appropriate term).
We can describe the problem more precisely. We have about 100 grammatical categories for a given language. We also have about 300 ambiguous grammatical types – to fix ideas – which are: e.g., adverb or preposition, singular masculine noun or singular masculine adjective, etc. The problem is to describe an algorithm to remove the ambiguity and determine the corresponding grammatical type according to the context.
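To make the shape of such an algorithm concrete, here is a minimal sketch of a dispatch table that maps each ambiguous type (a set of tags) to its own resolver function. Everything named here is an assumption for illustration: the tags `ADV` and `PREP`, the `register` decorator, and the toy heuristic are hypothetical and not taken from the actual module.

```python
from typing import Callable, Dict, FrozenSet, List

Resolver = Callable[[List[str], int], str]
RESOLVERS: Dict[FrozenSet[str], Resolver] = {}

def register(*tags):
    """Register a resolver for one ambiguous type, i.e. one set of tags."""
    def decorator(func):
        RESOLVERS[frozenset(tags)] = func
        return func
    return decorator

@register("ADV", "PREP")
def adverb_or_preposition(tokens, i):
    # Toy heuristic: a preposition needs a following word to govern.
    is_last = i == len(tokens) - 1
    return "ADV" if is_last else "PREP"

def resolve(tags, tokens, i):
    """Resolve the ambiguity of tokens[i], or keep it if no rule is known."""
    resolver = RESOLVERS.get(frozenset(tags))
    if resolver is None:
        return "/".join(sorted(tags))
    return resolver(tokens, i)

print(resolve({"ADV", "PREP"}, ["il", "arrive", "après"], 2))
print(resolve({"ADV", "PREP"}, ["après", "le", "repas"], 0))
```

Adapting to a new language would then mean registering resolvers for that language's ambiguous types, while `resolve` itself stays unchanged.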
I am now rewriting the complete module of disambiguation by grammatical type, so that it can be used and adapted to other languages (Italian in the first place). It remains to be seen whether this can be done.
Let us briefly recall the problem: translating ‘I love you’ might sound trivial, but it’s not. In fact, ‘ti amu’ is not the best translation. The best translation is ‘ti tengu caru’ when addressed to a male person, or ‘ti tengu cara’ when addressed to a female person. Hence the proposed preliminary translation ‘ti tengu caru/cara’. Such a rough translation requires further disambiguation, but on what precise grounds?
Let us look at the issue from an analytical perspective. It appears that we need to assign a reference to the pronoun ‘te’ (you, ti). The latter could be identified according to the context, depending on whether the person ‘te’ refers to is male or female. At this stage, it appears that it is better to consider that the personal object pronoun has an inherent gender: masculine or feminine. This gender does not affect the pronoun itself which remains ‘te’ (you, ti) independently of the gender, but it does have an effect on the words that depend on it, i.e. the adjective caru/cara in Corsican, in the locution ti tengu caru/cara. The upshot is: in this case, ‘te’ (you, ti) is a personal object pronoun, masculine or feminine, whose inherent ambiguity can be solved according to the context.
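The idea of a pronoun carrying an inherent gender that only shows up on dependent words can be sketched as follows. The class `ObjectPronoun` and the helper names are hypothetical, chosen for this illustration only.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ObjectPronoun:
    surface: str                  # written form, identical for both genders
    gender: Optional[str] = None  # 'm', 'f', or None when the context is silent

def agree(masculine, feminine, gender):
    """Pick the adjective form that agrees with the pronoun's gender."""
    if gender == "m":
        return masculine
    if gender == "f":
        return feminine
    return f"{masculine}/{feminine}"  # gender unknown: keep both forms

def i_love_you(pronoun):
    return f"{pronoun.surface} tengu {agree('caru', 'cara', pronoun.gender)}"

print(i_love_you(ObjectPronoun("ti")))
print(i_love_you(ObjectPronoun("ti", gender="f")))
```

The pronoun's surface form never changes; only the adjective that depends on it does, which is why the gender is stored on the pronoun and consumed by the agreement step.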
Let’s take another look at polymorphic disambiguation. We shall consider the French word sequence ‘nombre de’. Its translation into Corsican (the same goes for English and other languages) cannot always be the same, because ‘nombre de’ can be translated in two different ways. In the sequence ‘mais nombre de poissons sont longs’ (but many fish are long), ‘nombre de’ is an indefinite determiner: it translates as bon parechji (many). On the other hand, in the sequence ‘mais le nombre de poissons est supérieur à dix’ (but the number of fish is greater than ten), ‘nombre de’ is a common noun followed by the preposition ‘de’: it is translated by numaru di (number of). Statistical MT usually does better than human-like (rule-based) MT at polymorphic disambiguation (I tested both sentences with DeepL and Google Translate, and both resolved the ambiguity correctly), but it turns out that human-like (rule-based) MT is also capable of handling it.
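A rule of this kind can be sketched in Python under one simplifying assumption, which is mine and not necessarily the module's: ‘nombre’ is treated as a noun when the word just before it is a determiner, and as part of the indefinite expression otherwise. The determiner list and the function name are hypothetical.

```python
# Hypothetical, deliberately short list of French determiners.
DETERMINERS = {"le", "un", "ce", "son", "leur", "quel"}

def translate_nombre_de(tokens, i):
    """Translate 'nombre de', where tokens[i] is 'nombre' and tokens[i + 1] is 'de'."""
    prev = tokens[i - 1].lower() if i > 0 else None
    if prev in DETERMINERS:
        return "numaru di"      # noun followed by a preposition
    return "bon parechji"       # indefinite expression: 'many'

print(translate_nombre_de("mais nombre de poissons sont longs".split(), 1))
print(translate_nombre_de("mais le nombre de poissons est supérieur à dix".split(), 2))
```

This toy rule would fail on ‘un grand nombre de’, where an adjective separates the determiner from the noun, so a real rule would have to look further to the left.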
Let us comment on the remaining errors encountered in the above open test:
- French ‘carrière’ remains undisambiguated between carriera (career) and cava (quarry): two occurrences
- ‘de’: French ‘de’ is perhaps the most difficult word to translate into another language, due to its general polymorphism
- ‘national-socialiste’: missing vocabulary
- l’ within ‘l’empeche’: pronoun error
- note that ‘Etats-Unis’ remains untranslated because it is misspelled in the source, with an initial E instead of É
The result is 1 – (5/169) = 97.04%. Note that the ambiguous French word ‘partie’ (‘durant la première partie’, during the first part) is correctly disambiguated into parti (part) rather than partita (game, match).
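For reference, the score above is simply one minus the error rate. In the sketch below the function name `accuracy` is hypothetical, and reading 169 as the total number of scored units in the test text is an assumption.

```python
def accuracy(total_units, errors):
    """Share of units translated correctly."""
    return 1 - errors / total_units

print(f"{accuracy(169, 5):.2%}")  # 97.04%
```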
It seems that an average result of 95% is currently being consolidated, and that an average result of 96% is a target that should be achievable within a year.
Let us consider a hard case for word sense disambiguation, in the context of French to Corsican MT; the same goes for French to English MT. It relates to French words such as: ‘accomplit’, ‘affaiblit’, ‘affranchit’, ‘alourdit’, ‘amortit’. The corresponding verbs ‘accomplir’ (to fulfill, to accomplish), ‘affaiblir’ (to weaken), ‘affranchir’ (to free), ‘alourdir’ (to burden), ‘amortir’ (to damp) have the same form for the simple present and the simple past in the third person singular: respectively ‘accomplit’, ‘affaiblit’, ‘affranchit’, ‘alourdit’, ‘amortit’. The upshot is that a single sentence such as ‘Il affaiblit sa position.’ can be translated either as he weakens his position or as he weakened his position. If the context makes the tense of the discourse unambiguous, the correct tense can be chosen. But in the absence of informative context, it would be appropriate to let the ambiguity stand.
It should be pointed out that such verbs are not rare. A more complete list includes: accomplit, affaiblit, affranchit, alourdit, amortit, anéantit, anoblit, aplatit, arrondit, assombrit, bannit, bâtit, blanchit, blondit, démolit, éblouit, emplit, enfouit, enhardit, enlaidit, ennoblit, envahit, épaissit, étourdit, exclut, franchit, glapit, investit, jaunit, jouit, munit, noircit, obéit, obscurcit, occit, périt, réagit, régit, réjouit, remplit, répartit, resplendit, rétrécit, rit, rougit, rouvrit, saisit, sévit, surgit.
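The policy of choosing a tense only when the context settles it can be sketched as follows. The lexicon entries, the `tense` parameter, and the function name are hypothetical; how the tense of the discourse would actually be detected is left out.

```python
# Hypothetical mini-lexicon: French form -> (English present, English past).
AMBIGUOUS_FORMS = {
    "affaiblit": ("weakens", "weakened"),
    "accomplit": ("accomplishes", "accomplished"),
    "affranchit": ("frees", "freed"),
}

def translate_verb(form, tense=None):
    """Translate an ambiguous third-person form, given an optional tense hint."""
    present, past = AMBIGUOUS_FORMS[form]
    if tense == "present":
        return present
    if tense == "past":
        return past
    return f"{present}/{past}"  # no informative context: keep both readings

print("he " + translate_verb("affaiblit") + " his position")
print("he " + translate_verb("affaiblit", tense="past") + " his position")
```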
Let us focus on grammatical type disambiguation, which is a subproblem of word disambiguation. General grammatical types are: verbs, nouns, adjectives, adverbs, prepositions, gerundives, etc. For the purposes of grammatical type disambiguation, however, finer distinctions are needed. The relevant types are then: masculine singular noun, feminine singular noun, masculine plural noun, feminine plural noun, masculine singular adjective, feminine singular adjective, masculine plural adjective, feminine plural adjective, adverb, preposition, gerundive, etc. An ambiguity can arise between two different grammatical types of this finer kind, for example between adverb and gerundive. In French, this is notably the case for ‘devant’ and ‘maintenant’. ‘Devant’ can be either an adverb (in front) or a gerundive (from the verb ‘devoir’, to have to). Similarly, ‘maintenant’ can be either an adverb (now) or a gerundive (from the verb ‘maintenir’, to maintain). Both words are thus ambiguous with regard to their grammatical type. In English, depending on the relevant grammatical type, ‘devant’ is ambiguous between having to and in front, and ‘maintenant’ is ambiguous between now and maintaining.
In order to disambiguate the French words ‘devant’ and ‘maintenant’, rule-based MT needs a disambiguation module that can determine whether each occurrence is an adverb or a gerundive. (For the sake of clarity, this leaves aside the fact that ‘devant’ can also be a preposition.)
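As a rough illustration of what such a module has to decide, here is a toy classifier based on two surface cues. Both cues are assumptions made for this sketch: a preceding ‘en’ signals a gerundive, and ‘devant’ followed by something that looks like an infinitive signals a gerundive of ‘devoir’. The prepositional reading of ‘devant’ is ignored here, as in the text.

```python
def classify(tokens, i):
    """Guess whether tokens[i] ('devant' or 'maintenant') is an adverb or a gerundive."""
    word = tokens[i].lower()
    prev = tokens[i - 1].lower() if i > 0 else ""
    nxt = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
    if prev == "en":
        return "gerundive"   # 'en maintenant le cap'
    if word == "devant" and nxt.endswith(("er", "ir", "re")):
        return "gerundive"   # 'devant partir': crude infinitive check
    return "adverb"

print(classify("il part maintenant".split(), 2))
print(classify("en maintenant le cap".split(), 1))
print(classify("devant partir tôt".split(), 0))
```

The infinitive check is purely orthographic and would misfire on nouns ending in ‘-er’, ‘-ir’ or ‘-re’; a real module would consult the grammatical type of the neighbouring word instead.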
Let us consider here the disambiguation of ‘nombre de’, which can be, depending on the case:
- a singular masculine noun followed by a preposition: in this case, ‘nombre de’ translates as numaru di (number of)
- an indefinite pronoun: in this case, French ‘nombre de’ translates into Corsican as bon parechji (many, a great many)