Dictionary = Corpus?

As far as machine translation is concerned, the best option seems to be to combine the strengths of the two approaches, rule-based and statistics-based. If the two approaches could be made to converge, the benefit could be great. Let us try to define what could allow such a convergence, based on the two-sided grammatical approach, and illustrate it with an example.
To begin with, u soli sittimbrinu = ‘le soleil de septembre’ (the sun of September). In Corsican language, sittimbrinu is a masculine singular adjective that means ‘de septembre’ (of September). In French, ‘de septembre’ is–from an analytic perspective–a preposition followed by a common masculine singular noun. But according to the two-sided analysis ‘de septembre’ (of September) is also–from a synthetic perspective–a masculine singular adjective. This double nature, according to this two-sided analysis of ‘de septembre’, allows in fact the alignment of ‘de septembre’ (of September) with sittimbrinu.
More generally, if we define words or groups of words according to the two-sided grammatical analysis in the dictionary, we also have an alignment tool, which can be used for a translation system based on statistics, in the same way as a corpus. Thus, if it is sufficiently provided, the dictionary is also a corpus, and even more, an aligned corpus.
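As a minimal sketch of this idea, assume a hypothetical entry format in which each dictionary entry stores both its analytic and its synthetic type. The names `Entry`, `align`, the `concept` key and the tag strings are illustrative assumptions, not the format of the actual system. Entries of two languages can then be aligned on their synthetic type:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One dictionary entry under the two-sided analysis (hypothetical format)."""
    lang: str        # language code, e.g. "fr" or "co"
    form: str        # surface form, possibly several words
    analytic: tuple  # word-by-word grammatical types
    synthetic: str   # type of the group taken as a whole
    concept: str     # shared key linking translations of the same meaning


ENTRIES = [
    Entry("fr", "de septembre", ("PREP", "NOUN_ms"), "ADJ_ms", "of_september"),
    Entry("co", "sittimbrinu", ("ADJ_ms",), "ADJ_ms", "of_september"),
]


def align(entries, source_lang, target_lang):
    """Pair source and target entries sharing a concept and a synthetic type."""
    pairs = []
    for src in entries:
        if src.lang != source_lang:
            continue
        for tgt in entries:
            if (tgt.lang == target_lang
                    and tgt.concept == src.concept
                    and tgt.synthetic == src.synthetic):
                pairs.append((src.form, tgt.form))
    return pairs


print(align(ENTRIES, "fr", "co"))
```

Here ‘de septembre’ and sittimbrinu differ in their analytic types but share the synthetic type, which is what makes the pairing possible.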

Grammatical taxonomy again: the case of prepositions

Let’s look at the translation of the French word ‘dont’. Depending on the case, ‘dont’ can be a

  • relative pronoun: ‘la difficulté dont je t’ai parlé’ (the difficulty I told you about), ‘voilà le professeur dont j’apprécie beaucoup les cours’ (this is the teacher whose classes I really enjoy.)
  • or, more rarely, a preposition: ‘il y avait cinq couleurs, dont le rouge et le bleu’. (there were five colours, including red and blue.)

It is the latter case that we will be looking at. In this case, ‘dont’ is translated into English as ‘including’. In Corsican, the translation is: c’eranu cinque culori, frà i quali u rossu è u turchinu. But if we translate ‘il y avait cinq plantes, dont le ciste et la bruyère’ (‘there were five plants, including cistus and heather’), we get: c’eranu cinque piante, frà e quale u muchju è a scopa. Thus the translation of ‘dont’ (including) as a preposition is either frà i quali (masculine plural, culore being masculine in Corsican) or frà e quale (feminine plural), depending on which noun ‘dont’ refers to.

Thus ‘dont’ is translated into the masculine plural or the feminine plural, depending on the noun – either masculine or feminine – to which it refers. This casts doubt on the ‘prepositional’ nature of ‘dont’, and leads to further analysis to determine whether there might not be a more suitable grammatical type.
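The agreement described above can be sketched as a simple lookup. This is only an illustration under stated assumptions: the two small tables and the function name `translate_dont` are hypothetical, and the gender table contains only the two antecedents used in the examples.

```python
# Corsican rendering of French "dont" in the sense of "including",
# keyed by the gender of the plural noun it refers to (hypothetical table).
DONT_INCLUDING = {"m": "frà i quali", "f": "frà e quale"}

# Gender of the Corsican antecedents used in the two examples above.
NOUN_GENDER = {"culori": "m", "piante": "f"}


def translate_dont(antecedent: str) -> str:
    """Pick the form of 'dont' that agrees with its Corsican antecedent."""
    return DONT_INCLUDING[NOUN_GENDER[antecedent]]


print("c’eranu cinque culori,", translate_dont("culori"), "u rossu è u turchinu")
print("c’eranu cinque piante,", translate_dont("piante"), "u muchju è a scopa")
```

The point of the sketch is that the translation cannot be chosen from ‘dont’ alone: the function needs the antecedent as input.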

It is worth noting that ‘dont’ (including) can be replaced by ‘parmi lesquels’ (among which, frà i quali) or ‘parmi lesquelles’ (among which, frà e quale) depending on whether the noun to which ‘dont’ refers is in the masculine plural or the feminine plural. This suggests that ‘dont’ could be conceived of as a preposition followed by a pronoun. In the spirit of this analysis, the BDL site notes that ‘dont’ is probably the relative pronoun whose use is the most delicate. To use it correctly, one must know that ‘dont’ always ‘hides’ the preposition ‘de’; ‘dont’ is equivalent to ‘de qui’, ‘de quoi’, ‘duquel’, etc. This link between ‘dont’ and ‘de’ goes back to the Latin origin of ‘dont’, which is from ‘unde’ (“from where”).

More generally, this suggests that further analysis of some prepositions may be needed.

Creating new grammatical types

Italian has ‘prepositions followed by articles’ (preposizioni articolate). This is a specific grammatical type, which refers to a word (e.g. della) that replaces a preposition (di) followed by an article (la):

	il	lo	l’	la	i	gli	le
di	del	dello	dell’	della	dei	degli	delle
a	al	allo	all’	alla	ai	agli	alle
da	dal	dallo	dall’	dalla	dai	dagli	dalle
in	nel	nello	nell’	nella	nei	negli	nelle
su	sul	sullo	sull’	sulla	sui	sugli	sulle

This specific grammatical type also corresponds to:

  • in French: du = de le, des = de les
  • in Corsican and especially in the Sartenese variant: ‘llu = di lu, ‘lla = di la, etc.

This raises the general problem of the number of grammatical types we should retain. Should we create new grammatical types beyond the classical ones, in order to optimise translators and NLP in general? What is the best grammatical type to retain for ‘prepositions followed by an article’: a new primitive one or a compound one (always keeping Occam’s razor in mind)? A preposition followed by an article behaves like a preposition for words on its left, and like an article for words on its right.
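To make the ‘compound type’ option concrete, here is a minimal sketch, assuming a hypothetical class `ArticulatedPreposition` and hypothetical tags `PREP` and `ART`. It models the fused word as one entry that exposes a preposition on its left side and an article on its right side, and that can be expanded back into its two components.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArticulatedPreposition:
    """Compound type: a preposition fused with a definite article."""
    form: str         # fused surface form, e.g. "della"
    preposition: str  # left-hand component, e.g. "di"
    article: str      # right-hand component, e.g. "la"

    # Type seen by the words on the left and on the right (hypothetical tags).
    left_type: str = "PREP"
    right_type: str = "ART"


# A few cells of the Italian table above.
LEXICON = {
    e.form: e
    for e in (
        ArticulatedPreposition("della", "di", "la"),
        ArticulatedPreposition("nel", "in", "il"),
        ArticulatedPreposition("sugli", "su", "gli"),
    )
}


def expand(tokens):
    """Replace each fused form by its preposition + article components."""
    out = []
    for tok in tokens:
        entry = LEXICON.get(tok)
        out.extend([entry.preposition, entry.article] if entry else [tok])
    return out


print(expand(["il", "tetto", "della", "casa"]))
```

With this representation no new primitive type is needed: the compound is fully described by two classical types plus the fused form.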

Evaluation of the performance after changes

I have just performed a series of open tests, using the (pseudo-random) article of the day from the French Wikipedia. For the Taravese variant of the Corsican language, the results come to an average of about 95%. The ‘cismuntinca’ variant generally obtains a slightly lower result, because its masculine and feminine plurals are different (whereas they are identical in Taravese).

Grammatical word-disambiguation again and again

The main difficulty here seems to lie in the adaptation of the grammatical disambiguation module. Indeed, for the French language, such a module performs disambiguation with respect to about 100 categories. The number of pairs (or 3-tuples, 4-tuples, etc.) of disambiguation, for French, is about 250. The question is: when we change languages, how many categories of n-tuples of disambiguation does this result in? In particular, when one switches from French to Italian, does this result in a big change in the categories to be disambiguated?

Let’s take an example, with a particular category of words to disambiguate. One such category is for example AQfs/Vsing3present (feminine singular adjective or verb in the 3rd person singular present tense). A word in Italian that belongs to this type is ‘stanca’. So we have both uses:

  • ‘è stanca’ (she is tired): AQfs
  • ‘stanca il cavallo’ (it tires the horse): Vsing3present

In French, we don’t have this kind of disambiguation category directly because the category concerned is broader than that: it includes at least the 1st person singular of the present. Thus we have the word ‘sèche’, which belongs to this type of disambiguation category:

  • ‘la feuille est sèche’ (the leaf is dry): AQfs
  • ‘je sèche mes cheveux’ (I dry my hair): Vsing1present
  • ‘il sèche sa chemise’ (he dries his shirt): Vsing3present

Of course, the code that allows the disambiguation of AQfs/Vsing1present/Vsing3present should also allow the derivation of the disambiguation of AQfs/Vsing3present. But this gives an idea of the kind of problems that arise and the adaptation needed.
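One way to picture this derivation is the following sketch. It is a toy illustration, not the actual module: the rules look only at the previous word, and the names `RULES`, `restrict` and `disambiguate` as well as the word lists are assumptions made for the example. The rules for the narrower Italian category are obtained by filtering the rules written for the broader French category.

```python
# Toy context rules: (test on the previous word, resulting tag); first match wins.
COPULAS = {"est", "è"}
FIRST_PERSON = {"je", "io"}

RULES = [
    (lambda prev: prev in COPULAS, "AQfs"),
    (lambda prev: prev in FIRST_PERSON, "Vsing1present"),
    (lambda prev: True, "Vsing3present"),  # default reading
]

FRENCH_CATEGORY = frozenset({"AQfs", "Vsing1present", "Vsing3present"})
ITALIAN_CATEGORY = frozenset({"AQfs", "Vsing3present"})


def restrict(rules, category):
    """Keep only the rules whose outcome belongs to the narrower category."""
    return [(test, tag) for test, tag in rules if tag in category]


def disambiguate(prev_word, rules):
    """Return the tag of the first rule that matches the previous word."""
    for test, tag in rules:
        if test(prev_word):
            return tag
    return None


italian_rules = restrict(RULES, ITALIAN_CATEGORY)

print(disambiguate("je", RULES))         # 'je sèche mes cheveux'
print(disambiguate("è", italian_rules))  # 'è stanca'
print(disambiguate("", italian_rules))   # 'stanca il cavallo' (sentence start)
```

Filtering works here because the Italian category is a subset of the French one; a category with no French counterpart would still require new rules.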

If the types of disambiguation are very different from one language to another, it will be necessary to have a disambiguation module which is capable of adapting to many new types of disambiguation and which is therefore very flexible. This appears to be a considerable difficulty for the creation of an eco-system. It seems that Apertium, faced with this difficulty, has chosen a statistical module as a solution for its eco-system. However, whether such a flexible module, adaptable without difficulty from one language to another, is feasible in the context of rule-based MT remains an open question.

First feasibility test: dictionary morphing

The first test carried out to transform the dictionary (in the extended sense) based on the French-Corsican pair into a dictionary for the Italian-Gallurese pair shows that it is feasible. The result, of an acceptable but perfectible quality, is obtained in 21 minutes (with 16 GB of RAM and an Intel Core i7-8550U CPU). We start with a multi-lingual dictionary based on French entries, and the final result is an Italian-Gallurese dictionary.

Translation from Italian to Gallurese

Our new project will be to try to implement translation from Italian into Gallurese. This is an essential pair for the Gallurese language, and therefore a priority. The major difficulty in doing this is twofold:
– on the one hand, to (automatically) transform the dictionary (in the extended sense) based on the French-Corsican pair, into a dictionary related to the Italian-Gallurese pair
– on the other hand, to implement automatically (without having to rewrite them entirely) the other modules, and in particular the one based on grammatical disambiguation.

The stakes here seem high. It is a question of transforming a system that can translate one pair of languages (i.e. French into Corsican) into an eco-system that can translate several pairs of languages (each with an endangered language as its target).

Adjective modifiers again

We will consider again a category of words such as ‘very’, when they precede an adjective. Traditionally, this category is termed ‘adverbs’ or ‘adverbs of degree’, but we prefer ‘adjective modifier’, because (i) analytically, they change the meaning of an adjective and (ii) synthetically, an adjective modifier followed by an adjective is still an adjective. A more complete list is: almost, absolutely, badly, barely, completely, decidedly, deeply, enormously, entirely, extremely, fairly, fully, greatly, hardly, highly, how, incredibly, intensely, less, most, much, nearly, perfectly, positively, practically, pretty, purely, quite, rather, really, scarcely, simply, somewhat, strongly, terribly, thoroughly, totally, utterly, very, virtually, well.

If we look at sentences such as: il est bien content (he is very happy, hè beddu cuntenti), ils étaient bien contents (they were very happy, erani beddi cuntenti), elle serait bien contente (she would be very happy, saria bedda cuntenti), elles sont bien contentes (they are very happy, sò beddi cuntenti), we can see that the adjective modifier ‘bien’ is rendered as ‘very’ in English and in Corsican as:

  • bellu/beddu: singular masculine
  • belli/beddi: plural masculine
  • bella/bedda: feminine singular
  • belle/beddi: feminine plural

This shows that the adjective modifier is invariable in French and English, but varies in gender and number in Corsican. Thus, in Corsican grammar, it seems appropriate to distinguish between:

  • singular masculine adjective modifier
  • plural masculine adjective modifier
  • singular feminine adjective modifier
  • plural feminine adjective modifier

On the other hand, such a distinction does not seem useful in English and French, where the category of ‘adjective modifier’ is sufficient and there is no need for further detail.
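The four Corsican subtypes can be sketched as a lookup keyed by gender and number. This is a minimal illustration: the function name `adjective_modifier` and the labels "ll" and "dd" for the two spelling series (bellu... and beddu...) are hypothetical, while the forms themselves are those listed above.

```python
# Corsican forms of the adjective modifier rendering French "bien" / English
# "very". The keys "ll" and "dd" are hypothetical labels for the two spelling
# series given in the list above.
BIEN = {
    ("m", "s"): {"ll": "bellu", "dd": "beddu"},
    ("m", "p"): {"ll": "belli", "dd": "beddi"},
    ("f", "s"): {"ll": "bella", "dd": "bedda"},
    ("f", "p"): {"ll": "belle", "dd": "beddi"},
}


def adjective_modifier(gender: str, number: str, series: str = "dd") -> str:
    """Return the modifier agreeing with the adjective it precedes."""
    return BIEN[(gender, number)][series]


# 'elle serait bien contente' -> 'saria bedda cuntenti'
print("saria", adjective_modifier("f", "s"), "cuntenti")
```

For French or English the equivalent table would have a single entry, which is why the finer distinction is only needed on the Corsican side.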

On ‘reflexive pronouns’

Pursuing the reflection on grammatical categories, we will now examine ‘reflexive pronouns’. These are:

  • me te se nous vous se (French)
  • mi ti si ci vi si (Corsican)
  • myself yourself himself/herself/itself ourselves yourselves themselves (English)

Let us take an example:

  • je me promène, tu te promènes, il se promène, nous nous promenons, vous vous promenez, ils se promènent
  • I walk, you walk, he walks, we walk, you walk, they walk
  • spassieghju, spassieghji, spassieghja, spassiemu, spassieti, spassièghjani

These reflexive pronouns are usually associated with so-called pronominal verbs.
From our point of view, this classification as ‘pronouns’ is unsatisfactory, because they always precede a verb,1 but are placed after a personal subject pronoun, an indefinite pronoun, or a nominal group. In particular, the notion of pronoun following a pronoun is not coherent, from the point of view of our analysis, where the main criterion for typology is the position of a given grammatical type in relation to another.

Let us recall here that the idea behind this reconstruction of grammatical typology is the hypothesis that traditional classification lacks coherence and that this considerably hinders the development of natural language analysis and, at the same time, the development of machine translation modules based on the emulation of human reasoning.

This example suggests that the classic ‘reflexive pronoun’ is a word that introduces into the verb to which it refers a notion of reflexivity of action. In this sense, it is more of a specialized verb modifier. It is thus more akin to the adverb in the sense that we have defined it, i.e. a verb modifier in the broad sense. The adverb in this sense can be placed before or after the verb. On the other hand, the reflexive verb modifier as we have defined it can only be placed in French before the verb.

1 I oversimplify here, since there are also some structures like: tu t’en souviens (you remember it, ti n’inveni).