Tag Archives: Corsican language

How rule-based and statistical machine translation can help each other

Here are a few suggestions on how rule-based and statistical machine translation can help each other:

To begin with, rule-based and statistical machine translation are often contrasted and compared: it would be oversimplifying to conclude that one is better than the other. From a more objective standpoint, let us consider that each method has its strengths and weaknesses. Let us investigate how one could make them collaborate in order to combine their respective strengths.

In the case of an endangered language, the lack of good-quality corpora has been pointed out. One way for rule-based and statistical machine translation to collaborate would be to use rule-based translation to build a better-quality corpus for statistical machine translation.

Suppose we begin with statistical machine translation software that performs at 50% on average for French-to-Corsican translation.

Let us sketch the process of creating these better corpora, taking the example of the French-Corsican diglossic pair (Corsican being classified by UNESCO as a definitely endangered language). At present we lack a quality French-Corsican corpus or, to put it more accurately, the corpus at our disposal is a low-quality one. The idea would be to use rule-based machine translation to create a much better corpus for use with statistical machine translation.

Let us now sketch the steps of this collaborative process:
(i) Create a French-Corsican corpus with the help of rule-based machine translation: if the software performs at 90% on average, then the corpus would be about 90% reliable on average. With appropriate training, statistical MT should then perform at, say, 80% on average (compared with the previous 50%).
(ii) From this French-Corsican corpus, other corpus pairs can be created, such as Italian-Corsican, English-Corsican, etc., since French-Italian, English-Italian, etc. corpora of excellent quality already exist. The performance gain should then extend to other language pairs such as Italian-Corsican, English-Corsican, etc.
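Steps (i) and (ii) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy lexicon, the function names (`toy_rule_based_translate`, `build_synthetic_corpus`, `pivot_corpus`) and the Italian example sentence are hypothetical stand-ins, and a real rule-based engine would of course do far more than word-for-word lookup.

```python
from typing import Callable, Dict, List, Tuple

# Toy French-to-Corsican lexicon: a hypothetical stand-in for a real
# rule-based engine, used only to make the sketch runnable.
TOY_LEXICON: Dict[str, str] = {
    "le": "u",
    "fils": "figliolu",
    "du": "di u",
    "roi": "rè",
}


def toy_rule_based_translate(french: str) -> str:
    """Word-for-word lookup; unknown words are left untranslated."""
    return " ".join(TOY_LEXICON.get(word, word) for word in french.lower().split())


def build_synthetic_corpus(
    french_sentences: List[str], translate: Callable[[str], str]
) -> List[Tuple[str, str]]:
    """Step (i): pair each French sentence with its rule-based translation."""
    return [(sentence, translate(sentence)) for sentence in french_sentences]


def pivot_corpus(
    french_corsican: List[Tuple[str, str]], french_other: List[Tuple[str, str]]
) -> List[Tuple[str, str]]:
    """Step (ii): join two corpora on their shared French side.

    Returns (other language, Corsican) pairs for every French sentence
    present in both corpora.
    """
    french_to_corsican = dict(french_corsican)
    return [
        (other, french_to_corsican[french])
        for french, other in french_other
        if french in french_to_corsican
    ]


fr_co = build_synthetic_corpus(["le fils du roi"], toy_rule_based_translate)
fr_it = [("le fils du roi", "il figlio del re")]  # assumed existing French-Italian pair
it_co = pivot_corpus(fr_co, fr_it)
```

The resulting `it_co` list is the kind of Italian-Corsican corpus that could then be fed to statistical MT training; any error made by the rule-based step is carried over into it.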

With the help of this process, we are finally in a position to combine the strengths of the two complementary approaches to MT: on the one hand, rule-based MT is able to translate with good accuracy even in the absence of corpora; on the other hand, statistical machine translation is able to handle a great many language pairs successfully and quickly. To sum up, as the Corsican proverb says: una mani lava l’altra (One hand washes the other).

French ‘fin’ followed by a year number: fixed

Tagger improvement: fixed this issue. French ‘l’Empire allemand’ now translates properly into l’Imperu alimanu (the German Empire). French word ‘fin’ is now identified as a preposition when followed by a year number.
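The ‘fin’ rule can be sketched as follows. This is a minimal illustration, not the actual tagger: the function name `tag_fin`, the tag labels and the year pattern (three or four digits) are all assumptions.

```python
import re
from typing import List, Tuple

# Assumption: a "year number" is a token made of three or four digits.
YEAR = re.compile(r"\d{3,4}")


def tag_fin(tokens: List[str]) -> List[Tuple[str, str]]:
    """Tag French 'fin' as a preposition when the next token is a year."""
    tagged = []
    for i, token in enumerate(tokens):
        if token.lower() == "fin":
            following = tokens[i + 1] if i + 1 < len(tokens) else ""
            tag = "PREP" if YEAR.fullmatch(following) else "NOUN_OR_ADJ"
        else:
            tag = "OTHER"  # every other token is out of scope for this sketch
        tagged.append((token, tag))
    return tagged
```

For instance, `tag_fin(["fin", "1918"])` tags ‘fin’ as `PREP`, while in `tag_fin(["la", "fin", "du", "film"])` it keeps the `NOUN_OR_ADJ` tag.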

The above excerpt is translated into the ‘sartinesu’ variant of the Corsican language.

This issue relates to the more general problem of the grammatical status of numbers, a problem to which we shall return later.

Translation of preposition ‘à’ followed by noun phrase denoting a location

‘au stade de Wembley’ (at the Wembley Stadium) should translate into: in u stadiu di Wembley.

Here we face the issue of translating the preposition ‘à’ (‘au’ being short for ‘à le’, to the), in particular when ‘à’ is followed by a noun phrase denoting a location. French ‘à’ must be disambiguated, since it can translate either into à (to) or into in (in).
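One possible way to handle this is a lexicon of location nouns, sketched below. The lexicon content, the function name `translate_au` and the restriction to the masculine singular contraction ‘au’ are assumptions made for the sake of the example.

```python
# Hypothetical lexicon of French nouns denoting a location.
LOCATION_NOUNS = {"stade"}


def translate_au(head_noun: str) -> str:
    """Translate French 'au' (= 'à le') given the head noun that follows.

    Returns 'in u' before a location noun, 'à u' otherwise.
    Only the masculine singular case is covered in this sketch.
    """
    preposition = "in" if head_noun.lower() in LOCATION_NOUNS else "à"
    return f"{preposition} u"
```

With this rule, `translate_au("stade")` gives `in u` (as in in u stadiu di Wembley), while a non-location noun such as ‘siècle’ keeps `à u`. A flat lexicon cannot cover every location noun phrase, so this is a first approximation only.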

Agreement of the past participle

Now scoring 1 – 2/129 = 98.44%.
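As a worked check of this figure, here is a small sketch of the computation. It assumes that the score is one minus the ratio of errors to the total number of scored units (the post does not say what exactly the 129 counts), and that the percentage is truncated, not rounded, to two decimals, which is what reproduces 98.44%. The function name `score_percent` is hypothetical.

```python
import math
from fractions import Fraction


def score_percent(errors: int, total: int) -> float:
    """Return (1 - errors/total) as a percentage truncated to two decimals."""
    exact = (1 - Fraction(errors, total)) * 100  # exact rational percentage
    return math.floor(exact * 100) / 100  # truncate, do not round


print(score_percent(2, 129))  # 98.44
```

Rounding instead of truncating would give 98.45 for the same input.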

  • The issue of past participle agreement again: ‘une session du parlement tenue à Nuremberg’ (a session of the Parliament held in Nuremberg) should translate into una sessione di u parlamentu tenuta in Nuremberg. The past participle tenuta should agree with sessione (feminine, session) and not with parlamentu (masculine, Parliament). This may require dependency parsing, but even that could be insufficient. Perhaps (harder) semantic disambiguation is required in this case.
  • One false positive: ‘des’, being a German word, should remain untranslated.
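The agreement problem in the first point above can be illustrated by contrasting two strategies for picking the noun that controls the participle. This is a toy sketch: the gender lexicon, the function name `agree_participle` and the masculine form tenutu (which does not appear in the example sentence) are assumptions, and the head noun is simply taken to be the first noun of the phrase.

```python
from typing import List

# Hypothetical gender lexicon for the two nouns of the example.
GENDER = {"sessione": "f", "parlamentu": "m"}

# Singular forms of the participle; 'tenutu' is the assumed masculine form.
PARTICIPLE = {"m": "tenutu", "f": "tenuta"}


def agree_participle(nouns: List[str], strategy: str) -> str:
    """Inflect the participle according to the controlling noun.

    nouns: nouns of the noun phrase in surface order, head noun first.
    strategy: 'head' picks the head noun, 'nearest' picks the last noun.
    """
    controller = nouns[0] if strategy == "head" else nouns[-1]
    return PARTICIPLE[GENDER[controller]]
```

Here `agree_participle(["sessione", "parlamentu"], "nearest")` yields the erroneous tenutu, whereas the `"head"` strategy yields the expected tenuta. Finding the head reliably is precisely what would call for a dependency parser.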

Past participle or present simple: the disambiguation of French ‘construit’

In the present case, it should read: custruitu à u seculu XII (built in the 12th century). The error relates to the disambiguation of French ‘construit’, which can translate into:

  • custruitu (built): past participle, masculine, singular
  • custruisce (builds): present simple, third person

MT should (i) find the proper referent of ‘construit’, i.e. ‘clocher’ (church tower), but above all (ii) determine whether ‘construit’ is a past participle or a present simple. Some kind of dependency parser is in order…
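Short of a dependency parser, a crude local heuristic can be sketched as follows. It is deliberately simplistic and rests on assumptions: the function name `translate_construit` and the pronoun list are hypothetical, and only the immediately preceding token is inspected.

```python
# Subject pronouns that signal a finite verb right after them (assumed list).
SUBJECT_PRONOUNS = {"il", "elle", "on"}


def translate_construit(previous_token: str) -> str:
    """Choose between the two Corsican readings of French 'construit'.

    After a subject pronoun the present simple is chosen;
    in every other context the past participle is assumed.
    """
    if previous_token.lower() in SUBJECT_PRONOUNS:
        return "custruisce"  # present simple, third person
    return "custruitu"  # past participle, masculine, singular
```

The heuristic handles ‘clocher construit’ and ‘il construit’, but it would wrongly return the participle for a sentence whose subject is a full noun phrase, such as ‘le roi construit’ (the king builds), which is why a real parser is needed.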

The disambiguation of French ‘fils’ again: scoring 98.42%

Scoring 1 – 2/127 = 98.42%. Of interest:

  • ‘de 839 à sa mort’ (from 839 to his death) should read: da u 839 à a so morte. French ‘de’ translates either into di or into da in Corsican (to simplify matters, since in certain cases, being a partitive article, it translates into nothing).
  • now we face again the multi-ambiguous French ‘fils’, which can translate into: i) figliolu, masculine, singular (son) ii) figlioli, masculine, plural (sons) iii) fili, masculine, plural (wire/wires). In the present case, ‘Fils du roi…’ should translate into Figliolu di u rè… (Son of King…).
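The three readings of ‘fils’ listed above can be narrowed down with a few context cues, as in the sketch below. The cue lists, the rules themselves and the function name `candidates_for_fils` are assumptions made for illustration, not the rules actually used by the software.

```python
from typing import List, Optional

# Assumed determiner lists used as number cues.
SINGULAR_DETERMINERS = {"le", "un", "son", "ce"}
PLURAL_DETERMINERS = {"les", "des", "ses", "ces"}


def candidates_for_fils(previous: Optional[str], following: Optional[str]) -> List[str]:
    """Return the Corsican candidates for French 'fils' that survive the cues."""
    prev = (previous or "").lower()
    nxt = (following or "").lower()
    if prev in SINGULAR_DETERMINERS:
        return ["figliolu"]  # singular: 'son' is the only reading left
    if prev in PLURAL_DETERMINERS:
        return ["figlioli", "fili"]  # plural: sons or wires, still ambiguous
    if nxt in {"de", "du"}:
        # Bare 'Fils du roi…' is assumed to be a singular kinship apposition.
        return ["figliolu"]
    return ["figliolu", "figlioli", "fili"]
```

For ‘Fils du roi…’, `candidates_for_fils(None, "du")` leaves only figliolu. After a plural determiner two candidates remain, so further (semantic) cues would still be needed.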

Of note: five consecutive 100% sentences.

With regard to the Feigenbaum test: failed again. Arguably, the first error is of an acceptable kind in this context. But the ‘fils’ error is a gross one, which a human would not make…

Can translation help teach an endangered language?

Can translation help in self-teaching an endangered language? It seems so, if the translation is accurate. Let us check with the verb parlà (to speak). In this case, the translation is 100% accurate, so it can help (but we need to check other verb categories and other tenses). Other verbs of the same group are those ending in -à: manghjà (to eat), saltà (to jump), cantà (to sing), etc.

To begin with, conjugations in the present simple, the imperfect and the future:

  • je parle (I speak), tu parles (you speak), il/elle parle (he/she speaks),
    nous parlons (we speak), vous parlez (you speak), ils/elles parlent (they speak)
  • je parlais (I was speaking), tu parlais (you were speaking), il/elle parlait (he/she was speaking),
    nous parlions (we were speaking), vous parliez (you were speaking), ils/elles parlaient (they were speaking)
  • je parlerai (I will speak), tu parleras (you will speak), il/elle parlera (he/she will speak), nous parlerons (we will speak), vous parlerez (you will speak), ils/elles parleront (they will speak).

Of interest:

  • French ‘parle’ is ambiguous since it can translate into parlu (I speak) or parla (he/she speaks).
  • French ‘parlais’ is ambiguous since it can translate into parlavu (I was speaking) or parlavi (you were speaking).
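These ambiguities disappear once the French subject pronoun is taken into account, as in the lookup sketched below. The table contains only the four Corsican forms quoted above; the function name `translate_verb` and the table layout are assumptions.

```python
from typing import Dict, Optional, Tuple

# (French verb form, French subject pronoun) -> Corsican verb form.
# Only the forms quoted in the text are listed.
FORMS: Dict[Tuple[str, str], str] = {
    ("parle", "je"): "parlu",
    ("parle", "il"): "parla",
    ("parle", "elle"): "parla",
    ("parlais", "je"): "parlavu",
    ("parlais", "tu"): "parlavi",
}


def translate_verb(pronoun: str, verb: str) -> Optional[str]:
    """Return the Corsican form for a pronoun + verb pair, or None if unknown."""
    return FORMS.get((verb.lower(), pronoun.lower()))
```

So `translate_verb("je", "parle")` gives parlu while `translate_verb("elle", "parle")` gives parla; a pair outside the table returns `None`.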