Tag Archives: machine translation

New insight on the issue of pair reversal (updated)

The issue of pair reversal goes as follows: suppose you have a translation pair A>B that translates language A into language B; how hard is it to build the reverse pair B>A? The current instance of this problem is: given the French>Italian pair, how hard is it to build an Italian>French pair? To state it more explicitly: could AI help build a reverse pair in a very short time? Arguably, if AI could build such a reverse pair quickly, it would be some kind of breakthrough. We do not expect 100% efficiency and accuracy in this reversal process, but if some 98% or 99% were possible, it would do the job. For AI within MT is not only targeted at translating; it is also targeted at constructing translation engines.

Just tested pair reversal from French-Italian to Italian-French. Some 70% can be done automatically, but a big issue remains, which relates to the disambiguation of Italian words. The disambiguation engine seems to be the crux of the matter here. The upshot is that the entire disambiguation module needs to be rewritten so that it is (if possible) language-related. The new module must be more AI-focused. If successful, it could open the path to the (somewhat) fast construction of a multi-language ecosystem with a rule-based MT architecture.
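To illustrate why disambiguation is the sticking point, here is a minimal sketch of reversing a lexicon. The four-entry French>Italian lexicon and the function name `reverse_pair` are hypothetical and chosen only for illustration; this is not the actual engine. Entries that map one-to-one reverse automatically, while a target word shared by several source words (here Italian 'nipote', which covers both 'neveu' and 'petit-fils') needs a disambiguation rule.

```python
from collections import defaultdict

# Toy French>Italian lexicon (illustrative entries only, not real engine data).
FR_IT = {
    "chat": "gatto",
    "maison": "casa",
    "neveu": "nipote",
    "petit-fils": "nipote",
}

def reverse_pair(lexicon):
    """Invert a source>target lexicon and split the result into entries
    that reverse cleanly and entries that need disambiguation."""
    inverted = defaultdict(list)
    for source, target in lexicon.items():
        inverted[target].append(source)
    # One source word only: the reverse entry can be built automatically.
    automatic = {t: s[0] for t, s in inverted.items() if len(s) == 1}
    # Several source words: a disambiguation rule is required.
    ambiguous = {t: sorted(s) for t, s in inverted.items() if len(s) > 1}
    return automatic, ambiguous

automatic, ambiguous = reverse_pair(FR_IT)
print(automatic)
print(ambiguous)
```

In this toy case, two of the three Italian entries reverse without help and one is flagged for a rule, which mirrors in miniature the split between the automatic part and the disambiguation work.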

How to translate ‘Cette phrase est en français’ ? (This sentence is in French) – updated

Let us consider the following French sentence: Le comté de Kronoberg est un comté suédois dont le nom signifie en français ‘Couronne de montagne’. It translates into Corsican: A cuntea di Kronoberg hè una cuntea svedese chì u so nome significheghja in francese ‘Curona di muntagna’. (The County of Kronoberg is a Swedish county whose name means in French ‘Mountain crown’.) But it should be translated more accurately as: A cuntea di Kronoberg hè una cuntea svedese situata in u sudu di u paese, è chì u so nome significheghja in corsu ‘Curona di muntagna’ since the words significheghja in francese (means in French) are utterly false.

Now a semantic difficulty is lurking whose core can be related to self-reference: how should we translate ‘Cette phrase est en français’? Self-reference stems here from ‘cette phrase’ (this sentence). Literally, it translates into: This sentence is in French. But a sense-preserving translation would be: This sentence is in English.

A much more complicated instance of self-reference within translation is as follows: ‘Cette phrase ne comprend que sept mots’ (This sentence contains only seven words). It translates into Corsican: ‘Ss’infrasata ùn cumprendi ch’è setti paroli’. The statement is also true of the Corsican translation, but false of the literal English one, which includes only six words. Arguably, a better, sense-preserving English translation is then:
This sentence contains only six words. Such a translation ability is beyond the scope of present MT. We can tag it as an ability that would be required of superintelligent MT. It would include identifying self-referent parts of discourse, such as: this sentence, these words, this proposition, this paragraph, this text, … But not all self-referring discourse is concerned here. For example, the Liar paradox (this sentence is false) is irrelevant here, since we only place ourselves from the standpoint of MT. Interestingly, such a superintelligent ability also requires some meta-knowledge, i.e. the language of the source text and of the target text. For a shift from the source language to the target language is needed here.
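The word-count case can be sketched as a small search: try each number word in the translated template and keep the one that makes the sentence true of itself. The template string, the `NUMBER_WORDS` table and the function name are assumptions made for this illustration; a real system would first have to detect that the sentence is self-referential at all.

```python
# Number words for the target language (English), limited to 1..12 for the sketch.
NUMBER_WORDS = {
    1: "one", 2: "two", 3: "three", 4: "four", 5: "five", 6: "six",
    7: "seven", 8: "eight", 9: "nine", 10: "ten", 11: "eleven", 12: "twelve",
}

def sense_preserving(template):
    """Return the filling of `template` whose actual word count equals
    the number it states, or None if no number in the table fits."""
    for n, word in NUMBER_WORDS.items():
        candidate = template.format(n=word)
        if len(candidate.split()) == n:
            return candidate
    return None

result = sense_preserving("This sentence contains only {n} words")
print(result)  # This sentence contains only six words
```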

What is required from Artificial General Intelligence with regard to Machine Translation?


In a series of posts, we will try to define what is required of an AGI (Artificial General Intelligence) in order to reach the level of superintelligence in MT (machine translation). (All this is highly speculative, but we shall give it a try.)
One of the difficulties that arise in machine translation relates to the translation of expressions. This leads us to mention one of the required skills of a superintelligence: the ability to identify an expression within a text in a given language and then to translate it into another language. Expressions are of different types: verbal, nominal, adjectival, adverbial, … To fix ideas, we can focus here on verbal expressions. Take, for example, the French expression ‘couper les cheveux en quatre’ (literally, to cut hairs in four, i.e. to split hairs), which translates into Corsican as either castrà i falchetti (literally, to castrate the hawks) or castrà i cucchi (literally, to castrate the cuckoos). In order to properly translate such an expression, a superintelligence must be able to:

  • identify ‘couper les cheveux en quatre’ as a verbal expression in a French corpus
  • identify castrà i falchetti as a verbal expression within a Corsican corpus
  • associate the two expressions as the proper translation of each other

It appears here that such an aptitude falls under the scope of AGI (Artificial general intelligence).
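The third step, association, can be pictured as a lookup in a phrase table once the first two steps have been done. The sketch below assumes such a table already exists; `EXPRESSIONS` and `find_expressions` are hypothetical names, and only the infinitive form is matched. Discovering the expressions in raw corpora, which is the hard part, is not covered.

```python
# Hypothetical phrase table: French verbal expression -> Corsican equivalents.
EXPRESSIONS = {
    "couper les cheveux en quatre": ["castrà i falchetti", "castrà i cucchi"],
}

def find_expressions(sentence, table=EXPRESSIONS):
    """Return (expression, candidate translations) pairs found in the sentence.
    Only the infinitive form is matched; conjugated forms are not handled."""
    lowered = sentence.lower()
    return [(expr, targets) for expr, targets in table.items() if expr in lowered]

hits = find_expressions("Il ne faut pas couper les cheveux en quatre.")
print(hits)
```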

Superintelligent machine translation (updated)


Let us consider superintelligence with regard to machine translation. To fix ideas, we can propose a rough definition: it consists of a machine with the ability to translate with 99% (or above) accuracy from any one of the 8000 languages to another. It seems relevant here to count the 8000 present human languages, including some 4000 or 5000 languages which are at risk of extinction before the end of the 21st century. It could also relevantly include some extinct languages which are reasonably well described and meet the conditions for building rule-based translation. But arguably, this definition needs some additional criteria. The most important appears to be the ability to self-improve its performance. In practice, this could be done by reading or hearing texts. The superintelligent translation machine should be able to acquire new vocabulary from what it reads or hears: not only words, but also locutions (noun locutions, adjective locutions, adverbial locutions, verbal locutions, etc.). It should also be able to acquire new sentence structures from its readings and enrich its database of grammatical sentence structures. It should also be able to grow its database of word meanings for ambiguous words and instantly build the associated disambiguation rules. In addition, it should be capable of detecting and implementing specific grammatical structures.
It seems superintelligence will be reached when the superintelligent translation machine is able to perform all that without any human help.

Also relevant to this discussion is the point, argued previously, that rule-based translation is better suited to endangered-language translation than statistical translation. Why? Because large-scale corpora do not exist for endangered languages. From the above definition of superintelligent machine translation (SMT), it follows that rule-based translation is also best suited to SMT, since SMT massively includes endangered languages (though arguably, statistical MT could still be used for translating major languages into one another).

Let us speculate now on how this path to superintelligent translation will be achieved. We can mention here:

  • a quantitative scenario: (i) acquire, first, an ability to translate very accurately, say, 100 languages; (ii) develop, second, the ability to self-improve; (iii) extend, third, the translation ability to the whole set of 8000 human languages.
  • alternatively, there could be a qualitative scenario: (i) acquire, first, an ability to translate somewhat accurately the 8000 languages (the accuracy could vary from language to language, especially with rare endangered languages); (ii) suggest, second, improvements to vocabulary, locutions, sentence structures, disambiguation rules, etc. that are verified and validated by humans; (iii) acquire, third, the ability to self-improve by reading texts or hearing conversations.
  • it is worth mentioning a third alternative that would consist of a hybrid scenario, i.e. a mix of quantitative and qualitative improvements. It will be our preferred scenario.

But we should provide more details on how these steps could be achieved. To fix ideas, let us focus on the word self-improvement module: it allows the superintelligent translation machine to extend its vocabulary in any language. This could be accomplished by reading or hearing new texts in any language. When facing a new word, superintelligent machine translation (SMT, for short) should be able to translate it instantly into the 8000 other languages and add it to its vocabulary database.
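The first half of such a module, spotting words that are not yet in the vocabulary, is easy to sketch. The function name and the tiny `known` set below are assumptions for illustration; the second half (translating each new word into every other language) is precisely the speculative part and is not attempted here.

```python
import re

def collect_unknown_words(text, vocabulary):
    """Return the words of `text` that are missing from `vocabulary`,
    in order of first appearance and without duplicates."""
    unknown = []
    # [^\W\d_]+ matches runs of letters, including accented ones.
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        if word not in vocabulary and word not in unknown:
            unknown.append(word)
    return unknown

known = {"le", "comté", "est", "un", "suédois"}
pending = collect_unknown_words("Le comté de Kronoberg est un comté suédois", known)
print(pending)  # ['de', 'kronoberg']
```

Note that the proper noun 'Kronoberg' is flagged along with a genuine gap ('de'), so a real module would also need to decide which unknown words should remain untranslated.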

To give another example, another module would be the locution self-improvement module: it allows the superintelligent translation machine to extend its knowledge of locutions in any language.

Also relevant to this topic is the following question: could SMT be achieved without AGI (general AI)? We shall address this question later.


Is rule-based MT more ethical than statistical MT?

In the ongoing debate on safe AI, whether rule-based MT is more ethical than statistical MT is a relevant open question. Here are some arguments in favor of rule-based MT in this context (without blaming statistical MT, which has its own strengths):

  • it emulates human reasoning: it translates a text just as a human would do
  • there is much control on rule-based MT since the resulting translated text can be traced back: a detailed step-by-step translation process can be provided if required
  • rule-based MT can be consistently integrated into a wider project of brain emulation, which emulates general human reasoning
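The traceability argument can be made concrete with a minimal sketch in which every rule that fires is recorded next to the result. The example reuses the French 'échecs' ambiguity (fiaschi/scacchi); the cue words in `CHESS_CUES`, the rule names and the function itself are hypothetical and only meant to show what a step-by-step trace could look like. The function assumes the sentence contains 'échecs'.

```python
# Hypothetical context cues suggesting the 'chess' reading of French 'échecs'.
CHESS_CUES = {"jouer", "joue", "joueur", "partie"}

def translate_echecs(sentence):
    """Translate 'échecs' and return (translation, trace of applied rules)."""
    trace = ["found ambiguous word 'échecs' (fiaschi/scacchi)"]
    words = set(sentence.lower().replace(".", "").split())
    cues = sorted(words & CHESS_CUES)
    if cues:
        trace.append(f"context cue(s) {cues}: rule 'chess reading' applied")
        return "scacchi", trace
    trace.append("no context cue: default rule 'failure reading' applied")
    return "fiaschi", trace

translation, trace = translate_echecs("Il joue aux échecs.")
print(translation)
for step in trace:
    print(" -", step)
```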

Rough typology of remaining errors (updated March 2018)

French to Corsican: performance on the French Wikipedia sample test currently stands at 94% on average. Below is a rough typology of the remaining errors (presumably, an average score of 95% on the open test should be attainable by correcting the errors tagged ‘easy’):

  • unknown vocabulary: 40% (easy)
  • basic disambiguation: 25%  (easy or medium difficulty)
  • false positives: 5% (medium difficulty or hard). This type of error is mostly related to proper nouns, i.e. English terms that should remain untranslated. For example: ‘North American Aviation’ translates erroneously into ‘North American Aviazione’. In this case, ‘Aviation’ should remain untranslated.
  • inadequate locution: 10% (medium difficulty or hard)
  • anaphora resolution related to complex sentence structure: 5% (hard)
  • semantic disambiguation: 5% (hard). For example, disambiguating French ‘échecs’ = fiaschi/scacchi (failures/chess)
  • erroneous agreement related to gender mismatch from French to Corsican, i.e. (i) words that are masculine in French and feminine in Corsican; and (ii) words that are feminine in French and masculine in Corsican: 1% (medium difficulty).
  • erroneous agreement related to number mismatch from French to Corsican, i.e. (i) words that are singular in French and plural in Corsican; and (ii) words that are plural in French and singular in Corsican (for example, French ‘la canicule’ translates into ‘i sulleoni’ in Corsican): 1% (medium difficulty).
  • specific grammatical case: 2% (hard)
  • anaphora resolution associated with gender or number mismatch: 1% (hard)
  • unknown, unclassified: 6% (hard)
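As a back-of-the-envelope check on the 95% target, the sketch below converts shares of the remaining errors into points of overall score. It assumes that the percentages in the list are shares of the remaining 6% of errors and that fixing a category removes its errors without introducing new ones; both are simplifications, and the function names are illustrative.

```python
def projected_score(current, fixed_share):
    """Score after removing `fixed_share` of the remaining errors."""
    return current + (1 - current) * fixed_share

def share_needed(current, target):
    """Share of the remaining errors that must be removed to reach `target`."""
    return (target - current) / (1 - current)

current = 0.94
# Removing all 'unknown vocabulary' errors (40% of the remaining errors).
after_vocabulary = projected_score(current, 0.40)
# Share of remaining errors to remove in order to reach 95%.
needed_for_95 = share_needed(current, 0.95)

print(f"{after_vocabulary:.1%}")
print(f"{needed_for_95:.1%}")
```

Under these assumptions, reaching 95% requires removing about one sixth of the remaining errors, and fixing all unknown-vocabulary errors alone would bring the score to roughly 96.4%.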

Evaluation of machine translation: why not self-evaluation?

Evaluation of machine translation is usually done via external tools (to cite some instances: ARPA, BLEU, METEOR, LEPOR, …). But let us investigate the idea of self-evaluation. For it seems that the software itself is capable of having an accurate idea of its possible errors.

In the above example, human evaluation yields a score of 1 – 5/88 = 94.32%. Contrast this with self-evaluation, which sums its possible errors (unknown words and disambiguation errors), yielding a self-evaluation score of 92.05% from 7 hypothesized errors. In this case, self-evaluation computes the maximum error rate. But even here, there are some false positives: ‘apellation’ is left untranslated, being unrecognized. In fact, the correct spelling is ‘appellation’. To sum up: the software identifies an unknown word (and leaves it untranslated) and counts it as a possible error.
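Both scores follow the same formula, one minus the number of errors divided by the number of words. A minimal sketch, using the figures quoted above (88 words, 5 errors found by the human evaluator, 7 errors hypothesized by the software) and rounding to two decimals:

```python
def score(total_words, errors):
    """Accuracy as the share of words that are not counted as errors."""
    return 1 - errors / total_words

human = score(88, 5)       # errors confirmed by a human evaluator
self_eval = score(88, 7)   # errors hypothesized by the software itself

print(f"{human:.2%}")      # 94.32%
print(f"{self_eval:.2%}")  # 92.05%
```

Since the software counts every suspected error, its figure acts as a lower bound on the score as long as it does not miss errors it is unaware of.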

Let us sketch what could be the pros and cons of MT self-evaluation. To begin with, the pros:

  • it could provide a detailed taxonomy of possible errors: unknown words, unresolved grammatical disambiguation, unresolved semantic disambiguation, …
  • it could identify precisely the suspected errors
  • evaluation would be very fast and inexpensive
  • self-evaluation would work with any text or corpus
  • self-evaluation could pave the way to further self-improvement and self-correction of errors
  • its reliability could be good

And the cons:

  • MT may be unaware of some types of errors, i.e. errors related to expressions and locutions
  • MT self-evaluation could be especially blind to grammatical errors
  • it would sometimes count as unknown words some foreign words that should remain untranslated
  • MT would be unaware of erroneous disambiguations

Semantic disambiguation of French ‘femme’: in the mud, gold is still shining

In Corsican, the French word ‘femme’ can be translated, depending on the context:

  • either into donna (woman)
  • or into moglia (wife)

The above sample still contains a lot of vocabulary and grammatical disambiguation errors (easy/medium difficulty), but it handles successfully the semantic disambiguation (hard) of ‘femme’, two instances of which are properly translated into moglia (wife). As the Corsican proverb says, in a cianga l’oru luci sempri (in the mud, gold is still shining).
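One simple context rule for this ambiguity can be sketched as follows. The rule itself (a possessive determiner right before ‘femme’ suggests the ‘wife’ reading) is an assumption made for illustration and is not a description of the actual disambiguation engine, which would need many more cues.

```python
# Assumed cue: a possessive determiner immediately before 'femme'.
POSSESSIVES = {"ma", "ta", "sa", "notre", "votre", "leur"}

def translate_femme(sentence):
    """Return 'moglia' when 'femme' follows a possessive determiner,
    'donna' otherwise, and None if 'femme' does not occur."""
    words = sentence.lower().replace(",", " ").replace(".", " ").split()
    for i, word in enumerate(words):
        if word == "femme":
            if i > 0 and words[i - 1] in POSSESSIVES:
                return "moglia"
            return "donna"
    return None

print(translate_femme("Il est venu avec sa femme."))
print(translate_femme("Une femme est entrée."))
```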

French samples are from the French corpora of the University of Leipzig.

A Special Case of Anaphora Resolution

After improper anaphora resolution

Anaphora resolution usually refers to pronouns. But we face here a special case of anaphora resolution that relates to an adjective. The following phrase: ‘un vase de Chine authentique’ (an authentic vase from China) is translated erroneously as un vasu di China autentica, due to erroneous anaphora resolution. In this sample, the adjective ‘authentique’ refers to ‘vase’ (English: vase) and not to ‘Chine’ (China).

The same goes for ‘une chanson du Portugal mythique’, where ‘mythique’ refers to ‘chanson’ and not to ‘Portugal’.
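Both examples share the pattern ‘noun + de + proper noun + adjective’. A minimal sketch of one possible attachment heuristic is given below: when the second noun is a proper noun (detected here by its capital letter), attach the adjective to the first noun. The heuristic, the `GENDER` table (French genders, used as a stand-in) and the function name are assumptions for illustration only.

```python
# French genders of the nouns used in the two examples.
GENDER = {"vase": "m", "Chine": "f", "chanson": "f", "Portugal": "m"}

def attach_adjective(head_noun, complement_noun):
    """For 'N1 de N2 ADJ', return the noun the adjective modifies and its gender.
    Assumed heuristic: if N2 is a proper noun (capitalized), attach to N1;
    otherwise attach to the nearest noun, N2."""
    target = head_noun if complement_noun[0].isupper() else complement_noun
    return target, GENDER[target]

print(attach_adjective("vase", "Chine"))        # ('vase', 'm')
print(attach_adjective("chanson", "Portugal"))  # ('chanson', 'f')
```

The translated adjective would then agree with the returned noun rather than with the country name, which is what goes wrong in the erroneous output above.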

After appropriate anaphora resolution

Four consecutive ambiguous words

Translating the following sentence: ‘ce fait est unique’ is not as easy as it might seem at first glance. Indeed, it is made up of four consecutive ambiguous words:

  • ‘ce’: ‘ssu (demonstrative pronoun, this) or ciò (it, relative pronoun)
  • ‘fait’: fattu (masculine singular noun, fact), fattu (past participle, done) or faci (does, third person singular of the verb to do in the present tense)
  • ‘est’: estu (masculine singular noun, east) or hè (is, third person singular of the verb to be in the present tense)
  • ‘unique’: unicu (masculine singular adjective, unique in English) or unica (feminine singular adjective, unique in English)
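To give an idea of the search space, the sketch below enumerates every combination of readings (2 × 3 × 2 × 2 = 24) and filters them with one simple pattern. The tag names, the pattern ‘determiner, noun, verb, adjective’ and the gender-agreement check are assumptions made for this illustration; the form ‘hè’ for ‘is’ is taken from the Kronoberg example earlier in the text.

```python
from itertools import product

# (Corsican form, assumed tag, gender or None) for each French word.
READINGS = {
    "ce": [("'ssu", "DET", None), ("ciò", "PRON", None)],
    "fait": [("fattu", "NOUN", "m"), ("fattu", "PART", "m"), ("faci", "VERB", None)],
    "est": [("estu", "NOUN", "m"), ("hè", "VERB", None)],
    "unique": [("unicu", "ADJ", "m"), ("unica", "ADJ", "f")],
}

def candidate_readings(words):
    """All combinations of one reading per word."""
    return list(product(*(READINGS[w] for w in words)))

def select(words):
    """Keep four-word readings tagged DET NOUN VERB ADJ whose adjective
    agrees in gender with the noun."""
    kept = []
    for reading in candidate_readings(words):
        tags = [tag for _, tag, _ in reading]
        noun_gender, adj_gender = reading[1][2], reading[3][2]
        if tags == ["DET", "NOUN", "VERB", "ADJ"] and noun_gender == adj_gender:
            kept.append(" ".join(form for form, _, _ in reading))
    return kept

sentence = ["ce", "fait", "est", "unique"]
print(len(candidate_readings(sentence)))  # 24
print(select(sentence))                   # ["'ssu fattu hè unicu"]
```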