Index page project for the online okchakko translator

Today we are launching a project on GitHub to write a better index page for the okchakko translator project.

The current index page is located at the following address.

This index page gives free online access to translation from French to Corsican, a language threatened with extinction.

The current index page has several defects:

  • its design is basic and rather crude, on a white background
  • the source text and the destination text should be aligned horizontally (as in Google Translate, DeepL, etc.) and not vertically

The index page index.php will be published under the MIT license.

Your contributions are welcome. You can help this project by proposing a better index page than the current one of the okchakko project (a priori in PHP).

The 90% rule

The translator from French to Gallurese is currently under development. An Android application is planned first; it will be called ‘traducidori gaddhuresu’. The French-Gallurese translator is now undergoing testing, and it will only be published if its performance (evaluated by an open test) is above 90%. This is a rule that we apply to ourselves, and it is specific to endangered languages: we consider that, for them, a poor or low-quality translation can be more harmful than useful.
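
The 90% rule can be read as a simple release gate. The sketch below is only an illustration under stated assumptions: the function name `may_publish`, the idea of scoring the open test by counting correct items, and the sample counts are all hypothetical, since the scoring method of the test is not described here.

```python
THRESHOLD_PERCENT = 90  # the 90% rule stated above


def may_publish(correct: int, total: int,
                threshold_percent: int = THRESHOLD_PERCENT) -> bool:
    """Return True only if performance is strictly above the threshold.

    `correct` and `total` are assumed to be counts of test items;
    how an item is judged correct is outside this sketch.
    """
    if total <= 0:
        raise ValueError("the test must contain at least one item")
    if not 0 <= correct <= total:
        raise ValueError("correct must lie between 0 and total")
    # Integer comparison avoids floating-point rounding at the boundary.
    return correct * 100 > threshold_percent * total


if __name__ == "__main__":
    # Hypothetical test results, for illustration only.
    print(may_publish(91, 100))   # above 90%
    print(may_publish(90, 100))   # exactly 90% is not "above 90%"
```

With this reading, a translator scoring exactly 90% is not published, because the rule asks for performance above 90%.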

Characteristics of an AGI (artificial general intelligence)

What characteristics do we want for an AGI (artificial general intelligence)? An AGI should have a very advanced capacity in NLP and language comprehension. One of the qualities we expect from an AGI is respect for multilingualism. Hopefully, its extensive NLP capabilities would apply to a large number of languages, and even to the 8000 languages of the planet, i.e. also to the 90% of them that are endangered. The AGI could thus help to address language extinction, an important problem that affects human cultural diversity (it can be assumed that some languages will be extinct at the time of the AGI event, but the AGI could then help to revitalize them).

Priority pairs regarding endangered languages

There exist priority translation pairs, from the standpoint of endangered languages. The notion of a priority pair is relative to a given endangered language: it is the most useful pair for the current users of that language. For example, French to Corsican is a priority pair, with respect to other pairs such as Gallurese-Corsican, English-Corsican or Spanish-Corsican. In this context, every endangered language has its own priority pair. For example, a priority pair for Sardinian Gallurese is Italian-Gallurese. In the same way, a priority pair for Sardinian Sassarese is Italian-Sassarese. Analogously, a priority pair for the Sicilian language is Italian-Sicilian.

Why rule-based translation is (presently) best suited to endangered languages

Here are some arguments in favor of choosing rule-based translation for the machine translation of endangered languages (this relates to the philosophy of language policy):

  • at present, there is no reliable corpus between the given endangered language and other languages
  • endangered languages are often polynomic, i.e. several main variants of the language coexist. It is important to preserve these variants, since (i) they are a feature of diversity and (ii) they are an inherent feature of the given endangered language, and to distinguish between them. In addition, a translation should not mix these variants. This also complicates the process of building a proper corpus, since the scarce existing corpus is made up of different variants of the language.
  • in the absence of an adequate corpus, statistical machine translation cannot provide quality translation of the given endangered language (whereas it succeeds with common languages for which excellent corpora are available). Arguably, providing low-quality translation (although the attempt is commendable) could harm these endangered languages, which are by definition vulnerable, since people could use and spread the resulting low-quality translation. On those grounds, given this vulnerability, it could be argued that a minimum of 80% translation quality is needed for a given pair involving an endangered language.
  • in addition, endangered languages are usually in a ‘diglossic’ relationship with another language: what is needed as a matter of priority is translation between the two languages of this pair
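
The requirement that a translation should not mix variants can be illustrated with a toy word-by-word lookup. This is not a full rule-based engine and it is not the okchakko implementation: the function `translate_words`, the variant labels, and the lexicon entries are assumptions made for this sketch, and the sample words should be checked by a speaker before any real use.

```python
# Hypothetical per-variant lexicons (French -> Corsican).
# The entries are illustrative assumptions, not taken from okchakko.
LEXICONS = {
    "cismuntincu": {"pain": "pane", "chien": "cane"},
    "pumuntincu": {"pain": "pani", "chien": "cani"},
}


def translate_words(words, variant):
    """Translate word by word using the lexicon of ONE variant only."""
    lexicon = LEXICONS[variant]  # raises KeyError for an unknown variant
    output = []
    for word in words:
        # No fallback to another variant: an unknown word is kept
        # in brackets rather than borrowed from a different lexicon.
        output.append(lexicon.get(word.lower(), f"[{word}]"))
    return output


if __name__ == "__main__":
    print(translate_words(["pain", "chien"], "pumuntincu"))
    print(translate_words(["Pain", "maison"], "cismuntincu"))
```

Keeping one lexicon per variant, with no cross-variant fallback, means that a gap in one variant shows up as a visible missing word instead of silently producing mixed output.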

(to be continued)