One relevant line of research is machine translation between closely related languages, which is arguably simpler than general SMT and can thus be handled using word-for-word translation, manual language-specific rules that take care of the necessary morphological and syntactic transformations, or character-level translation/transliteration. This has been tried for a number of language pairs, including Czech–Slovak (Hajič, Hric, and Kuboň 2000), Turkish–Crimean Tatar (Altintas and Cicekli 2002), Irish–Scottish Gaelic (Scannell 2006), and Macedonian–Bulgarian (Nakov and Tiedemann 2012).
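To make the word-for-word and character-level techniques concrete, here is a minimal sketch in Python. It is only an illustration of the general idea, not a reimplementation of any of the cited systems: the word lexicon and the character rewrite rules are invented placeholders for a hypothetical language pair.

```python
# Toy illustration of translation between closely related languages:
# word-for-word lookup in a small bilingual lexicon, with a crude
# character-level rewrite as a fallback for out-of-vocabulary words.
# The lexicon and the character rules below are invented placeholders,
# not resources from any of the systems cited above.

# Hypothetical word-level lexicon: source word -> target word.
LEXICON = {
    "house": "haus",
    "water": "wasser",
}

# Hypothetical character-level correspondences applied to unknown words,
# approximating regular spelling differences between the two languages.
CHAR_RULES = [
    ("th", "d"),
    ("oo", "u"),
]

def transliterate(word: str) -> str:
    """Rewrite an unknown word using the character-level rules."""
    for source, target in CHAR_RULES:
        word = word.replace(source, target)
    return word

def translate(sentence: str) -> str:
    """Translate word for word; fall back to transliteration for unknown words."""
    output = []
    for word in sentence.lower().split():
        output.append(LEXICON.get(word, transliterate(word)))
    return " ".join(output)

if __name__ == "__main__":
    print(translate("the house book"))  # -> "de haus buk"
```

Real systems of this kind combine such simple substitutions with language-specific rules for the necessary morphological and syntactic transformations, which the sketch omits.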
A special case of this line of research is translation between dialects of the same language, for example, between Cantonese and Mandarin (Zhang 1998), or between a dialect of a language and a standard version of that language, for example, between an Arabic dialect such as Egyptian and Modern Standard Arabic (Bakr, Shaalan, and Ziedan 2008; Sawaf 2010; Salloum and Habash 2011; Sajjad, Darwish, and Belinkov 2013). Here again, manual rules and/or language-specific tools and resources are typically used. In the case of Arabic dialects, a further complication arises from the informal status of the dialects: they are not standardized and are not used in formal contexts, but rather only in informal online media such as social networks, chats, forums, Twitter, and SMS messages; the Egyptian Wikipedia is one notable exception.
In contrast, we have a different objective: we do not carry out full translation but rather adaptation (since our ultimate goal is to translate into a third language X).
Most of the world's languages are resource-poor for statistical machine translation; still, many of them are actually related to some resource-rich language. Thus, we propose three novel, language-independent approaches to source language adaptation for resource-poor statistical machine translation. Specifically, we build improved statistical machine translation models from a resource-poor language POOR into a target language TGT by adapting and using a large bitext for a related resource-rich language RICH and the same target language TGT. We assume a small POOR–TGT bitext, from which we learn word-level and phrase-level paraphrases and cross-lingual morphological variants between the resource-rich and the resource-poor language. Our work is of importance for resource-poor machine translation because it can provide a useful guideline for people building machine translation systems for resource-poor languages. Our experiments with Indonesian/Malay–English translation show that using the large adapted resource-rich bitext yields an improvement of 7.26 BLEU points over the unadapted one and 3.09 BLEU points over the original small bitext. Moreover, combining the small POOR–TGT bitext with the adapted bitext outperforms the corresponding combinations with the unadapted bitext by 1.93–3.25 BLEU points. We also demonstrate the applicability of our approaches to other languages and domains.
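As a rough illustration of this adaptation setting, the following Python sketch rewrites the source (RICH) side of a RICH–TGT bitext so that it looks more like POOR. It is a simplified, hypothetical illustration rather than the paper's actual pipeline: it assumes a word-level paraphrase table (RICH word to POOR words with probabilities) has already been learned, for example from the small POOR–TGT bitext, and it applies only 1-best word substitutions, whereas the approaches described above also use phrase-level paraphrases and cross-lingual morphological variants. The Malay–Indonesian table entries are toy examples.

```python
# Minimal sketch of source-side adaptation: rewrite the RICH side of a large
# RICH-TGT bitext so that it looks more like POOR, then use the result as
# additional POOR-TGT training data. Simplified illustration only: it applies
# 1-best word-level substitutions from a paraphrase table that is assumed to
# have been learned separately (e.g., from the small POOR-TGT bitext).

from typing import Dict, List, Tuple

# Hypothetical paraphrase table: RICH word -> list of (POOR word, probability).
ParaphraseTable = Dict[str, List[Tuple[str, float]]]

def adapt_sentence(rich_sentence: str, table: ParaphraseTable) -> str:
    """Replace each RICH word with its most probable POOR paraphrase, if any."""
    adapted = []
    for word in rich_sentence.split():
        candidates = table.get(word)
        if candidates:
            best_word, _ = max(candidates, key=lambda pair: pair[1])
            adapted.append(best_word)
        else:
            # Related languages share much of their vocabulary, so copying the
            # word unchanged is a reasonable fallback.
            adapted.append(word)
    return " ".join(adapted)

def adapt_bitext(bitext: List[Tuple[str, str]],
                 table: ParaphraseTable) -> List[Tuple[str, str]]:
    """Adapt the source (RICH) side of a RICH-TGT bitext; the TGT side is kept."""
    return [(adapt_sentence(rich, table), tgt) for rich, tgt in bitext]

if __name__ == "__main__":
    # Toy, invented Malay -> Indonesian entries for illustration only.
    table = {"bila": [("ketika", 0.6), ("kalau", 0.4)],
             "kerana": [("karena", 0.9)]}
    rich_tgt = [("dia menangis kerana dia sedih", "she cried because she was sad")]
    print(adapt_bitext(rich_tgt, table))
```

The adapted sentence pairs can then be concatenated with the small POOR–TGT bitext before training, which is the kind of combination whose effect the BLEU comparisons above measure.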