The main challenge for alignment is that sentences/segments do not necessarily map one-to-one, and there are many possible patterns of misalignment, as will be shown in the following section. The essence of the alignment process is to identify cues that indicate which segments align with which. Among the most obvious cues are sentence lengths (the number of characters or words in a sentence): shorter source sentences tend to align with shorter target sentences, and longer source sentences with longer target ones. However, several factors can reduce the effectiveness of the length criterion, e.g. consecutive sentences of similar length, or inconsistent length distributions (such as when an acronym is expanded). It can also be the case that the source and target sentences follow the same length distribution but are not actually translations of one another (as when each set of segments is ordered alphabetically).
Alternatively, lexical cues can provide more confidence that two sentences are translations of one another. The major drawback cited by almost all lexical-based approaches is the heavy processing requirement: alignment is generally much slower than with length-based approaches. A further problem is that lexical cues are not always available, since they must exist in machine-readable form (machine-readable bilingual dictionaries) (Melamed, 1996). Even when such dictionaries/lexicons are available, words in the source segments are often context-sensitive or part of idiomatic expressions, so their typical corresponding words will either be absent from the correct target segment or will match the wrong segments.
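As a rough illustration of the length cue described above, the following Python sketch scores a candidate pair by the ratio of its character lengths. The function names `length_ratio_score` and `best_target_by_length` are hypothetical and chosen only for this example; the ratio is a simplified stand-in for a length-based criterion, not the scoring used by any particular aligner.

```python
from typing import Sequence


def length_ratio_score(src: str, tgt: str) -> float:
    """Ratio of the shorter to the longer character length (0.0 if either is empty)."""
    a, b = len(src), len(tgt)
    if a == 0 or b == 0:
        return 0.0
    return min(a, b) / max(a, b)


def best_target_by_length(src: str, targets: Sequence[str]) -> int:
    """Index of the target whose length is closest to the source (first one wins ties)."""
    return max(range(len(targets)), key=lambda i: length_ratio_score(src, targets[i]))


# Similar-length neighbours: both candidates score the same,
# so length alone cannot decide between them.
print(best_target_by_length("abcdef", ["uvwxyz", "ghijkl"]))

# Acronym expansion: the lengths diverge sharply even though
# the pair is a correct translation.
print(length_ratio_score("UNDP", "برنامج الأمم المتحدة الإنمائي"))
```

The tie in the first call mirrors the case of consecutive sentences with similar lengths, and the low score in the second call mirrors the case of an expanded acronym.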
A further typical problem with lexical cues is that many consecutive sentences may contain no words found in the lexicon; such stretches may be described as “text deserts”, where there are no cues to indicate which segment maps to which.
In addition to the above-mentioned challenges, there are more specific challenges for English-Arabic pairs, and for United Nations documents.
1- Arabic morphology: The affixation system in Arabic is complex. As the example in Table 1 shows, one English word can correspond to many Arabic tokens, all of which are forms of the same word.
Table 1: Complexity of Arabic Morphology
English phrase    Transliteration / Arabic form
Submitted a report    Taqriran / تقريرا
His report    Taqrirahu / تقريره
Her report    Taqriraha / تقريرها
And their report    wTaqrirahum / وتقريرهم
In my report    bTaqriri / بتقريري
And to our report    wlTaqrirana / ولتقريرنا
The report    alTaqrir / التقرير
So, if the word pair in our lexicon is Report:Taqrir/تقرير, we will not be able to match the other forms. The challenge, therefore, is to stem every word systematically and consistently to its base form.
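The stemming idea can be sketched as simple affix stripping. The Python example below is a minimal sketch under strong assumptions: the prefix and suffix lists and the `MIN_STEM` threshold are hypothetical and cover only the forms listed in Table 1, and the function name `light_stem` is chosen for illustration. A real Arabic stemmer would need a much richer treatment of affixes, diacritics and orthographic variants.

```python
# Hypothetical affix lists, limited to the forms shown in Table 1.
PREFIXES = ("ال", "و", "ب", "ل")               # the, and, in/with, to
SUFFIXES = ("هم", "ها", "نا", "ه", "ي", "ا")   # their, her, our, his, my, accusative ending
MIN_STEM = 3  # assumed minimum stem length, to limit over-stripping


def light_stem(token: str) -> str:
    """Strip known prefixes repeatedly, then at most one known suffix."""
    stem = token
    stripped = True
    while stripped:
        stripped = False
        for prefix in PREFIXES:
            if stem.startswith(prefix) and len(stem) - len(prefix) >= MIN_STEM:
                stem = stem[len(prefix):]
                stripped = True
                break
    for suffix in SUFFIXES:  # longer suffixes are listed first
        if stem.endswith(suffix) and len(stem) - len(suffix) >= MIN_STEM:
            stem = stem[: -len(suffix)]
            break
    return stem


forms = ["تقريرا", "تقريره", "تقريرها", "وتقريرهم", "بتقريري", "ولتقريرنا", "التقرير"]
for form in forms:
    print(form, "->", light_stem(form))
```

With these assumed lists, every form in the table reduces to the lexicon entry تقرير. The same rules would over-strip words whose first or last letter merely happens to coincide with an affix, which is why consistency is the hard part.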
2- Arabic orthography: Some lexical-based approaches rely heavily on cognates to substitute for or complement lexical cues; however, this works mainly for similar language pairs, e.g. English-French, and not for languages with completely different orthographies, such as English-Arabic.
3- Arabic word order: Some geometric approaches assume that the correct alignment places the words/tokens in the most linear fashion. However, this is not the case in English-Arabic pairs, on at least two counts:
– English sentences follow the Subject-Verb-Object (SVO) order, while Arabic sentences typically follow the Verb-Subject-Object (VSO) order.
– English adjective phrases are in exactly the opposite order to the corresponding Arabic phrase, as in the example below (notice that Arabic text goes from right to left):
Word Order Difference between English and Arabic
English: General (1) Temporary (2) Assistance (3)
Arabic: المساعدة (3) المؤقتة (2) العامة (1)
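One way to quantify how far a word-level correspondence departs from the linear assumption is to count crossing links. The Python sketch below is illustrative only: `count_crossings` is a hypothetical helper, and the links are written by hand from the adjective phrase example rather than produced by an aligner.

```python
from itertools import combinations
from typing import Iterable, Tuple


def count_crossings(links: Iterable[Tuple[int, int]]) -> int:
    """Count pairs of (source_pos, target_pos) links whose order is reversed."""
    ordered = sorted(links)
    return sum(
        1
        for (s1, t1), (s2, t2) in combinations(ordered, 2)
        if s1 < s2 and t1 > t2
    )


# English positions: General (1), Temporary (2), Assistance (3).
# Arabic reading order: Assistance, Temporary, General.
adjective_phrase = [(1, 3), (2, 2), (3, 1)]
monotone_phrase = [(1, 1), (2, 2), (3, 3)]

print(count_crossings(adjective_phrase))
print(count_crossings(monotone_phrase))
```

A fully reversed three-word phrase yields the maximum number of crossings for its length, whereas a monotone mapping yields none, so an approach that prefers the most linear path is penalised on exactly this kind of phrase.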
4- Arabic length considerations: While Arabic sentences are typically shorter than English sentences, there may be certain situations where the Arabic sentence is considerably longer, as shown below:
Cases where Arabic phrases are longer than their English equivalents
Case    English phrase    Arabic equivalent
For certain new terminology    Gender Mainstreaming    تعميم مراعاة المنظور الجنساني
For acronyms    UNDP    برنامج الأمم المتحدة الإنمائي
United Nations Specific Challenges
There are many editorial conventions within United Nations documents that cause and exacerbate the problem of misalignment, for example:
1- Alphabetical listing: Countries (and other entities) are typically sorted alphabetically. This means that their order in each language is different.
2- Section displacement: This can also depend on the alphabetical sorting of the section headers. Some alignment approaches assume that segment IDs increase continuously, whereas in situations like these, as we progress through the source segments and find increasing target segment IDs, we may encounter a new section that sits in an earlier part of the document and hence has lower target segment IDs. This appears as the negative offset pattern, one of the misalignment patterns described in the following section.
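The backward jump in target segment IDs can be detected with a simple scan. The Python sketch below is a minimal illustration: `negative_offsets` is a hypothetical helper, and the ID sequence is invented toy data standing in for a document in which a section appears earlier in the target than in the source.

```python
from typing import List, Sequence, Tuple


def negative_offsets(target_ids: Sequence[int]) -> List[Tuple[int, int]]:
    """Given target segment IDs listed in source order, return (index, delta)
    for every position where the target ID is lower than the previous one."""
    jumps = []
    for i in range(1, len(target_ids)):
        delta = target_ids[i] - target_ids[i - 1]
        if delta < 0:
            jumps.append((i, delta))
    return jumps


# Toy data: source segments 0..6 map to these target segment IDs.
# The sixth source segment starts a section that sits earlier in the target.
target_ids = [1, 2, 3, 10, 11, 4, 5]
print(negative_offsets(target_ids))
```

An aligner that assumes monotonically increasing IDs would not consider the lower-numbered target segments at that point, so flagging such jumps is one possible first step towards handling displaced sections.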