Lexicons for Low-Resource Languages
The source language needs a lexicon that can be consulted for every word in the text to be translated. A lexicon supplies useful information about each word, such as its grammatical category (part of speech). The target language needs a lexicon as well, because words from the source text must end up as valid target words; the target-language lexicon helps determine whether the words produced are well-formed. Such a resource may be hard to obtain for a low-resource language, and if no electronic lexicon has ever existed, the linguist may need to create one. How can a person create a lexicon for a language that has never had one? It is a big job and requires extensive help from the language community. What tool should hold the lexical data? Many tools are available, but ideally it would be one that linguists already commonly use for lexicography. From an MT perspective, it should be a kind of database, and a highly normalized one, so that the data stays consistent.
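To make the idea of a normalized lexical database more concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names and helper functions (build_lexicon, add_entry) are assumptions made for illustration only; they do not describe the schema of any particular lexicography tool. The point is that each grammatical category is stored exactly once and entries refer to it by key, so a misspelled or unknown category cannot slip into the lexicon.

```python
import sqlite3


def build_lexicon(conn):
    """Create a small normalized schema (hypothetical, for illustration)."""
    # Ask SQLite to enforce foreign-key references on this connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE category (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL UNIQUE      -- each category is stored once
        );
        CREATE TABLE entry (
            id          INTEGER PRIMARY KEY,
            form        TEXT NOT NULL,
            lemma       TEXT NOT NULL,
            category_id INTEGER NOT NULL REFERENCES category(id)
        );
    """)


def add_entry(conn, form, lemma, category):
    """Add a word, refusing any category that is not already defined."""
    row = conn.execute(
        "SELECT id FROM category WHERE name = ?", (category,)
    ).fetchone()
    if row is None:
        raise ValueError(f"unknown category: {category}")
    conn.execute(
        "INSERT INTO entry (form, lemma, category_id) VALUES (?, ?, ?)",
        (form, lemma, row[0]),
    )
```

With this layout, renaming a category is a single update to one row, and every entry that refers to it stays consistent automatically.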
For the MT task, a lexicon needs only a minimal amount of information. The form of the word, a lemma and a grammatical category might be sufficient, but a linguist and the language community may be interested in much more. At a minimum they might want a gloss or definition of the word in another language, and many other things are useful, such as example sentences, synonyms, antonyms and notes. It would be a huge advantage to have a lexicon tool that can handle all this information and at the same time serve as the repository of words for an MT system.
A bilingual lexicon, in which source words are linked to target words, is also needed. It is very important because it tells the transfer process how to convert a source word into a target word. A bilingual lexicon must be able to deal with homographs (words that have the same written form but different meanings) and distinguish between them. When linking words, we must consider the level at which they are linked: a bilingual lexicon can link just the forms, it can link form plus grammatical category, or it can link at the level of word senses. Finer granularity in the linking between source and target languages is preferable: the finer the granularity, the better the chance of resolving ambiguity.
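The effect of linking granularity can be illustrated with a small Python sketch. Everything in it is hypothetical: the Sense class, the LINKS table, the candidates function and the English-to-Spanish toy data are invented for this example and are not taken from any real lexicon or tool. The sketch shows how the set of possible target words for one homograph shrinks as more information about the source word is supplied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sense:
    """One sense of a source word (hypothetical structure)."""
    form: str
    category: str
    gloss: str


# Illustrative toy data: three senses of the English homograph "bank",
# each linked to a Spanish target word.
LINKS = {
    Sense("bank", "noun", "edge of a river"): "orilla",
    Sense("bank", "noun", "financial institution"): "banco",
    Sense("bank", "verb", "to deposit money"): "depositar",
}


def candidates(form, category=None, gloss=None):
    """Return the target words compatible with the information supplied.

    Form only            -> form-level linking
    Form + category      -> form-plus-category linking
    Form + category + gloss -> sense-level linking
    """
    return sorted(
        target
        for sense, target in LINKS.items()
        if sense.form == form
        and (category is None or sense.category == category)
        and (gloss is None or sense.gloss == gloss)
    )


print(candidates("bank"))                                   # form only
print(candidates("bank", "noun"))                           # form + category
print(candidates("bank", "noun", "financial institution"))  # sense level
```

In this toy data, linking by form alone leaves three candidate translations, adding the grammatical category leaves two, and only sense-level linking narrows the choice to one.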