COMPUTATIONAL LINGUISTICS AND TURKIC LANGUAGE STUDIES

Orhun Murat *

1. Introduction
A natural language offers many different ways to express an idea, and thousands of languages exist in the world today. Some countries have more than one official language. In Canada, for example, English and French must be used in all official documents, so every document has to be translated into both languages. At a meeting or conference, each sentence must be translated into the other language immediately, whereas translating a long article takes correspondingly longer. The quality of the translation also matters. For these reasons, translation has been an important task since early times, and scientists have long sought a general solution to it. With the invention of the computer, translation became one of the hot topics of research, and the resulting field is called computational linguistics. Computer scientists and linguists alike tried to translate one language into another with a computer program. The first computer-based translation system translated about 60 sentences from Russian into English [Chéragui 2012: 161]. The result was regarded as a great success, and scientists thought that a general-purpose machine translation system could be built within 3-5 years. Once the real project started, however, progress was slow, and the expected results had not been achieved after 10 years of research. As a consequence, the famous ALPAC report was issued [Hutchins 1995: 431], after which the US government reduced funding for research on computer-based translation, or machine translation. From the late 1980s, computer technology was far more advanced than in the 1960s, and computers with large memories and high speeds became available at lower cost. Hence computer-based translation has again become one of the hottest topics in contemporary science. Machine Translation (MT) is a subfield of Natural Language Processing (NLP), and NLP is a field of Artificial Intelligence.
The purpose of machine translation is to translate one natural language into another with software, without any human help. Unfortunately, such a system is very difficult to implement. The main reason is that natural language offers hundreds of ways to express an idea, including idioms, humor, set phrases and poetry. Moreover, some expressions depend on culture, custom, or even tone of voice. At present it is not possible to implement a Fully Automatic High Quality (FAHQ) translation system. Although FAHQ translation is not yet possible, some special-purpose machine translation systems have been implemented and are used actively in everyday life and in research. Examples include Météo, a system developed to translate the weather bulletins issued by the meteorological institutes of Canada from English into French [Chandiox 1976: 127-133], and the English-Japanese system for translating the titles of scientific and engineering papers [Nagao et al., 1982: 245]. In addition, some general-purpose machine translation systems have been implemented even though they do not meet the FAHQ criteria, for example the RUSLAN system [Hajič 1987: 113], which translates from Czech into Russian, and the CESILKO system [Hajič et al., 2000: 7-12], which translates from Czech into Slovak. Translation quality depends on how closely the two languages are related: the accuracy of RUSLAN is about 40 percent, whereas that of CESILKO is about 90 percent.

This paper gives a brief overview of computer-based translation related to the Uyghur language in order to illustrate recent developments in computational research on Turkic languages. The rest of the paper is organized as follows. The next section summarizes machine translation work related to Turkic languages and Uyghur. Section three discusses some problems in machine translation for Uyghur. Section four gives a brief conclusion on computational linguistics and its effect on Turkic studies.

2. Related Works
Uyghur is a Turkic language and belongs to the Ural-Altaic language family. Almost all Turkic languages share the same grammatical structure, apart from minor differences. One of the main differences lies in the words borrowed from other languages: Russian words appear in the Central Asian Turkic languages, Chinese, Arabic and Persian words appear in Uyghur, and English words appear in the Turkish of Turkey. For this reason, some differences arise when suffixes or prefixes are added to a word. In natural language studies, the first step is to analyze a word correctly into its morphemes. Turkic languages are agglutinative and heavily inflected. This means that a word can, in theory, take an unlimited number of suffixes, and that some characters change to satisfy vowel and consonant harmony. For example:

OSMANLILAŞTIRAMAYABİLECEKLERİMİZDENMİŞSİNİZCESİNE

This word can be broken into morphemes as follows:

OSMAN+LI+LAŞ+TIR+AMA+YABİL+ECEK+LER+İMİZ+DEN+MİŞ+SİNİZ+CESİNE

This is a famous, rather exaggerated example from Turkish, and it means “as if you were of those whom we might consider not converting into an Ottoman” [Oflazer 1995: 137]. The root of this word is “OSMAN” and the remaining morphemes are suffixes. Such examples can be found in other Turkic languages as well. To process a word, it is necessary to analyze it correctly, and a root combined with its possible prefixes and suffixes can yield millions of different word forms. When a suffix is attached to a word, the category of the word or phrase changes according to the type of the suffix, and this change affects the structure of the whole sentence [Oflazer 1995: 138]. Within the Turkic language family, the Turkish of Turkey is one of the most studied languages in computer science, and it was the first of these languages whose morphology was analyzed by computer. Because all Turkic languages belong to the same family and are closely related to each other, some technical results can be applied to other Turkic languages with some modification. For example, morphological analyzers for Turkmen [Tantuğ et al., 2006: 186-193], Crimean Tatar [Altıntaş et al., 2001: 180-189], Uyghur [Orhun et al., 2009: 33-43], [Orhun et al., 2009: 811-816], Kazakh [Kessikbayeva et al., 2014: 46-54] and Qazan Tatar [Gökgöz et al., 2011: 428-432] have been implemented on the basis of the Turkish morphological analyzer [Oflazer 1995: 137-148]. Because Turkic languages are agglutinative, the analysis of a word usually produces more than one solution, and this creates a morphological ambiguity: it must be decided which solution is correct [Oflazer et al., 1996: 69]. For example, the Uyghur morphological analyzer generates the following solutions for the Uyghur word “yazmaqci” (will write).
yazmaqci: yaz+Verb+Pos+Fut+A3sg
yazmaqci: yaz+Verb+Pos^DB+Noun+Inf1+A3sg+Pnon+Nom^DB+Adj+Agt
yazmaqci: yaz+Verb+Pos^DB+Noun+Inf1+A3sg+Pnon+Nom^DB+Noun+Agt+A3sg+Pnon+Nom
yazmaqci: yaz+Verb+Pos+Inten+A3sg
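
Analyses written in this notation lend themselves to mechanical processing. The following Python sketch is only an illustration and not the code of the Uyghur analyzer: it assumes that “+” separates the root from the tags and that “^DB” marks a derivation boundary, and the names Analysis and parse_analysis are hypothetical.

```python
from typing import List, NamedTuple


class Analysis(NamedTuple):
    surface: str              # analyzed word form, e.g. "yazmaqci"
    root: str                 # root morpheme, e.g. "yaz"
    groups: List[List[str]]   # tag groups, split at each ^DB boundary


def parse_analysis(line: str) -> Analysis:
    """Parse one line such as 'yazmaqci: yaz+Verb+Pos+Fut+A3sg'."""
    surface, _, rest = line.partition(":")
    root, _, tags = rest.strip().partition("+")
    # Assumption: every ^DB starts a new group with its own part of speech.
    groups = [group.strip("+").split("+") for group in tags.split("^DB")]
    return Analysis(surface.strip(), root, groups)


# The four analyses listed above, with line breaks repaired.
ANALYSES = [
    "yazmaqci: yaz+Verb+Pos+Fut+A3sg",
    "yazmaqci: yaz+Verb+Pos^DB+Noun+Inf1+A3sg+Pnon+Nom^DB+Adj+Agt",
    "yazmaqci: yaz+Verb+Pos^DB+Noun+Inf1+A3sg+Pnon+Nom^DB+Noun+Agt+A3sg+Pnon+Nom",
    "yazmaqci: yaz+Verb+Pos+Inten+A3sg",
]

parsed = [parse_analysis(line) for line in ANALYSES]
# The first tag of the last group is the category the word ends up with.
final_categories = [p.groups[-1][0] for p in parsed]
print(final_categories)
```

Under these assumptions the four analyses end in the categories Verb, Adj, Noun and Verb, which is exactly the ambiguity that has to be resolved.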

The first solution states that the root of the word is “yaz” (write) and that it is a verb in the positive, future-tense, third person singular form. The second solution states that the root is a positive verb, that the suffix “maq” turns it into a noun, and that the suffix “ci” then turns this noun into an adjective. The remaining solutions can be read in the same way. Important research on such morphological disambiguation problems has been carried out for Turkish with government funding [Oflazer et al., 1996: 69-81], [Hakkani-Tür et al., 2002: 318-410]. Once morphological analyzers have been implemented for both the source and the target language and the disambiguation problem has been solved, a simple machine translation system can be built. Such a system can use a rule-based method, a statistical method, or a hybrid of the two as described in [Tantuğ 2007]. In [Tantuğ et al., 2006: 109-116], a morphological analyzer was first implemented for Turkmen. Because Turkmen and Turkish sentences differ in some respects, rules were defined to replace Turkmen suffixes with the correct Turkish suffixes. After these rules had been defined, the Turkmen root words were translated into Turkish root words. In natural language, a word can often be translated into more than one word of the other language, which gives rise to an ambiguity problem. To obtain the correct translation, the most suitable word has to be selected according to the meaning of the sentence. Rule-based methods do not solve this problem well, so corpus-based statistical methods have been suggested. To choose the best word, such methods compute word frequencies from a corpus.
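
The idea of frequency-based word selection can be illustrated with a small Python sketch. It is a simplification and not the decoder of the cited system: the toy corpus of root sequences is invented for this example, the scoring uses add-one smoothed bigram counts as one possible choice, and the function names are hypothetical.

```python
from collections import Counter
from itertools import product
from typing import List, Sequence, Tuple

# Invented toy corpus of target-language root sequences (illustration only).
TOY_CORPUS: List[List[str]] = [
    ["kim", "adam", "konuş"],
    ["kim", "insan", "konuş"],
    ["ne", "insan", "söyle"],
    ["kim", "adam", "söyle"],
    ["kim", "adam", "konuş"],
]


def count_ngrams(corpus: Sequence[Sequence[str]]) -> Tuple[Counter, Counter]:
    """Count unigrams and bigrams, with <s> marking the sentence start."""
    unigrams: Counter = Counter()
    bigrams: Counter = Counter()
    for sentence in corpus:
        padded = ["<s>"] + list(sentence)
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def score(seq: Sequence[str], unigrams: Counter, bigrams: Counter) -> float:
    """Product of add-one smoothed bigram probabilities of the sequence."""
    vocab_size = len(unigrams)
    padded = ["<s>"] + list(seq)
    prob = 1.0
    for prev, cur in zip(padded, padded[1:]):
        prob *= (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
    return prob


def best_translation(candidates, corpus) -> Tuple[str, ...]:
    """Try every combination of candidates and keep the most probable one."""
    unigrams, bigrams = count_ngrams(corpus)
    return max(product(*candidates), key=lambda s: score(s, unigrams, bigrams))


# Each source word has two candidate translations.
CANDIDATES = [("ne", "kim"), ("insan", "adam"), ("konuş", "söyle")]
best = best_translation(CANDIDATES, TOY_CORPUS)
print(best)
```

With this toy corpus the sequence “kim adam konuş” receives the highest score. Exhaustive enumeration is used only because the example is tiny; a realistic system would need a large corpus and a more efficient search.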

Fig. 1. The process of decoding the most probable target language sentence [Tantuğ et al., 2006: 114]

The system looks for relatively similar sentences in the corpus and selects the candidate word with the highest frequency (see Fig. 1). The figure shows this calculation for three words translated from Turkmen into Turkish, each with two candidate translations: “ne” or “kim”, “insan” or “adam”, and “konuş” or “söyle”. Although Turkish and the other Turkic languages are closely related, research results for Turkish cannot be applied to them directly. Uyghur has therefore been studied independently, and some preliminary results have been obtained with a system that translates from Uyghur into Turkish [Orhun 2010]. This translation system uses a rule-based word sense disambiguation model instead of statistical methods, so it computes faster than statistical methods. For example, the sentence “men bir qelem aldim”, which means “I bought a pen”, can be analyzed as follows:

men: men+Pron+Pers+A1sg+Pnon+Nom
bir: bir+Num+Card
qelem: qelem+Noun+A3sg+Pnon+Nom (pen)
qelem: qele+Noun+A3sg+P1sg+Nom (my castle)
aldim: aldi+Noun+A3sg+P1sg+Nom
aldim: al+Verb+Pos+Past+A1sg

Among these solutions, the word “qelem” has two different analyses. To decide which one is correct, the first word, “men”, has to be taken into account. The word “qelem” is related to the subject of the sentence, “men”, and the subject does not take the personal suffix “P1sg”. The solution “my castle” is therefore discarded and the other solution is accepted as correct. The drawback is that only a limited number of rules have been defined, so the system cannot give correct results for cases the rules do not cover.
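
A rule of this kind can be sketched in a few lines of Python. The sketch is an assumption-laden illustration, not the rule set of the system described: it encodes one crude rule (if the sentence starts with a nominative pronoun, noun readings carrying “P1sg” are dropped whenever another reading remains), and the names has_tag and disambiguate are hypothetical.

```python
from typing import Dict, List

# Words and analyses as listed in the example above.
SENTENCE = ["men", "bir", "qelem", "aldim"]
ANALYSES: Dict[str, List[str]] = {
    "men": ["men+Pron+Pers+A1sg+Pnon+Nom"],
    "bir": ["bir+Num+Card"],
    "qelem": ["qelem+Noun+A3sg+Pnon+Nom", "qele+Noun+A3sg+P1sg+Nom"],
    "aldim": ["aldi+Noun+A3sg+P1sg+Nom", "al+Verb+Pos+Past+A1sg"],
}


def has_tag(analysis: str, tag: str) -> bool:
    """True if the '+'-separated analysis contains the given tag."""
    return tag in analysis.split("+")


def disambiguate(sentence: List[str],
                 analyses: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Apply the single illustrative rule and return the surviving analyses."""
    result = {word: list(analyses[word]) for word in sentence}
    first = result[sentence[0]]
    subject_is_pronoun = any(
        has_tag(a, "Pron") and has_tag(a, "Nom") for a in first
    )
    if subject_is_pronoun:
        for word in sentence[1:]:
            kept = [
                a for a in result[word]
                if not (has_tag(a, "Noun") and has_tag(a, "P1sg"))
            ]
            # Never discard every reading: keep the originals if nothing survives.
            if kept:
                result[word] = kept
    return result


resolved = disambiguate(SENTENCE, ANALYSES)
print(resolved["qelem"])
```

In this sketch the same rule also removes the noun reading of “aldim”. A sentence in which the speaker really refers to a possessed noun would be handled wrongly, which mirrors the drawback of a limited rule set noted above.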

3. Restrictions on Machine Translation
To implement a machine translation system from Uyghur to Turkish, a morphological analyzer was implemented as described in [Orhun et al., 2009: 33-43], [Orhun et al., 2009: 811-816]. In its current version, this analyzer analyzes about 88 percent of contemporary Uyghur words correctly. The accuracy is not higher because some words are not originally Uyghur and cannot be analyzed with the general rules: contemporary Uyghur contains Chinese and English words as well as Persian and Arabic ones. Another reason is that Uyghur verbs have a very complex structure and there are many auxiliary verbs. Whenever suffixes are attached to a verb, the root or the derived word form is modified. Moreover, auxiliary verbs were not considered in [Orhun et al., 2009: 811-816], so this remains an open topic. Because the morphological problem has not been solved properly, the disambiguation problem is difficult to solve, and without solving it a correct translation of the source language cannot be obtained. The system introduced in [Orhun 2010] includes rules that are defined on word classes derived from the morphological analyses. For example, let us analyze the following Uyghur sentence, which means “he/she will write a letter”.

u: u+Pron+Pers+A3sg+Pnon+Nom
xet: xet+Noun+A3sg+Pnon+Nom
yazmaqci: yaz+Verb+Pos+Fut+A3sg
yazmaqci: yaz+Verb+Pos^DB+Noun+Inf1+A3sg+Pnon+Nom^DB+Adj+Agt
yazmaqci: yaz+Verb+Pos^DB+Noun+Inf1+A3sg+Pnon+Nom^DB+Noun+Agt+A3sg+Pnon+Nom
yazmaqci: yaz+Verb+Pos+Inten+A3sg

After the sentence has been analyzed, the word “yazmaqci” (going to write) has four different solutions. In fact, one further solution, in which “yaz” is the noun “summer”, has already been eliminated:

yaz: yaz+Noun+A3sg+Pnon+Nom

The reason is that whenever the suffix “maqci” is attached to a word, that word is definitely a verb, because “maqci” forms the future tense of a verb. The solution with the noun reading is therefore eliminated automatically, even though the root on its own could be resolved as a noun. A system that works without rule definitions would have to be statistical. Unfortunately, no general corpus exists today for Uyghur. Although some work on corpus construction has started, the result is not available for public or academic research [Aibaidulla et al., 2003: 228-234]. Constructing a corpus is very expensive and takes a long time. After the morphological ambiguities have been resolved, the translation of each source word is looked up in a bilingual dictionary and the target sentence is generated. In general, all Turkic languages have a limited number of root words, and other words are formed by adding suffixes. Once the root words have been translated correctly, the remaining word forms can be obtained by applying rules or statistical calculations. Among the Turkic languages, a treebank corpus exists only for Turkish [Oflazer et al., 2003: 261-277], which is a drawback for implementing translation systems between different Turkic languages.
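
The “maqci” elimination rule discussed earlier in this section can be sketched as follows. This Python fragment is a minimal illustration under stated assumptions, not the implementation of the analyzer: the two-entry root lexicon, the list of verb-only suffixes and the function name root_candidates are hypothetical.

```python
from typing import Dict, List

# Hypothetical root lexicon: "yaz" can be a noun (summer) or a verb (write).
ROOT_ANALYSES: Dict[str, List[str]] = {
    "yaz": ["yaz+Noun+A3sg+Pnon+Nom", "yaz+Verb+Pos"],
}

# Assumption: suffixes listed here attach to verbs only.
VERB_ONLY_SUFFIXES = ("maqci",)


def root_candidates(surface: str, root: str) -> List[str]:
    """Return the root readings compatible with the suffix that follows."""
    candidates = ROOT_ANALYSES[root]
    suffix = surface[len(root):]
    if suffix.startswith(VERB_ONLY_SUFFIXES):
        # The tag right after the root is its part of speech.
        candidates = [c for c in candidates if c.split("+")[1] == "Verb"]
    return candidates


with_suffix = root_candidates("yazmaqci", "yaz")
bare_root = root_candidates("yaz", "yaz")
print(with_suffix, bare_root)
```

For “yazmaqci” only the verb reading survives, while the bare root “yaz” keeps both readings, so the noun solution never reaches the disambiguation step.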

4. Conclusion
In this paper, some computer-based language analysis methods related to Turkic languages have been introduced briefly, together with recent research on Uyghur. The outcome is that no fully functional morphological analyzer yet exists for Uyghur. It is therefore still too early for large-scale computational research on this language, and for the same reason a well-implemented machine translation system from or into Uyghur is not possible at the moment. In the Internet age, machine translation is an unavoidable trend. Google Translate, for example, is a practical system used in everyday life. Even though it cannot always give a correct or well-structured translation, it still conveys the gist of the source text. All Turkic languages are close to each other, and if a common corpus is built for all of them, it will be possible to implement a general machine translation system for all Turkic languages.
