Severini, Silvia (2023): Character-level and syntax-level models for low-resource and multilingual natural language processing. Dissertation, LMU München: Faculty of Mathematics, Computer Science and Statistics
Licence: Creative Commons Attribution 4.0 (CC-BY). File: Severini_Silvia.pdf (3 MB)
Abstract
There are more than 7000 languages in the world, but only a small portion of them benefit from Natural Language Processing resources and models. Although languages differ widely in their characteristics, “cross-lingual bridges” such as transliteration signals and word alignment links can be exploited. This information, together with the availability of multiparallel corpora and the need to overcome language barriers, motivates us to build models that represent more of the world’s languages. This thesis investigates cross-lingual links for improving the processing of low-resource languages with language-agnostic models at the character and syntax levels. Specifically, we propose to (i) use orthographic similarities and transliteration between named entities and rare words in different languages to improve the construction of Bilingual Word Embeddings (BWEs) and named entity resources, and (ii) exploit multiparallel corpora to project labels from high- to low-resource languages, thereby gaining access to weakly supervised processing methods for the latter.

In the first publication, we describe our approach for improving the translation of rare words and named entities in the Bilingual Dictionary Induction (BDI) task, using orthography and transliteration information. In our second work, we tackle BDI by enriching BWEs with orthography embeddings and a number of other features, using our classification-based system to overcome script differences among languages. The third publication describes cheap cross-lingual signals that should be considered when building mapping approaches for BWEs: they are simple to extract, effective for bootstrapping the mapping of BWEs, and robust where unsupervised methods fail. The fourth paper presents our approach for extracting a named entity resource for 1340 languages, including very low-resource languages from all major areas of linguistic diversity; by exploiting parallel corpus statistics and transliteration models, we obtain improved performance over prior work. Lastly, the fifth work models annotation projection as a graph-based label propagation problem for the part-of-speech tagging task. Part-of-speech models trained on our labeled sets outperform prior work for low-resource languages such as Bambara (an African language spoken in Mali), Erzya (a Uralic language spoken in Russia’s Republic of Mordovia), Manx (the Celtic language of the Isle of Man), and Yoruba (a Niger-Congo language spoken in Nigeria and surrounding countries).
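The idea of annotation projection as graph-based label propagation (the fifth publication) can be illustrated with a toy sketch. This is not the thesis implementation: the function name, edge format, and majority-vote update rule below are illustrative assumptions. Words are graph nodes; high-resource words carry gold POS tags, which spread to unlabeled low-resource words over alignment edges.

```python
# Illustrative sketch only, not the thesis code: POS annotation projection
# framed as iterative label propagation over a word-alignment graph.
from collections import Counter, defaultdict

def propagate_labels(edges, seed_labels, iterations=10):
    """edges: list of (word_a, word_b) alignment links (undirected).
    seed_labels: dict mapping labeled (high-resource) words to POS tags.
    Returns dict mapping words to predicted tags; seeds stay fixed."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    labels = dict(seed_labels)
    for _ in range(iterations):
        updated = dict(labels)
        for node in graph:
            if node in seed_labels:
                continue  # gold labels are never overwritten
            # Majority vote over currently labeled neighbors
            votes = Counter(labels[n] for n in graph[node] if n in labels)
            if votes:
                updated[node] = votes.most_common(1)[0][0]
        if updated == labels:  # converged: no label changed
            break
        labels = updated
    return labels

# Example: English tags projected to (hypothetical) aligned French words
tags = propagate_labels(
    edges=[("the", "le"), ("cat", "chat")],
    seed_labels={"the": "DET", "cat": "NOUN"},
)
```

Real systems additionally weight edges by alignment confidence and aggregate over many parallel sentences; this sketch keeps only the core propagation step.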
| Item Type | Theses (Dissertation, LMU Munich) |
|---|---|
| Keywords | natural language processing, multilinguality, machine learning, transliteration, bilingual dictionary induction, bilingual word embeddings |
| Subjects | 000 Computers, Information and General Reference > 004 Data processing, computer science |
| Faculties | Faculty of Mathematics, Computer Science and Statistics |
| Language | English |
| Date of oral examination | 5 July 2023 |
| 1. Referee | Schütze, Hinrich |
| MD5 checksum of the PDF file | 6976c878f998c416ffbb0a965e70e341 |
| Signature of the printed copy | 0001/UMC 29731 |
| ID Code | 32094 |
| Deposited On | 26 Jul 2023 13:29 |
| Last Modified | 14 Aug 2023 12:36 |