Severini, Silvia (2023): Character-level and syntax-level models for low-resource and multilingual natural language processing. Dissertation, LMU München: Faculty of Mathematics, Computer Science and Statistics
Licence: Creative Commons Attribution 4.0 (CC-BY). File: Severini_Silvia.pdf (3 MB)
Abstract
There are more than 7000 languages in the world, but only a small portion of them benefit from Natural Language Processing resources and models. Although languages differ widely in their characteristics, “cross-lingual bridges” such as transliteration signals and word alignment links can be exploited. This information, together with the availability of multiparallel corpora and the need to overcome language barriers, motivates us to build models that represent more of the world’s languages. This thesis investigates cross-lingual links for improving the processing of low-resource languages with language-agnostic models at the character and syntax levels. Specifically, we propose to (i) use orthographic similarities and transliteration between named entities and rare words in different languages to improve the construction of Bilingual Word Embeddings (BWEs) and named entity resources, and (ii) exploit multiparallel corpora to project labels from high- to low-resource languages, thereby gaining access to weakly supervised processing methods for the latter.

In the first publication, we describe our approach for improving the translation of rare words and named entities in the Bilingual Dictionary Induction (BDI) task, using orthography and transliteration information. In our second work, we tackle BDI by enriching BWEs with orthography embeddings and a number of other features, using our classification-based system to overcome script differences among languages. The third publication describes cheap cross-lingual signals that should be considered when building mapping approaches for BWEs: they are simple to extract, effective for bootstrapping the mapping of BWEs, and robust where unsupervised methods fail. The fourth paper presents our approach for extracting a named entity resource for 1340 languages, including very low-resource languages from all major areas of linguistic diversity; by exploiting parallel corpus statistics and transliteration models, we obtain improved performance over prior work. Lastly, the fifth work models annotation projection as a graph-based label propagation problem for the part-of-speech tagging task. Part-of-speech models trained on our labeled sets outperform prior work for low-resource languages such as Bambara (an African language spoken in Mali), Erzya (a Uralic language spoken in Russia’s Republic of Mordovia), Manx (the Celtic language of the Isle of Man), and Yoruba (a Niger-Congo language spoken in Nigeria and surrounding countries).
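The idea of annotation projection as graph-based label propagation (the fifth publication) can be illustrated with a toy sketch. This is not the thesis implementation: the function name, edge format, and majority-vote update rule below are illustrative assumptions. Words are graph nodes; high-resource words carry gold POS tags, which spread to unlabeled low-resource words over alignment edges.

```python
# Illustrative sketch only, not the thesis code: POS annotation projection
# framed as iterative label propagation over a word-alignment graph.
from collections import Counter, defaultdict

def propagate_labels(edges, seed_labels, iterations=10):
    """edges: list of (word_a, word_b) alignment links (undirected).
    seed_labels: dict mapping labeled (high-resource) words to POS tags.
    Returns dict mapping words to predicted tags; seeds stay fixed."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    labels = dict(seed_labels)
    for _ in range(iterations):
        updated = dict(labels)
        for node in graph:
            if node in seed_labels:
                continue  # gold labels are never overwritten
            # Majority vote over currently labeled neighbors
            votes = Counter(labels[n] for n in graph[node] if n in labels)
            if votes:
                updated[node] = votes.most_common(1)[0][0]
        if updated == labels:  # converged: no label changed
            break
        labels = updated
    return labels

# Example: English tags projected to (hypothetical) aligned French words
tags = propagate_labels(
    edges=[("the", "le"), ("cat", "chat")],
    seed_labels={"the": "DET", "cat": "NOUN"},
)
```

Real systems additionally weight edges by alignment confidence and aggregate over many parallel sentences; this sketch keeps only the core propagation step.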
| Item Type | Theses (Dissertation, LMU Munich) |
|---|---|
| Keywords | natural language processing, multilinguality, machine learning, transliteration, bilingual dictionary induction, bilingual word embeddings |
| Subjects | 000 Computers, Information and General Reference > 004 Data processing, computer science |
| Faculties | Faculty of Mathematics, Computer Science and Statistics |
| Language | English |
| Date of oral examination | 5 July 2023 |
| 1. Referee | Schütze, Hinrich |
| MD5 checksum of the PDF file | 6976c878f998c416ffbb0a965e70e341 |
| Signature of the printed copy | 0001/UMC 29731 |
| ID Code | 32094 |
| Deposited On | 26 Jul 2023 13:29 |
| Last Modified | 14 Aug 2023 12:36 |