Distributed representations for multilingual language processing

Distributed representations are a central element of natural language processing. Units of text such as words, n-grams, or characters are mapped to real-valued vectors so that they can be processed by computational models. Representations trained on large amounts of text, called static word embeddings, have been found to work well across a variety of tasks such as sentiment analysis or named entity recognition. More recently, pretrained language models have been used as contextualized representations, which have been found to yield even better task performance.

Multilingual representations that are invariant with respect to language are useful for several reasons. Models using such representations would require training data in only one language and still generalize across multiple languages. This is especially useful for languages that suffer from data sparsity. Further, machine translation models can benefit from source and target representations in the same space. Last, knowledge extraction models could access data not only in English but in any natural language and thus exploit a richer source of knowledge. Given that several thousand languages exist in the world, the need for multilingual language processing seems evident. However, it is not immediately clear which properties multilingual embeddings should exhibit, how current multilingual representations work, and how they could be improved. This thesis investigates some of these questions.

In the first publication, we explore the boundaries of multilingual representation learning by creating an embedding space across more than one thousand languages. We analyze existing methods and propose concept-based embedding learning methods. The second paper investigates differences between creating representations for one thousand languages with little data and considering few languages with abundant data. In the third publication, we refine a method to obtain interpretable subspaces of embeddings. This method can be used to investigate the workings of multilingual representations. The fourth publication finds that multilingual pretrained language models exhibit a high degree of multilinguality in the sense that high-quality word alignments can be easily extracted. The fifth paper investigates reasons why multilingual pretrained language models are multilingual despite lacking any kind of crosslingual supervision during training. Based on our findings, we propose a training scheme that leads to improved multilinguality. Last, the sixth paper investigates the use of multilingual pretrained language models as multilingual knowledge bases.
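
As a rough illustration of two ideas in the abstract, units of text mapped to real-valued vectors and word alignments read off a shared multilingual space, here is a minimal Python sketch. The toy vectors, the `EMBEDDINGS` table, and the mutual-nearest-neighbour rule in `align` are assumptions made for this example; they are not the methods developed in the thesis.

```python
import math

# Toy vectors invented for illustration; real embeddings are learned from text.
# Both languages are assumed to live in one shared three-dimensional space.
EMBEDDINGS = {
    "en": {"house": [0.9, 0.1, 0.0], "water": [0.1, 0.9, 0.1], "red": [0.0, 0.2, 0.9]},
    "de": {"Haus": [0.8, 0.2, 0.1], "Wasser": [0.2, 0.8, 0.0], "rot": [0.1, 0.1, 0.9]},
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def align(source_words, target_words, source_space, target_space):
    """Return (i, j) index pairs that are mutual nearest neighbours."""
    sim = [
        [cosine(source_space[s], target_space[t]) for t in target_words]
        for s in source_words
    ]
    pairs = []
    for i, row in enumerate(sim):
        # Best target position for source word i.
        j = max(range(len(row)), key=row.__getitem__)
        # Keep the pair only if i is also the best source position for j.
        best_i = max(range(len(sim)), key=lambda k: sim[k][j])
        if best_i == i:
            pairs.append((i, j))
    return pairs


source = ["red", "water"]
target = ["Wasser", "rot"]
alignment = align(source, target, EMBEDDINGS["en"], EMBEDDINGS["de"])
for i, j in alignment:
    print(source[i], "<->", target[j])
```

With these toy vectors, "red" pairs with "rot" and "water" with "Wasser", even though the word order differs between the two lists.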

Dufter, Philipp

28 April 2021

2021

English

University Library of the Ludwig-Maximilians-Universität München

https://nbn-resolving.org/urn:nbn:de:bvb:19-280144

Dufter, Philipp (2021): Distributed representations for multilingual language processing. Dissertation, LMU München: Faculty of Mathematics, Informatics and Statistics

License: Creative Commons Attribution 4.0 (CC BY)
PDF
Dufter_Philipp.pdf
4MB

DOI: 10.5282/edoc.28014

URN: urn:nbn:de:bvb:19-280144

Document type:	Dissertations (Dissertation, LMU München)
Keywords:	natural language processing, representation learning, multilinguality, machine learning, word embeddings
Subject areas:	000 General, computer science, information science; 000 General, computer science, information science > 004 Computer science
Faculties:	Faculty of Mathematics, Informatics and Statistics
Language of the thesis:	English
Date of oral examination:	28 April 2021
First reviewer:	Schütze, Hinrich
MD5 checksum of the PDF file:	1a1b6e2a6a72c53d76b717e2cef4a6d7
Shelf mark of the printed edition:	0001/UMC 27933
ID Code:	28014
Deposited on:	27 May 2021 10:08
Last modified:	27 May 2021 10:08
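
The MD5 value listed in this record can be used to check that a downloaded copy of the PDF is intact. A minimal Python sketch follows; the helper names `md5_of_file` and `matches_record` are hypothetical, and the local path in the usage note assumes the file was saved under its listed name in the current directory.

```python
import hashlib

# MD5 value as listed in the record for the PDF file.
EXPECTED_MD5 = "1a1b6e2a6a72c53d76b717e2cef4a6d7"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, expected=EXPECTED_MD5):
    """True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()
```

For example, `matches_record("Dufter_Philipp.pdf")` would return `True` for an unmodified copy. MD5 is suited here only to detecting accidental corruption, not deliberate tampering.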