Weißweiler, Leonie (2024): Computational approaches to construction grammar and morphology. Dissertation, LMU München: Fakultät für Sprach- und Literaturwissenschaften
PDF: weissweiler_leonie_alexandra.pdf (8 MB)
Abstract
For the past 100 years, there has been a debate in Linguistics and Natural Language Processing (NLP) over the mechanisms underlying human linguistic capabilities and the best methods to represent them computationally. Pretrained language models (PLMs) have even been proposed as proxies that are easier to study than language processing in the human mind, but it is first necessary to assess how well they currently model language and to investigate the mechanisms by which they do so. This thesis proposes to do this with diverse and novel methodology from Linguistics, which enables us to target rarer and less compositional phenomena that may challenge the models. To develop methods for evaluating PLMs' linguistic capabilities, we first propose to evaluate their ability to represent and learn constructions. Constructions are form-meaning pairings at any level of granularity. A classic example of a well-described construction is the English Comparative Correlative, i.e. "The X-er, the Y-er". We develop novel probing and evaluation methods and show that modern PLMs have mostly acquired the syntactic structure of constructions, but that even state-of-the-art large PLMs struggle with the non-compositional meaning attached to them. We also evaluate PLMs' capacity for morphological generalisation, which is the process of applying a learned pattern to the formation of new words. We find that while PLMs are remarkably human-like in their generalisation to novel words, they still make errors and rely on different mechanisms than humans. These results show that while large PLMs have come remarkably close to human linguistic capabilities, we can still find areas where improvement is necessary.

Examining what modern NLP can contribute to Linguistics, we first tackle the lack of annotated data for Construction Grammar (CxG). As it is currently not possible to annotate or parse constructions fully automatically, we propose human-in-the-loop strategies to aid linguists in creating corpora. We show the results of a community project to introduce a CxG layer into the Universal Dependencies treebanks. We further develop a hybrid annotation pipeline that uses large LMs to reduce human annotation effort, thereby enabling the cost-efficient creation of corpora for very rare phenomena. Lastly, we show how highly parallel corpora can be used for the unsupervised induction of morphological structure for low-resource languages.
| Document type: | Dissertation (Dissertation, LMU München) |
|---|---|
| Subject areas: | 400 Language; 400 Language > 410 Linguistics |
| Faculty: | Fakultät für Sprach- und Literaturwissenschaften |
| Language of the thesis: | English |
| Date of oral examination: | 3 July 2024 |
| 1st referee: | Schütze, Hinrich |
| MD5 checksum of the PDF file: | 5e2e37305ec883ef8812e33f47a4343f |
| Shelf mark of the printed edition: | 0001/UMC 31564 |
| ID code: | 35935 |
| Deposited on: | 12 Nov. 2025 13:12 |
| Last modified: | 12 Nov. 2025 13:13 |