A Corpus-based Approach to the Chinese Word Segmentation

For a society based upon laws and reason, it has become too easy for us to believe that we live in a world without them. And given that our linguistic wisdom was originally motivated by the search for rules, it seems strange that we now consider these rules to be the exceptions and take exceptions as the norm. The current task of contemporary computational linguistics is to describe these exceptions. In particular, for most language processing needs it suffices to describe the argument and predicate within an elementary sentence, under the framework of local grammar. Therefore, a corpus-based approach to the Chinese Word Segmentation problem is proposed, as the first step towards a local grammar for the Chinese language. The two main issues with existing lexicon-based approaches are (a) the classification of unknown character sequences, i.e. sequences that are not listed in the lexicon, and (b) the disambiguation of situations where two candidate words overlap. For (a), we propose an automatic method of enriching the lexicon by comparing candidate sequences to occurrences of the same strings in a manually segmented reference corpus, and using machine learning methods to select the optimal segmentation for them. These methods are developed in the course of the thesis specifically for this task. The possibility of applying these machine learning methods to the NP-extraction and alignment domains is also discussed. (b) is approached by designing a general processing framework for Chinese text, which is called multi-level processing. Under this framework, sentences are recursively split into fragments according to language-specific but domain-independent heuristics. The resulting fragments then define the ultimate boundaries between candidate words and therefore resolve any segmentation ambiguity caused by overlapping sequences. A new shallow semantic annotation is also proposed under the framework of multi-level processing.
A word segmentation algorithm based on these principles has been implemented and tested; results of the evaluation are given and compared to the performance of previous approaches as reported in the literature. The first chapter of this thesis discusses the goals of segmentation and introduces some background concepts. The second chapter analyses the current state-of-the-art approach to Chinese language segmentation. Chapter 3 proposes a new corpus-based approach to the identification of unknown words. In chapter 4, a new shallow semantic annotation is proposed under the framework of multi-level processing.
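
To make the two issues named in the abstract concrete, the following is a minimal Python sketch of a generic lexicon-based segmenter (greedy forward maximum matching). It is not the algorithm of the thesis: the toy lexicon, the function names `forward_max_match`, `overlapping_candidates` and `segment_fragments`, the `max_len` parameter, and the hand-placed fragment boundary are all assumptions made for illustration. The sketch shows how a character outside the lexicon falls out as a stray single-character token, how two lexicon words can overlap on the same character, and how a fragment boundary, once given, removes that overlap.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right longest match against the lexicon.

    Characters that start no lexicon word are emitted as single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


def overlapping_candidates(text, lexicon, max_len=4):
    """Return pairs of multi-character lexicon words whose spans partially overlap."""
    spans = [
        (start, end)
        for start in range(len(text))
        for end in range(start + 2, min(start + max_len, len(text)) + 1)
        if text[start:end] in lexicon
    ]
    pairs = []
    for a_start, a_end in spans:
        for b_start, b_end in spans:
            # Partial overlap: b begins inside a and ends after a.
            if a_start < b_start < a_end < b_end:
                pairs.append((text[a_start:a_end], text[b_start:b_end]))
    return pairs


def segment_fragments(fragments, lexicon, max_len=4):
    """Segment each fragment on its own, so no word can cross a fragment boundary."""
    tokens = []
    for fragment in fragments:
        tokens.extend(forward_max_match(fragment, lexicon, max_len))
    return tokens


# Hypothetical toy lexicon and sentence (not taken from the thesis).
LEXICON = {"研究", "研究生", "生命", "起源"}
SENTENCE = "研究生命起源"

greedy = forward_max_match(SENTENCE, LEXICON)
overlaps = overlapping_candidates(SENTENCE, LEXICON)
# The boundary after "研究" is supplied by hand here; the thesis derives
# such boundaries from heuristics that the abstract does not spell out.
bounded = segment_fragments(["研究", "生命起源"], LEXICON)

print(greedy)
print(overlaps)
print(bounded)
```

In this toy case the greedy pass commits to 研究生 and leaves 命 stranded, while the pre-split input segments as 研究 / 生命 / 起源. The sketch only illustrates why fixed fragment boundaries can settle overlap ambiguity; how the boundaries are found, and how unknown sequences are classified against a reference corpus, is the subject of the thesis itself.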

chinese word segmentation

Liu, Lezhong

05. Jul. 2005

2005

English

Universitätsbibliothek der Ludwig-Maximilians-Universität München

https://nbn-resolving.org/urn:nbn:de:bvb:19-56621

Liu, Lezhong (2005): A Corpus-based Approach to the Chinese Word Segmentation. Dissertation, LMU München: Fakultät für Sprach- und Literaturwissenschaften

PDF
Liu_Lezhong.pdf
1MB

DOI: 10.5282/edoc.5662

URN: urn:nbn:de:bvb:19-56621

Document type:	Dissertation (Dissertation, LMU München)
Keywords:	chinese word segmentation
Subject areas:	400 Language > 490 Other languages; 400 Language
Faculty:	Fakultät für Sprach- und Literaturwissenschaften
Language of the thesis:	English
Date of oral examination:	5 July 2005
First referee:	Guenthner, Franz
MD5 checksum of the PDF file:	8d6414ef6915e44f2510b0c2e7dff7f8
Shelf mark of the printed edition:	0001/UMC 15552
ID Code:	5662
Deposited on:	7 Aug. 2006
Last modified:	24 Oct. 2020 09:20