Attention-based neural sequence-to-sequence methods for information extraction and text summarization
Author: Karn, Sanjeev Kumar
Year: 2024
Language: English
Publisher: Universitätsbibliothek der Ludwig-Maximilians-Universität München
Citation: Karn, Sanjeev Kumar (2024): Attention-based neural sequence-to-sequence methods for information extraction and text summarization. Dissertation, LMU München: Fakultät für Mathematik, Informatik und Statistik.
Full text: Karn_Sanjeev_Kumar.pdf (PDF, 3MB)
Abstract

Natural language processing (NLP) is an essential technology in the information age. A large variety of machine learning models power NLP applications. Recently, deep learning approaches have achieved strong performance across a broad range of NLP tasks. Such models are often trained end-to-end and are more efficient and cost-effective than traditional task-specific feature engineering. In this dissertation, we focus on several NLP tasks using a specific deep learning technique -- sequence-to-sequence (seq2seq) learning.

First, we explore the fine-grained entity mention classification problem, a naturally occurring instance of hierarchical learning, where an entity mention in a given context or sentence can have one or more fine-grained types; e.g., Obama is both a politician and an author in a context in which his election is related to his prior success as a best-selling author. We model the structure of the type hierarchy more directly than the standard flat and local classification approaches. This reformulation of the problem and use of a seq2seq model has the advantage, compared to prior work on hierarchical entity classification, that our architecture can be trained end-to-end (sketched below). Experiments show that our model outperforms prior work on the FIGER dataset.

Second, we investigate a key problem in information extraction: entity-driven relation extraction. Given a large text corpus, a query entity Q (e.g., Q = "Steve Jackson"), and a predefined relational schema, a system has to extract a set of facts from the corpus that conform to this schema, e.g., "Q authored notable work with title X" linked to the schema class per:notable_work. We define the task of Open-Type Relation Argument Extraction (ORAE), in which the model has to extract relation arguments without relying on an entity extractor to find argument candidates; instead, the relation extraction module performs this task implicitly. We propose a set of relation extraction models, ranging from traditional CRF-based sequence taggers to seq2seq-based pointer networks (sketched below).

Third, we define the task of teaser generation and provide an evaluation benchmark and baseline systems for it. A teaser is a short reading suggestion for an article that is illustrative and includes curiosity-arousing elements to entice potential readers to read particular news items. We compile a novel dataset of teasers by systematically collecting tweets and selecting those that conform to the teaser definition. We compare several seq2seq-based abstractive summarization systems on the task of teaser generation.

Fourth, we address summarization of interleaved text in a low-resource setting. Interleaved text, common in online chats, arises when posts belonging to different threads occur in a single sequence, which makes it time-consuming to obtain an overview of the discussion. An end-to-end trainable summarization system obviates the need for explicit disentanglement; however, such a system requires a large amount of labeled data. To address this, we propose to pretrain an end-to-end trainable hierarchical seq2seq system on synthetic interleaved texts (sketched below). We show that after fine-tuning on a real-world meeting dataset (AMI), such a system outperforms a traditional two-step system by 22%.
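To make the seq2seq framings above more concrete, a few illustrative sketches follow. They are rough approximations written for this page, not code from the dissertation. The first shows how a decoder can emit a fine-grained type as a root-to-leaf path over the hierarchy, as in the first contribution; the hierarchy fragment, module names, and sizes are all invented.

```python
import torch
import torch.nn as nn

# Toy type inventory: a tiny fragment of a FIGER-like hierarchy.
TYPE_VOCAB = ["<s>", "</s>", "/person", "/person/politician", "/person/author"]
WORD_VOCAB, HID = 100, 64

class TypePathDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed_words = nn.Embedding(WORD_VOCAB, HID)
        self.encoder = nn.GRU(HID, HID, batch_first=True)
        self.embed_types = nn.Embedding(len(TYPE_VOCAB), HID)
        self.decoder = nn.GRU(HID, HID, batch_first=True)
        self.out = nn.Linear(HID, len(TYPE_VOCAB))

    def forward(self, sentence_ids, type_prefix_ids):
        # Encode the mention in context; the final encoder state seeds the decoder.
        _, h = self.encoder(self.embed_words(sentence_ids))
        dec_out, _ = self.decoder(self.embed_types(type_prefix_ids), h)
        return self.out(dec_out)  # per-step logits over the type vocabulary

model = TypePathDecoder()
sentence = torch.randint(0, WORD_VOCAB, (1, 12))   # toy token ids for the context
gold_path = torch.tensor([[0, 2, 3]])              # <s> /person /person/politician
logits = model(sentence, gold_path[:, :-1])        # teacher forcing on the path prefix
loss = nn.functional.cross_entropy(
    logits.reshape(-1, len(TYPE_VOCAB)), gold_path[:, 1:].reshape(-1))
loss.backward()  # the whole architecture trains end-to-end
```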
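The second sketch illustrates the pointer-network idea behind ORAE, again under invented names and sizes: rather than classifying pre-extracted entity candidates, attention scores over the encoder states directly point at the start and end tokens of a relation argument.

```python
import torch
import torch.nn as nn

HID, VOCAB = 64, 100

class ArgumentPointer(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HID)
        self.encoder = nn.GRU(HID, HID, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * HID, HID)
        # Two query vectors: one for the argument start, one for the end.
        self.queries = nn.Parameter(torch.randn(2, HID))

    def forward(self, token_ids):
        states, _ = self.encoder(self.embed(token_ids))   # (B, T, 2H)
        keys = self.proj(states)                          # (B, T, H)
        # Dot-product attention scores serve as pointer logits over token positions.
        return torch.einsum("qh,bth->bqt", self.queries, keys)  # (B, 2, T)

model = ArgumentPointer()
tokens = torch.randint(0, VOCAB, (1, 15))
start_end_logits = model(tokens)
start = start_end_logits[0, 0].argmax().item()
end = start_end_logits[0, 1].argmax().item()
print(f"predicted argument span: tokens {start}..{end}")
```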
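The third sketch illustrates the data construction behind the fourth contribution, which needs no learning machinery: posts from separate threads are merged into one stream while each thread's internal order is preserved, and the per-thread summaries (not shown) serve as pretraining targets. The function and example posts are invented for illustration.

```python
import random

def interleave(threads):
    """threads: list of lists of posts; returns one interleaved stream."""
    pools = [list(t) for t in threads]
    stream = []
    while any(pools):
        # Pick a random thread that still has posts and emit its next post,
        # so intra-thread order is kept while threads mix in the stream.
        i = random.choice([k for k, p in enumerate(pools) if p])
        stream.append((i, pools[i].pop(0)))
    return stream

threads = [
    ["A1: Did the build fail?", "A2: Yes, on the linker step."],
    ["B1: Lunch at noon?", "B2: Works for me."],
]
for thread_id, post in interleave(threads):
    print(thread_id, post)
```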
Fifth, we investigate the radiology report summarization task. The Impressions section of a radiology report about an imaging study summarizes the radiologist's reasoning and conclusions, and it also aids the referring physician in confirming or excluding certain diagnoses. Automatically generating an abstractive summary of a typically information-rich radiology report requires identifying the salient content of the report and generating a concise, easily consumable Impressions section. To achieve this, we design a two-step approach: extractive summarization followed by abstractive summarization. We additionally break the extractive part down into two independent tasks: extraction of salient (1) sentences and (2) keywords. We show that our approach yields a more precise summary than single-step and two-step-with-single-extractive-process baselines, with an overall F1 improvement of 3-4%.
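A final sketch of the fifth contribution's two-step design, with toy heuristics standing in for the learned extractors and the abstractive seq2seq step omitted: keyword and sentence extraction run independently, and their outputs would jointly condition the abstractive model.

```python
from collections import Counter

def extract_keywords(report, k=5):
    # Toy salience: most frequent non-trivial words stand in for a
    # learned keyword extractor.
    words = [w.strip(".,").lower() for w in report.split()]
    return [w for w, _ in Counter(w for w in words if len(w) > 4).most_common(k)]

def extract_sentences(report, keywords, k=2):
    # Toy salience: sentences ranked by keyword overlap stand in for a
    # learned sentence extractor.
    sents = [s.strip() for s in report.split(".") if s.strip()]
    return sorted(sents, key=lambda s: -sum(kw in s.lower() for kw in keywords))[:k]

report = ("Heart size is normal. There is a small right pleural effusion. "
          "No pneumothorax. The effusion is slightly larger than before.")
kws = extract_keywords(report)
salient = extract_sentences(report, kws)
# In the full system, `salient` and `kws` would feed the abstractive step
# that writes the Impressions section.
print(kws, salient)
```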