Improving annotation quality: empirical insights into bias, human-AI collaboration, and workflow design
Beck, Jacob
2025
English
Universitätsbibliothek der Ludwig-Maximilians-Universität München
Beck, Jacob (2025): Improving annotation quality: empirical insights into bias, human-AI collaboration, and workflow design. Dissertation, LMU München: Fakultät für Mathematik, Informatik und Statistik
Beck_Jacob.pdf (PDF, 1MB)

Abstract

High-quality annotated datasets are essential for training machine learning (ML) models. Annotation means assigning a label (such as a category, sentiment score, or classification) to an instance, for example a piece of text, an image, or a PDF file. Even as training algorithms continue to improve, a model’s real-world performance remains limited by the quality of its training data. While many approaches exist for processing training data, relatively little attention within the ML field has been devoted to annotation quality and to best practices for data collection. This thesis contributes to the field through empirical assessments of annotation bias and its implications for training data quality. It further proposes and evaluates strategies to mitigate such biases and enhance annotation outcomes. In addition, it explores the role of large language models (LLMs) in annotation workflows by experimentally assessing their use in fully automated and human-assisted hybrid annotation pipelines.

The introductory part outlines the research questions and motivates the overall contributions. As part of this, the background chapter reviews the literature on factors influencing annotation quality, organized along two main dimensions: annotator-related factors encompass individual-level traits and behaviors that may be correlated with labeling decisions; annotation data collection strategies refer to all design-related decisions made when setting up a task, such as the selection of examples provided in the instructions, task length, or payment. In addition, challenges and opportunities of automating annotation are discussed.

Annotation is a structured task that follows standardized procedures for data collection, typically involving a stimulus and fixed response options, much like data collection in fields such as survey methodology and social psychology. In the first and second studies, we investigate whether well-known sources of bias identified in these fields also apply to annotation tasks. The first study presents experimental results from a large sample of annotators: we analyze task structure and demographic effects in a hate speech sentiment annotation task, systematically varying the screen design to measure its effect on the resulting labels. In addition, we collect demographic characteristics, task perception metrics, and paradata to assess their relationship with label assignment. Most notably, annotation behavior was significantly influenced by whether classification tasks appeared on a single screen or were split across two, as well as by the annotator’s first language. The second study extends this project by examining whether annotation behavior changes over the course of the task. It estimates how the likelihood of assigning a label evolves with the number of previously completed annotations. As the task progressed, labeling a statement as hateful or offensive became significantly less likely, though the effect was small in magnitude. Together, these studies show that annotations are sensitive both to who performs them and to how the task is structured.

The third and fourth studies explore the potential of real-time, low-cost automated annotations generated by LLMs and their interaction with human annotators. In the third study, we conduct a cost-benefit analysis comparing different types of human and automated annotators in a satellite image annotation task, including initial attempts to combine human and LLM-generated annotations. We observe strong potential for reducing costs while retaining quality, with less need for expert annotators, especially when leveraging the LLM’s self-reported uncertainty. The fourth study builds on this work by documenting a pipeline for generating and curating a gold-standard validation dataset of CO2 emission values extracted from PDF documents. It demonstrates a feasible approach to integrating automated components to reduce the workload of human domain experts: even in this highly specialized task, combining LLM annotations with non-expert adjudication can substantially reduce reliance on domain experts.

The fifth study investigates the risks and implications of increasing automation in annotation workflows, particularly pre-annotations generated by artificial intelligence (AI). We simulate an AI-assisted scenario by presenting annotators with pre-annotations framed as AI-generated in order to examine cognitive bias during adjudication. Notably, annotators who reported greater skepticism toward AI were more accurate in adjudicating the pre-annotations. We also observe that annotators are less likely to correct pre-annotations when flagging an error requires providing a corrected value.

Across its five contributions, this dissertation advances the field of annotation data collection methods by identifying bias in human, automated, and hybrid annotation setups. It proposes and evaluates multiple solutions and offers guidance for both research and practical annotation tasks. A consistent focus is placed on integrating insights and theories from various academic disciplines in order to benefit from a broad range of existing findings.