Advances in deep active learning and synergies with semi-supervision
Supervised deep learning models have successfully enabled the automation of processes and the discovery of valuable insights within large datasets, exceeding the capabilities of humans in analyzing and managing the constantly growing volumes of data. However, the effectiveness of these models largely depends on the availability of sufficient high-quality annotated training data. While collecting unlabeled data is often possible with comparably low effort, labeling it is laborious, time-consuming, and costly. In many domains, such as medical or industrial applications, providing accurate annotations requires specialized expertise, which is both scarce and expensive. It is therefore essential to reduce the need for manual labeling wherever possible.

In this thesis, we address the challenge of insufficient and costly annotations. In particular, we contribute to the field of deep active learning. Unlike traditional approaches that passively rely on pre-labeled data, active learning employs an iterative process that alternates between training and labeling. By letting the model decide which instances are most useful for its learning process, active learning achieves strong performance with a smaller amount of labeled data. Semi-supervised learning is a related field dealing with limited labeled data; it aims to improve models by leveraging both labeled and unlabeled data. Our contributions include new methods and insights into active learning as well as its combination with semi-supervised learning to exploit the strengths of both.

Modern deep active learning strategies typically combine model uncertainty with sample diversity to avoid labeling data with redundant information. However, ensuring diversity by calculating distances in learned representations is computationally expensive, particularly for complex, high-dimensional neural networks. Our first contributions address this limitation.
We propose using the prediction probabilities to simultaneously select diverse and uncertain instances, substantially accelerating query selection while returning a high-quality query set. Our method proves effective for both tabular and image classification, outperforming competitors in both label and time efficiency.

Our next contribution focuses on active learning for node classification. The edges in a graph provide valuable insights into both the importance of individual nodes and the overall graph structure. Hence, it is essential to consider them when actively selecting the most useful instances for labeling. We introduce a novel active learning method for node classification that leverages diffusion-based graph heuristics in multiple ways, both for graph learning and for actively querying nodes for labeling. In contrast to existing methods, our approach demonstrates robust performance across diverse datasets and consistently surpasses random sampling. Moreover, thanks to pre-computations, it is faster than its competitors.

Finally, we turn our attention to image classification, with a particular focus on combining techniques from semi-supervised learning and active learning. Our first contribution in this domain proposes a novel active pseudo-labeling approach. We show that false pseudo-labels often occur during the initial iterations, where label information is particularly sparse, and cause long-term negative effects due to confirmation bias. To mitigate this, we refine the pseudo-labels produced by a model based on their consistency with the predictions of a second model, considerably improving prediction accuracy. In our last contribution, we analyze the effects of confirmation bias in semi-supervised learning on datasets with challenging characteristics that frequently appear in real-world data.
In particular, we consider high imbalance within and between classes as well as high similarity between classes. We demonstrate the limitations of semi-supervised methods in overcoming confirmation bias when the data is labeled randomly and passively. We then show how confirmation bias can be mitigated by selecting better samples through active learning, showcasing the potential of combining semi-supervised learning and active learning in the presence of common real-world data challenges.
Author: Gilhuber, Sandra
Year: 2025
Language: English
Publisher: Universitätsbibliothek der Ludwig-Maximilians-Universität München
Gilhuber, Sandra (2025): Advances in deep active learning and synergies with semi-supervision. Dissertation, LMU München: Fakultät für Mathematik, Informatik und Statistik
PDF: Gilhuber_Sandra.pdf (12MB)