Sharifzadehgolpayegani, Sahand (2023): On the importance of symbol grounding and top-down processes in computer vision. Dissertation, LMU München: Faculty of Mathematics, Computer Science and Statistics
PDF: Sharifzadehgolpayegani_Sahand.pdf (9 MB)
Abstract
In the past decade, feedforward artificial neural networks have taken the field of artificial intelligence by storm and shown impressive results in many domains. Nevertheless, one of the remaining challenges in artificial intelligence is connecting the differentiable feature space of deep learning to the rich world of object-based, symbolic knowledge. In computer vision, for example, images consist of different features: edges and curves at a lower level, and objects and relations at a higher level. Even though it is not feasible to describe the low-level features in natural language, the attributes of objects and the relations between them can be represented by symbols and are well documented throughout human literature. Therefore, developing novel and effective architectures that can learn and utilize symbolic knowledge within the differentiable deep learning framework is essential. To this end, this dissertation argues for methods that map symbols to image-grounded representations, such that symbols and images share the same representation space. Furthermore, we discuss the key role of top-down processes in utilizing object-level knowledge; top-down signals have been shown to play a significant role in the human brain in overcoming challenges such as occlusion. For example, even though an image might not contain enough pixels of a truck's wheel, once the truck itself has been detected in the top layers of a neural network, this higher-level knowledge can be used to recognize that a small area in a corner corresponds to the wheel. Current feedforward neural networks, however, lack effective inductive biases for such top-down processing. We show that grounding symbols in images and employing top-down mechanisms not only improves scene understanding but also allows us to benefit from the massive pool of human-written symbolic knowledge in addition to image annotations. In summary, this dissertation introduces significant advances in artificial intelligence, particularly in computer vision and commonsense modeling. We propose models that utilize (1) structured knowledge, (2) unstructured text, and (3) 3D information to improve scene understanding, and through large-scale experiments we show that our models significantly improve state-of-the-art results.
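To make the two ideas in the abstract more concrete, the sketch below is a minimal, hypothetical PyTorch illustration (not the dissertation's actual architecture): symbol embeddings are projected into the same feature space as image region features ("grounding"), and the most likely grounded symbol for each region is fed back to refine that region's features (a crude stand-in for a "top-down" signal). All module names, dimensions, and the feedback rule are assumptions chosen for illustration only.

```python
# Hypothetical sketch of symbol grounding + a simple top-down feedback step.
# Not the method from the dissertation; an illustration of the general idea.
import torch
import torch.nn as nn

class GroundedTopDown(nn.Module):
    def __init__(self, num_symbols=100, sym_dim=300, img_dim=512):
        super().__init__()
        self.symbol_emb = nn.Embedding(num_symbols, sym_dim)  # symbolic vocabulary
        self.ground = nn.Linear(sym_dim, img_dim)              # map symbols into the image feature space
        self.top_down = nn.Linear(img_dim * 2, img_dim)        # fuse higher-level context back into regions

    def forward(self, region_feats, symbol_ids):
        # region_feats: (num_regions, img_dim) bottom-up image region features
        grounded = self.ground(self.symbol_emb(symbol_ids))    # (num_symbols, img_dim)
        # bottom-up classification: similarity of each region to each grounded symbol
        logits = region_feats @ grounded.t()                   # (num_regions, num_symbols)
        # top-down pass: the most likely symbol's grounded embedding refines each region
        context = grounded[logits.argmax(dim=-1)]              # (num_regions, img_dim)
        refined = self.top_down(torch.cat([region_feats, context], dim=-1))
        return logits, refined

# Usage with random features standing in for detected image regions.
model = GroundedTopDown()
regions = torch.randn(5, 512)
logits, refined = model(regions, torch.arange(100))
```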
| Item Type: | Theses (Dissertation, LMU Munich) |
|---|---|
| Keywords: | Visual Language Models, Scene Graph Classification, Symbol Grounding, Schemata, Piaget, Computer Vision |
| Subjects: | 000 Computers, Information and General Reference; 000 Computers, Information and General Reference > 004 Data processing computer science |
| Faculties: | Faculty of Mathematics, Computer Science and Statistics |
| Language: | English |
| Date of oral examination: | 7 February 2023 |
| 1. Referee: | Tresp, Volker |
| MD5 Checksum of the PDF-file: | 48927f5eccb07b703671dfde6432d162 |
| Signature of the printed copy: | 0001/UMC 30217 |
| ID Code: | 33176 |
| Deposited On: | 27 Feb 2024 15:05 |
| Last Modified: | 28 Feb 2024 14:41 |