On the importance of symbol grounding and top-down processes in computer vision
In the past decade, feedforward artificial neural networks have stormed the field of artificial intelligence and shown impressive results in many domains. Nevertheless, one of the remaining challenges in artificial intelligence is connecting the differentiable feature space of deep learning to the rich world of object-based, symbolic knowledge. For example, in computer vision, images consist of different features: edges and curves at a lower level, and objects and relations at a higher level. Even though it is not feasible to describe the low-level features using natural language, the attributes of objects and the relations between them can be represented by symbols and are well documented throughout human literature. Therefore, developing novel and effective architectures that can learn and utilize symbolic knowledge within the differentiable deep learning framework is essential. To this end, in this dissertation we argue for methods that map symbols to image-grounded representations such that they share the same representation space as images. Furthermore, we discuss the key role of top-down processes in utilizing object-level knowledge; top-down signals have been shown to play a significant role in the human brain in overcoming challenges such as occlusion. For example, even though there might not be enough pixels from a truck's wheel in an image, after detecting the truck itself within the top layers of a neural network, we can use that higher-level knowledge to recognize a small area in a corner as corresponding to the wheel. However, current feedforward neural networks lack effective inductive biases for top-down processing. We show that grounding symbols in images and employing top-down mechanisms not only improves scene understanding but also allows us to benefit from the massive pool of human-written symbolic knowledge in addition to image annotations.
In summary, this dissertation introduces significant advances in the artificial intelligence domain, particularly in computer vision and commonsense modeling. We propose models that utilize (1) structured knowledge, (2) unstructured text, and (3) 3D information to improve scene understanding, and through large-scale experiments, we show that our models significantly improve on state-of-the-art results.
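The grounding and top-down ideas summarized above can be illustrated with a minimal, purely hypothetical sketch: symbols are embedded in the same vector space as image region features (bottom-up classification by similarity), and a detected high-level object feeds a relational prior back down to disambiguate a weak, occluded region. The class vocabulary, relation table, and random features here are illustrative stand-ins, not the dissertation's actual models or datasets.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # shared representation dimensionality (illustrative)

# Hypothetical symbol vocabulary; in practice these would come from
# knowledge bases or text, grounded in the image feature space.
classes = ["truck", "wheel", "road"]

# One embedding per symbol, living in the SAME space as image region
# features (random stand-ins here for learned weights).
symbol_emb = rng.normal(size=(len(classes), D))
symbol_emb /= np.linalg.norm(symbol_emb, axis=1, keepdims=True)

# Hypothetical relational knowledge: a prior over what co-occurs with a truck.
relations = {"truck": {"wheel": 1.0, "road": 0.5}}

def classify(region_feat):
    """Bottom-up: cosine-similarity scores of a region against all symbols."""
    f = region_feat / np.linalg.norm(region_feat)
    return symbol_emb @ f

def classify_with_context(region_feat, context_class, beta=0.5):
    """Top-down: after detecting a high-level object (the context), add a
    relational prior to the bottom-up scores of an ambiguous region."""
    scores = classify(region_feat)
    prior = np.array(
        [relations.get(context_class, {}).get(c, 0.0) for c in classes]
    )
    return scores + beta * prior

# A heavily occluded wheel: a weak, noisy feature near the 'wheel' symbol.
wheel_feat = (
    0.3 * symbol_emb[classes.index("wheel")] + 0.2 * rng.normal(size=D)
)

bottom_up = classify(wheel_feat)
with_context = classify_with_context(wheel_feat, "truck")
```

The design point is that both paths score against the same grounded symbol embeddings; the top-down pass only re-weights them with object-level knowledge, so the "wheel" score can only increase once a truck has been detected.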
Visual Language Models, Scene Graph Classification, Symbol Grounding, Schemata, Piaget, Computer Vision
Sharifzadehgolpayegani, Sahand
2023
English
Universitätsbibliothek der Ludwig-Maximilians-Universität München
Sharifzadehgolpayegani, Sahand (2023): On the importance of symbol grounding and top-down processes in computer vision. Dissertation, LMU München: Faculty of Mathematics, Computer Science and Statistics
PDF: Sharifzadehgolpayegani_Sahand.pdf (9MB)