Integrating Vision and Language Beyond the Semantic Level


How do humans experience the visual world in all its richness, diversity and complexity? And is it even possible to recreate such a perceptual experience on a computer? This remains one of the most important questions in Computer Vision. My dissertation work on integrating language with vision is an initial attempt to understand visual perception, and this brief summarizes my research toward that goal so far.

The key insight is that the visual system never works alone -- we integrate our experiences, our biases and our memories into all that we see. For example, if you were told to look for a knife, visual representations of the knife -- its characteristic parts and shape -- would guide your search. Contextual knowledge also comes into play: knowing that knives often occur in the kitchen would stop you from a futile search in another room. Such knowledge is the result of years of learning from our experiences with the world. For a computer to have any chance of reaching the visual performance of humans, such experiences must somehow be embedded in it. The "Cognitive Dialog" discussed in this PerMIS paper shows that this is indeed a plausible model for integrating textual knowledge into an active vision system.

Integrating Language and Vision

My main thesis is that since humans record a large amount of how the world functions in words and text, it is possible to use techniques from Computational Linguistics to learn associations and derive contextual relationships -- such as knowing that knives occur frequently in kitchens. The initial work in our EMNLP paper has shown that, with enough textual data, it is possible to learn a reasonable statistical model that relates visual scene interpretation to language. This relationship, however, is only learned at the semantic or label level: we take the output of an independent object detection algorithm and compare it with a textual database. My most recent work develops techniques that make the integration of language with visual processes even deeper, by directly influencing the visual processes before they actually detect the objects, as shown in the ICRA paper. In other words, can we change our visual algorithms "on the fly" using high-level knowledge from language directly?
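As a rough illustration of how a contextual relationship such as "knives occur in kitchens" could be derived from text, the sketch below scores word pairs by pointwise mutual information (PMI) over sentence-level co-occurrence. This is not the model from the EMNLP paper; PMI, the toy corpus, and the function names `cooccurrence_pmi` and `association` are all assumptions chosen only to make the idea concrete.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_pmi(sentences):
    """Return PMI scores for every pair of words that share a sentence."""
    word_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        # Count each word at most once per sentence.
        words = set(sentence.lower().split())
        word_counts.update(words)
        pair_counts.update(frozenset(p) for p in combinations(sorted(words), 2))

    n = len(sentences)
    pmi = {}
    for pair, joint in pair_counts.items():
        a, b = tuple(pair)
        p_joint = joint / n
        p_a = word_counts[a] / n
        p_b = word_counts[b] / n
        pmi[pair] = math.log(p_joint / (p_a * p_b))
    return pmi


def association(pmi, a, b):
    """Look up the PMI of two words; pairs never seen together score -inf."""
    return pmi.get(frozenset((a, b)), float("-inf"))


# Hypothetical toy corpus, standing in for a large text collection.
corpus = [
    "the knife is in the kitchen",
    "she left a knife on the kitchen counter",
    "the kitchen drawer holds a knife",
    "the pillow is in the bedroom",
    "he sleeps in the bedroom",
    "a lamp stands in the bedroom",
]

pmi = cooccurrence_pmi(corpus)
knife_kitchen = association(pmi, "knife", "kitchen")
knife_bedroom = association(pmi, "knife", "bedroom")
```

On this toy corpus "knife" is more strongly associated with "kitchen" than with "bedroom", which is the kind of prior that could steer a search away from unlikely rooms. A real system would need a far larger corpus and some smoothing for rare pairs.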

Current work: Contour-based Object Recognition

FIG. 1. Guiding object recognition with textual knowledge. (I) Textual attributes of objects. (II) Learning relationships to extracted model contours. (III) Object recognition from supporting contours: (a) Contours corresponding to target model -- red means more similar. (b) "Heat map" of the object's likely location, computed from the weighted contours. (c) Detected object in blue box.

My current research focuses on exploring how this can be done for object detection and recognition. As shown in FIG. 1(I) above, an object is described using adjectives such as "long", "round", "flat", etc. Such knowledge can be obtained by asking people to say descriptive words for the objects, resulting in a corpus of textual attributes that we can use for learning associations. Next, we need to learn the relationship between such attributes and a simple representation of the object that we want to recognize -- for example, an outline as shown in FIG. 1(II). Once this relationship is learned, the visual processes can be tuned (by adjusting their parameters) to search specifically for outlines similar to that of the target object (FIG. 1(III)). The key contribution of this work is to bring object recognition to a new level, where the system can generate new object representations based simply on their textual description and then search for similar-looking objects in new environments. Below is a video of the output of a prototype of the system in action, where it was asked to look for "mug-like" objects, which are marked by a cross; the image on the left shows the system's "heat map" of where the target could be:

MEDIA 1. Guiding object recognition towards "mugs".
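The last stage in FIG. 1(III), going from weighted contours to a heat map and then to a box, can be sketched as a simple voting scheme. This is only an illustration under stated assumptions, not the prototype's actual algorithm: the contour similarity weights are taken as given, each contour point votes for a small square neighborhood, and the names `contour_heat_map` and `detect_box` and the parameters `radius` and `fraction` are hypothetical.

```python
def contour_heat_map(contours, weights, height, width, radius=1):
    """Accumulate each contour's similarity weight around its points."""
    heat = [[0.0] * width for _ in range(height)]
    for points, weight in zip(contours, weights):
        for row, col in points:
            # Each point votes for a (2*radius+1)-wide square, clipped to the image.
            for r in range(max(0, row - radius), min(height, row + radius + 1)):
                for c in range(max(0, col - radius), min(width, col + radius + 1)):
                    heat[r][c] += weight
    return heat


def detect_box(heat, fraction=0.5):
    """Bounding box (top, left, bottom, right) of cells >= fraction * peak."""
    peak = max(max(row) for row in heat)
    threshold = fraction * peak
    hot = [(r, c)
           for r, row in enumerate(heat)
           for c, value in enumerate(row)
           if value >= threshold]
    rows = [r for r, _ in hot]
    cols = [c for _, c in hot]
    return min(rows), min(cols), max(rows), max(cols)


# Two hypothetical contours given as (row, col) points in a 20x20 image.
# The first is assumed to resemble the target outline; the second is not.
contours = [
    [(5, 5), (5, 6), (6, 5), (6, 6)],
    [(15, 15), (15, 16)],
]
weights = [1.0, 0.25]  # assumed similarity to the target model

heat = contour_heat_map(contours, weights, height=20, width=20)
box = detect_box(heat)
```

Because the dissimilar contour contributes little weight, its region stays below the threshold and the box encloses only the neighborhood of the similar contour. In the system described above, the weights would come from the learned relationship between textual attributes and contours rather than being set by hand.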

Last Updated: Apr 03, 2013