Embedding High-Level Information into Low-Level Vision: Efficient Object Search in Clutter

Ching L. Teo, Austin Myers, Cornelia Fermüller, Yiannis Aloimonos

We present a novel visual search method that exploits the mid-level grouping capabilities of the image-torque operator [1] to recognize objects in clutter. The input is an RGB-Depth image and a set of target shape models that represent the object. We then tune the image-torque operator via a shape-conforming distance metric so that the target object exhibits the largest torque value. The maxima (and minima) of this "object-tuned" modulated torque map serve as candidate centroid locations of the target objects in the scene. Due to its mid-level nature, the operator is inherently robust to clutter. We demonstrate experimentally that the proposed method detects different objects under increasing clutter on a table, compared against the original torque operator [1] and other state-of-the-art bottom-up saliency detectors: Itti et al. [2], Harel et al. (GBVS) [3], and the kernel-based object detectors of Bo et al. [4].
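As a rough illustration of the underlying operator: the torque of a patch centered at a pixel p sums, over the edge pixels q inside the patch, the cross product of the displacement q − p with the edge tangent at q, normalized by the patch area, so that closed, consistently oriented contours produce strong responses at their centroid. A minimal NumPy sketch (single fixed patch size; the function name and simplifications are ours, not the exact formulation of [1]):

```python
import numpy as np

def torque_map(edge_mag, edge_theta, radius=8):
    """Simplified image-torque map.

    edge_mag   : (H, W) edge magnitudes (0 where there is no edge)
    edge_theta : (H, W) edge *tangent* orientations in radians
    radius     : half-size of the square patch around each centre pixel

    For each centre p, accumulates r x t over edge pixels q in the patch,
    where r = q - p and t is the unit tangent at q, normalized by patch area.
    """
    H, W = edge_mag.shape
    tx, ty = np.cos(edge_theta), np.sin(edge_theta)  # unit tangent vectors
    tau = np.zeros((H, W))
    ys, xs = np.nonzero(edge_mag > 0)
    for y, x in zip(ys, xs):
        # Every centre within the patch radius receives a contribution
        # from this edge pixel.
        y0, y1 = max(0, y - radius), min(H, y + radius + 1)
        x0, x1 = max(0, x - radius), min(W, x + radius + 1)
        py, px = np.mgrid[y0:y1, x0:x1]
        ry, rx = y - py, x - px  # displacement r = q - p
        # 2D cross product r x t = rx*ty - ry*tx
        tau[y0:y1, x0:x1] += edge_mag[y, x] * (rx * ty[y, x] - ry * tx[y, x])
    return tau / (2 * (2 * radius + 1) ** 2)
```

A circle of edges with consistently oriented tangents yields a strong torque extremum at its center, which is exactly the grouping behavior the search method modulates with the target's shape model.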

Example results

Example detection results comparing the original torque (middle) and the modulated torque (right) for three objects: flashlight, cap, and tissue box. Notice that there are fewer false maxima/minima between objects in the modulated torque, and the top detections fall on the correct target.


Object retrieval accuracy measured in terms of Cumulative Match Characteristic (CMC) curves. A method is better if it achieves a higher hit rate with fewer fixations. (a) and (b): CMC scores for different objects using the proposed "Top-Down" approach on the UMD clutter dataset and the RGB-D scenes dataset of [5], respectively. (c) and (d): mean CMC scores comparing the different approaches.
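Concretely, a CMC score at k fixations is the fraction of test images in which the target was fixated within the first k attempts. A minimal sketch (the function and variable names are ours):

```python
import numpy as np

def cmc(fixation_ranks, max_fixations=10):
    """Cumulative Match Characteristic curve.

    fixation_ranks : for each test image, the index (1-based) of the
                     first fixation that landed on the target.
    Returns the hit rate at k = 1 .. max_fixations.
    """
    ranks = np.asarray(fixation_ranks)
    return np.array([(ranks <= k).mean() for k in range(1, max_fixations + 1)])

# Example: targets found at fixations 1, 3, 2, 5, 1 across five images.
print(cmc([1, 3, 2, 5, 1], max_fixations=5))  # [0.4 0.6 0.8 0.8 1. ]
```

A curve that rises to 1.0 within a few fixations indicates the torque extrema are reliably ranking the true target location near the top.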

Detailed results for each object category for different approaches are available here.



  • [1] M. Nishigaki, C. Fermüller, and D. DeMenthon. "The Image Torque Operator: A New Tool for Mid-level Vision". In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 502–509, 2012.
  • [2] L. Itti, C. Koch, and E. Niebur. "A model of saliency-based visual attention for rapid scene analysis". IEEE Trans. Pattern Analysis and Machine Intelligence, 20(11):1254–1259, Nov. 1998.
  • [3] J. Harel, C. Koch, and P. Perona. "Graph-based visual saliency". In Advances in Neural Information Processing Systems (NIPS), pp. 545–552, 2006.
  • [4] L. Bo, X. Ren, and D. Fox. "Kernel descriptors for visual recognition". In Advances in Neural Information Processing Systems (NIPS), pp. 244–252, 2010.
  • [5] University of Washington RGB-D Object dataset.
Acknowledgements

The support of the European Union under the Cognitive Systems program (project POETICON++), the National Science Foundation under the Cyber-Physical Systems Program, and the Qualcomm Innovation Fellowship (Ching L. Teo) is gratefully acknowledged.

    Questions? Please contact cteo "at" umd dot edu