Computer Vision

On-the-fly unsupervised object discovery

We have been studying how an agent (e.g., a robot) can progressively perceive and build knowledge about its surrounding environment, and how it can improve the knowledge-building process while continually interacting with that environment, under minimal human supervision and with limited computing resources. Our goal is to establish a principle for the interaction loop between data-driven perception and control/trajectory planning, such that the former provides provable guarantees to the latter in terms of minimum acceptable performance. As in human perception, agent trajectories yield sensor data that is processed on-the-fly, while scene perception (e.g., visual object discovery and recognition) dictates the nature of subsequent trajectories.

We aim to develop novel, computationally efficient methods for progressive visual learning (a knowledge-building mechanism) that discover spatio-temporally salient structures in streaming data, discover objects in a bottom-up fashion (from observable features, to candidate object parts, to objects), and build object classes on-the-fly. Such learning mechanisms should provide guarantees for successful discovery, in terms of mathematically provable parameters, which can then be used for the exploration of unknown environments, thus advancing autonomy, safety, and intelligence for a wide class of modern (control) systems.

Stream clustering with fading affect bias. We developed a fast and accurate center-based, single-pass clustering method, with a main focus on improving the trade-off between accuracy and speed in computer vision problems, such as creating visual vocabularies on-the-fly. We use a stochastic mean-shift procedure to seek the local density peaks within a single pass over the data, and we incorporate dynamic kernel generation along with a density test that finds the most promising kernel initializations. The algorithm maintains two data structures: a dictionary of permanent kernels, and a "short memory" used to determine which emerging kernels to keep and which outliers to discard.

ICPR 2018 paper -- Code: Matlab implementation (zip)
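
The single-pass mechanism can be sketched as follows. This is a minimal illustration in Python, assuming hypothetical bandwidth, fading, and promotion parameters; it is not the tuned procedure of the ICPR 2018 paper.

```python
import numpy as np

class StreamClusterer:
    """Minimal single-pass, center-based stream clustering sketch.

    Illustrative only: `bandwidth`, `decay`, and `promote_after` are
    hypothetical parameters, not the paper's tuned values.
    """

    def __init__(self, bandwidth=1.0, decay=0.99, promote_after=5):
        self.bandwidth = bandwidth        # kernel radius
        self.decay = decay                # fading-affect factor per step
        self.promote_after = promote_after
        self.kernels = []                 # permanent kernels: [center, weight]
        self.short_memory = []            # emerging kernels: [center, hit count]

    def update(self, x):
        x = np.asarray(x, dtype=float)
        # Fade all permanent kernels so that recent data counts more.
        for k in self.kernels:
            k[1] *= self.decay
        # Try to assign the sample to the nearest permanent kernel.
        if self.kernels:
            dists = [np.linalg.norm(x - k[0]) for k in self.kernels]
            i = int(np.argmin(dists))
            if dists[i] < self.bandwidth:
                # Mean-shift-like step: move the center toward the sample,
                # with a step size that shrinks as the kernel gains weight.
                self.kernels[i][1] += 1.0
                self.kernels[i][0] += (x - self.kernels[i][0]) / self.kernels[i][1]
                return i
        # Otherwise track the sample in short memory; dense regions get
        # promoted, while isolated entries (outliers) never do.
        for m in self.short_memory:
            if np.linalg.norm(x - m[0]) < self.bandwidth:
                m[1] += 1
                if m[1] >= self.promote_after:    # crude density test
                    self.kernels.append([m[0].copy(), float(m[1])])
                    self.short_memory.remove(m)
                return None
        self.short_memory.append([x.copy(), 1])
        return None
```

Each streaming descriptor is passed to `update` exactly once, so the memory footprint depends only on the number of kernels, not on the stream length.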

Model-based image segmentation

CRF-driven deformable model. We developed a topology-independent solution for segmenting objects with texture patterns of any scale, using an implicit deformable model driven by Conditional Random Fields (CRFs). Our model integrates region and edge information as image-driven terms, whereas the probabilistic shape and internal (smoothness) terms use representations similar to those of level-set based methods. The evolution of the model is solved as a MAP estimation problem, where the target conditional probability is decomposed into the internal term and the image-driven term. For the latter, we use discriminative CRFs at two scales, pixel- and patch-based, to obtain smooth probability fields based on the corresponding image features.
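
In rough notation (ours, not necessarily the paper's symbols), the decomposition can be written as:

```latex
% MAP evolution of the model \Phi given the image I (notation illustrative);
% P(I) is constant in \Phi and drops out of the maximization.
\Phi^{*} = \arg\max_{\Phi} P(\Phi \mid I)
         = \arg\max_{\Phi} \underbrace{P(\Phi)}_{\text{internal (shape/smoothness) term}}
           \, \underbrace{P(I \mid \Phi)}_{\text{image-driven (CRF) term}}
```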

Deformable Probability Maps. Going a step beyond coupling deformable models with classification, we developed Deformable Probability Maps (DPMs) for object segmentation: graphical learning models that incorporate deformable model properties among the sites (cliques). The DPM configuration is described by probabilistic energy functionals, which incorporate shape and appearance, and determine 1D and 2D (boundary and surface) smoothness, consistency of image region features, and topology with respect to the salient image edges. Like deformable models, DPMs are dynamic, and their evolution is solved as a MAP inference problem.
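
As a generic illustration of MAP inference on a lattice (not the DPM inference procedure itself), the sketch below runs iterated conditional modes over per-site unary costs with a Potts smoothness term; `beta`, the 4-neighborhood, and the cost arrays are illustrative assumptions.

```python
import numpy as np

def icm_map_inference(unary, beta=1.0, n_iters=10):
    """Greedy MAP inference (Iterated Conditional Modes) on an H x W grid.

    unary: (H, W, K) array of per-pixel costs for K labels.
    beta:  hypothetical smoothness weight penalizing label disagreement
           between 4-connected neighbors (a Potts pairwise term).
    """
    H, W, K = unary.shape
    labels = unary.argmin(axis=2)          # initialize from the unary minimum
    for _ in range(n_iters):
        for i in range(H):
            for j in range(W):
                costs = unary[i, j].copy()
                # Add the Potts penalty from each in-bounds neighbor.
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < H and 0 <= nj < W:
                        costs += beta * (np.arange(K) != labels[ni, nj])
                labels[i, j] = costs.argmin()
    return labels
```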




Machine Vision-assisted In Situ Ichthyoplankton Imaging System


A collaboration with RSMAS, U of Miami: http://www.cs.iupui.edu/~gavriil/vital/MVISIIS/

R.K. Cowen's team at RSMAS, U of Miami, has designed and built a plankton imaging system (In Situ Ichthyoplankton Imaging System, ISIIS) capable of imaging large water volumes, with the goal of quantifying even rare plankton in situ. ISIIS produces very high resolution imagery for extended periods of time, necessitating automated data analysis and recognition.

Since we require the identification and quantification of a large number of organisms, we are developing fully automated software for the detection and recognition of organisms of interest, using machine vision and learning tools. Our framework aims at (i) the automatic detection of all organisms of interest, directly from the raw data, while filtering out noise and out-of-focus instances, (ii) the extraction and modeling of the appearance of each segmented organism, and (iii) the fully automated recognition of all the detected organisms simultaneously, using appearance and topology information in a novel classification framework. What differentiates our work from existing systems is that we aim to recognize all existing organisms simultaneously.
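
The pipeline structure of steps (i)-(iii) can be mocked up as below. Everything here is a deliberately simplified stand-in: thresholded connected components for detection, an intensity histogram as the appearance descriptor, and a per-segment nearest-mean recognizer in place of the actual joint, topology-aware classifier.

```python
import numpy as np
from scipy import ndimage

def detect_candidates(frame, thresh=0.5, min_area=50):
    """(i) Segment candidate organisms from a grayscale frame in [0, 1].
    Thresholding + connected components is a toy detector; focus and
    noise tests would also be applied at this stage."""
    labeled, _ = ndimage.label(frame > thresh)
    boxes = ndimage.find_objects(labeled)
    return [frame[b] for b in boxes
            if (b[0].stop - b[0].start) * (b[1].stop - b[1].start) >= min_area]

def extract_appearance(segment, n_bins=16):
    """(ii) Toy appearance descriptor: a normalized intensity histogram."""
    hist, _ = np.histogram(segment, bins=n_bins, range=(0.0, 1.0))
    return hist / max(hist.sum(), 1)

def recognize(segments, class_means):
    """(iii) Placeholder recognizer: nearest class mean per descriptor.
    The actual framework classifies all detections jointly and also uses
    topology; per-segment classification is shown only for brevity."""
    feats = np.array([extract_appearance(s) for s in segments])
    names = list(class_means)
    means = np.array([class_means[n] for n in names])
    dists = np.linalg.norm(feats[:, None, :] - means[None, :, :], axis=2)
    return [names[i] for i in dists.argmin(axis=1)]
```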




Integration of active learning in a collaborative Conditional Random Field

We developed an active learning approach for visual multiple object class recognition, using a Conditional Random Field (CRF) formulation. We call our graphical model 'collaborative' because it infers class posteriors in instances of occlusion and missing information by assessing the joint appearance and geometric arrangement of neighboring sites. The model inherently handles scenes containing multiple classes and multiple objects, while using the confidence of its predictions to enforce label uniformity in areas where the evidence supports similarity. Our method uses classification uncertainty to dynamically select new training samples for retraining the discriminative classifiers used in the CRF. We demonstrated the performance of our approach on cluttered scenes containing multiple objects and multiple class instances.
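
The sample selection step can be illustrated independently of the CRF itself: given per-site class posteriors, the sites with the highest predictive entropy are queried for labels and fed back into classifier retraining. The function below is a generic uncertainty-sampling sketch; `budget` is an illustrative parameter.

```python
import numpy as np

def select_queries(posteriors, budget=10):
    """Pick the `budget` sites whose class posteriors are most uncertain.

    posteriors: (N, K) array of per-site class probabilities, e.g. the
    marginals inferred by the CRF. Returns indices of the sites to have
    labeled and added to the training set before retraining.
    """
    p = np.clip(posteriors, 1e-12, 1.0)
    entropy = -(p * np.log(p)).sum(axis=1)     # high entropy = uncertain site
    return np.argsort(entropy)[::-1][:budget]
```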




Learning-based dynamic coupling of pose estimation (static) and tracking (temporal)

There are generally two major approaches to deformable and articulated object tracking: (i) continuous (or temporal) methods, which use both temporal and static information from the input sequence, and (ii) discrete methods, which handle each frame separately, using only static information and some form of prior knowledge.

Continuous trackers provide high accuracy and low complexity by exploiting continuity constraints over time, but once they lose track they usually cannot recover easily. Discrete approaches, on the other hand, do not suffer from error accumulation over time, giving independent solutions at each time instance, but their accuracy depends on the generality of the prior knowledge they utilize; moreover, when this prior knowledge is derived from databases, the computational time increases dramatically.

We developed a new framework for robust 3D tracking that achieves high accuracy and robustness by combining the aforementioned advantages of the continuous and discrete approaches. Our approach consists of a data-driven dynamic coupling between a continuous tracker and a novel discrete shape estimation method. Our discrete tracker utilizes a database that contains object shape sequences, instead of single shape samples, introducing a temporal continuity constraint. The two trackers work in parallel, giving solutions for each frame separately. While a tightly coupled system would require high computational complexity, our framework instantly chooses the best of the two solutions based on an error criterion: the actual 3D error, i.e., the difference between the expected 3D shape and the estimated one. When tracking objects with high degrees of freedom and abrupt motions, it is difficult to obtain such 3D information, since no ground-truth shape is available. In our framework, we therefore learn off-line to predict the 3D shape error from the 2D appearance error, i.e., the difference between the tracked object's edges and the edges of the utilized model's projection onto the image plane.
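
The per-frame arbitration can be summarized as below; the function and the `predict_3d_error` regressor are illustrative names, standing in for the off-line learned mapping from 2D appearance error to 3D shape error.

```python
from typing import Callable, Sequence, Tuple

def select_hypothesis(
    hypotheses: Sequence[object],           # e.g. [continuous_result, discrete_result]
    appearance_errors: Sequence[float],     # 2D edge-mismatch error of each hypothesis
    predict_3d_error: Callable[[float], float],  # learned 2D -> 3D error mapping
) -> Tuple[object, float]:
    """Choose among tracker outputs by *predicted* 3D error, since the
    ground-truth 3D shape is unavailable online. Illustrative sketch."""
    errors = [predict_3d_error(e) for e in appearance_errors]
    best = min(range(len(errors)), key=errors.__getitem__)
    return hypotheses[best], errors[best]
```

Preferring the discrete hypothesis whenever the continuous tracker's predicted error spikes is also what lets the continuous tracker re-initialize and recover from lost tracks.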




Dynamically adaptive tracking of gestures and facial expressions

Behavioral indicators of deception and of behavioral states are extremely difficult for humans to analyze. Our framework aims at analyzing nonverbal behavior on video, by tracking the gestures and facial expressions of an individual being interviewed.

The system uses two cameras (one for the face and one for the whole-body view), for analysis at two different scales, and consists of the following modules: (a) head and hands tracking, using Kalman filtering and a data-driven skin-region detection method that adapts to each specific individual, (b) shoulder tracking, based on a novel texture-based edge localization method, (c) 2D facial feature tracking, using a fusion of the KLT tracker and different Active Shape Models, and (d) 3D face and facial feature tracking, using the 2D tracking results and a model-based 3D face tracker. The main advantage of our framework is that we can track both gestures and facial expressions with great accuracy and robustness, at rates higher than 20 fps.
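
Module (a) rests on the standard Kalman predict/update loop; a minimal constant-velocity filter for one 2D point (e.g., a head or hand centroid from the skin-region detector) is sketched below, with illustrative noise levels rather than the system's actual tuning.

```python
import numpy as np

class ConstantVelocityKalman:
    """Kalman filter with state [px, py, vx, vy] and position-only measurements."""

    def __init__(self, q=1e-2, r=1.0):
        self.x = np.zeros(4)                 # state estimate
        self.P = np.eye(4)                   # state covariance
        self.F = np.eye(4)                   # constant-velocity transition,
        self.F[0, 2] = self.F[1, 3] = 1.0    # assuming dt = 1 frame
        self.H = np.eye(2, 4)                # we observe position only
        self.Q = q * np.eye(4)               # process noise (illustrative)
        self.R = r * np.eye(2)               # measurement noise (illustrative)

    def step(self, z):
        # Predict the next state from the motion model.
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # Update with the measured position z = [px, py].
        y = np.asarray(z, dtype=float) - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]                    # filtered position
```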




The infinite Hidden Markov Random Field model

Hidden Markov random field (HMRF) models are parametric statistical models widely used for image segmentation, as they arise naturally in problems where a spatially-constrained clustering scheme is required. A major limitation of HMRF models concerns the automatic selection of the proper number of their states, i.e., the number of segments derived by the image segmentation procedure. Typically, various likelihood-based criteria are employed for this purpose. Nevertheless, such methods often fail to yield satisfactory results, while their use entails a significant computational burden. Recently, Dirichlet process mixture (DPM) models have emerged as a cornerstone of nonparametric Bayesian statistics and as promising candidates for clustering applications where the number of clusters is unknown a priori.

Inspired by these advances, and to resolve the aforementioned issues of HMRF models, we introduced a novel nonparametric Bayesian formulation of the HMRF model, the infinite HMRF (iHMRF) model, built on a joint DPM and HMRF construction. We derived an efficient variational Bayesian inference algorithm for the proposed model, and applied it to a series of image segmentation problems, demonstrating its advantages over existing learning-based methodologies.
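
The Dirichlet process ingredient can be made concrete via the truncated stick-breaking construction, which is what allows the effective number of states to adapt to the data. The snippet below is a generic DP building block, with an illustrative concentration parameter and truncation level, not the iHMRF variational algorithm itself.

```python
import numpy as np

def stick_breaking_weights(v):
    """Mixture weights pi_k = v_k * prod_{j<k} (1 - v_j) from Beta fractions v."""
    v = np.asarray(v, dtype=float)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v[:-1])])
    return v * remaining

rng = np.random.default_rng(0)
alpha = 2.0                        # DP concentration (illustrative)
v = rng.beta(1.0, alpha, size=10)  # truncation level K = 10 (illustrative)
pi = stick_breaking_weights(v)     # weights; the tiny leftover mass is the
                                   # truncated tail of the infinite mixture
```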