Dr. Murat Dundar
Assistant Professor
Computer & Information Science Dept.

 

PROJECTS:

CAREER: Self-adjusting Models as a New Direction in Machine Learning

Project period: 3/1/2013 - 2/28/2018

Abstract:

Traditional supervised learning algorithms assume that the list of classes defined by a training data set is exhaustive and that new data samples originate from one of the existing classes represented in the training data set. This assumption is not very realistic in many real-world domains as the data-generating mechanisms constantly evolve and new classes of interest emerge on a continual basis.  Under such circumstances it is impractical if not impossible to define a training data set with a complete set of classes. When the training data set is not exhaustively defined, a future sample of a class not represented in the training data set will be misclassified with certainty, leading to an ill-defined classification problem.

This study offers a new direction for supervised learning that relaxes the fixed-model assumption defined by the existing data in order to have a self-adjusting model that can evolve by dynamically adding new classes to better accommodate prospective data in offline as well as online settings. Specifically, the aims of the project include (1) studying non-parametric prior models to dynamically model the number of classes; (2) developing new online and offline inference techniques in partially-observed settings; (3) modeling the rapidly accumulating nature of samples evident with emerging classes; (4) automatically associating a newly discovered class with higher-level groups of classes in an attempt to identify potentially interesting class formations; and (5) developing partially-observed tree models containing observed and unobserved nodes, where observed nodes represent existing classes and unobserved nodes are introduced online to fill the gaps in the existing data hierarchy that become evident only with the arrival of new data.

The broader consequences of this work will extend to the following areas: 1. Bio-security and bio-surveillance: The developed algorithms could become highly instrumental in implementing a real-time intelligent bio-warning platform for identifying national outbreaks as early as possible. 2. Information retrieval: This study introduces a whole new approach to indexing documents by an evolving vocabulary, which may lead to more efficient indexing with significantly improved relevancy, consistency, and timeliness. 3. Remote sensing: This new framework may become essential for fully exploiting the wealth of spectral information available in hyper-spectral images, allowing for in-depth and high-level image analysis of scenes with dynamic and distinct characteristics. The project also opens exciting possibilities to enhance the research and education environment for K-12, undergraduate, and graduate students, giving them an unparalleled opportunity to work on stimulating cross-disciplinary applications that combine computational and life sciences. A workshop and a scientific competition will be organized to raise the awareness of scientific communities concerning the studied problems.

Additional information about the project can be accessed through the project website at http://www.cs.iupui.edu/~dundar/career.html


Automated Spectral Data Transformations and Analysis Pipeline for High Throughput Flow Cytometry

Grant number: 5R21EB015707-02
Project period: 7/1/2012 - 6/30/2014
Investigators: Bartek Rajwa (PI), Murat Dundar (co-I), Alex Pothen (co-I)

Abstract:

High-throughput flow cytometry (HT-FC) is an emerging cell-analysis and screening technique employed in various fields of the life sciences, including drug discovery and clinical research. One of the major limitations of HT-FC is the lack of robust, rapid, and reproducible tools for data analysis and data mining. The current paradigm of FC analysis does not suit the HT format well. Traditionally, FC data are analyzed employing interactive exploratory visualization, which requires preparing a number of 2-D scatter plots that are used by an FC operator or researcher for visual evaluation of sample characteristics. Although the recent interest of computer science and bioinformatics communities in FC has spurred development of automated compensation and gating techniques, the proposed algorithms still follow the traditional analysis pathway (compensation plus gating), and typically attempt to mimic trained human operators in delineating various cell populations defined by the presence of fluorescent markers of varying intensities. Unfortunately, this model is not sustainable when hundreds or thousands of data sets must be processed in real time. This proposed research attempts to radically re-invent the FC data analysis pipeline for high-throughput FC by employing spectral classification approaches to FC data. In the proposed framework the FC data will be modeled as a mixture of signals that can be quantitatively recovered if certain physical and biological constraints describing the experimental system are rigorously followed. We propose a set of algorithms that will allow us first to define and encode the domain knowledge describing the analyzed specimens, subsequently to approximate the concentrations of labels, and from there recover information about the presence or absence of specific phenotypes of interest. The techniques employed will functionally replace two steps in FC data analysis that have traditionally been viewed as separate: compensation and gating. Instead, a new iterative spectral classification process will recover the quantitative characteristics of samples. This will allow for fast and automated extraction of sample features, as well as for mining the collected specimens for similar datasets. The proposed algorithm will be prototyped using the R language for statistical computing, and relevant procedures will be made available to other researchers in the field of FC via the Bioconductor project. Upon successful testing and validation using various datasets contributed by collaborators, the classification algorithms will be implemented in PlateAnalyzer, an HT-FC data analysis package developed at Purdue University.

PUBLIC HEALTH RELEVANCE: Flow cytometry (FC) is an important single-cell analysis tool employed in various clinical and research applications. The currently used FC data-analysis paradigm utilizes an exploratory, interactive model requiring operators to evaluate samples manually using expertise and experience. This project attempts to build an automated, robust, reproducible, and operator-independent data-analysis system that can be employed for FC data processing and data mining, limiting subjectivity and enhancing the value of FC techniques.
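As an illustration only, the idea of modeling a measurement as a mixture of label signals that is recovered under a physical constraint can be sketched in a few lines of Python. This is not the project's algorithm (which the abstract says will be prototyped in R): the function name unmix_nonnegative, the reference spectra, and the choice of projected gradient descent are all assumptions made for this sketch. The only constraint encoded here is that label abundances cannot be negative.

```python
def unmix_nonnegative(spectra, observed, iterations=5000):
    """Estimate non-negative abundances a so that
    sum_j a[j] * spectra[j] approximates `observed` in the least-squares sense.

    spectra:  hypothetical reference emission spectra, one list per label
    observed: measured intensities, one value per detector channel
    Assumes at least one reference spectrum is non-zero.
    """
    k = len(spectra)
    # Gram matrix G[i][j] = <spectra[i], spectra[j]> and vector h[i] = <spectra[i], observed>
    gram = [[sum(p * q for p, q in zip(si, sj)) for sj in spectra] for si in spectra]
    rhs = [sum(p * y for p, y in zip(si, observed)) for si in spectra]
    # trace(G) is an upper bound on the largest eigenvalue of G,
    # so 1 / trace(G) is a safe gradient step size.
    step = 1.0 / sum(gram[i][i] for i in range(k))
    a = [0.0] * k
    for _ in range(iterations):
        # Gradient of 0.5 * ||mixture - observed||^2 with respect to a
        grad = [sum(gram[i][j] * a[j] for j in range(k)) - rhs[i] for i in range(k)]
        # Gradient step followed by projection onto a >= 0
        a = [max(0.0, a[i] - step * grad[i]) for i in range(k)]
    return a


# Hypothetical example: two labels measured on three detector channels.
reference = [[0.8, 0.2, 0.0], [0.1, 0.6, 0.3]]
measured = [1.7, 1.0, 0.3]  # built as 2.0 * label 0 + 1.0 * label 1
abundances = unmix_nonnegative(reference, measured)
```

An abundance that comes out at or near zero would then be read as the absence of that label, which is the step where a presence/absence call on a phenotype could be made.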


Machine-learning Approach to Label-free Detection of New Bacterial Pathogens

Grant number: 5R21AI085531-02
Project period: 5/1/2010 - 4/30/2012 (one year no-cost extension granted through 4/30/2013)
Investigators: Murat Dundar (PI) and Bartek Rajwa (PI)

Abstract:

Technologies for rapid detection and classification of bacterial pathogens are crucial for securing the food supply. A light-scattering sensor recently developed for real-time detection and identification of colonies of multiple pathogens has shown great promise for distinguishing bacterial cultures at the genus and species level for Listeria, Staphylococcus, Salmonella, Vibrio, and Escherichia coli. Unlike traditional testing methods, this new technology does not require a labeling reagent or biochemical processing. The classification approach currently used with this technology relies on supervised learning. For accurate detection and classification of bacterial pathogens, the training library used to train the classifier should consist of samples of all possible forms of the pathogens. Construction of such a training library is impractical if not impossible due to the high mutation rate that characterizes some of the infectious agents. In this project we propose to advance this sensor technology to allow for the detection of new classes/subclasses of bacteria that do not exist in the training library. Learning with a non-exhaustive training library is an ill-defined problem. We design a two-stage classification scheme to alleviate this problem. The first stage, i.e. detection, identifies whether the bacteria sample belongs to one of the subclasses in the training library or to an as-yet-unseen and thus unrepresented subclass. If the former is true, the sample is fed to the second stage, i.e. classification, where it is assigned to one of the existing subclasses. If the latter is true, an alert is raised and the sample is saved for follow-up analysis.

Benefit for Public Health: Successful implementation of this project will allow for label-free detection and identification of food pathogens and their mutated subclasses not previously seen. This will reduce the number of food-related outbreaks and will help secure our food supply.
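The two-stage scheme described in the abstract above can be sketched as follows. This is a minimal illustration, not the project's method: the abstract does not say how the detection stage decides that a sample is new, so a simple distance threshold to the nearest subclass centroid is assumed here, and the names classify_two_stage, library, and threshold are hypothetical.

```python
import math


def centroid(samples):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]


def classify_two_stage(x, library, threshold):
    """Return (label, distance); label is None when an alert should be raised.

    library:   dict mapping subclass name -> list of training feature vectors
    threshold: assumed maximum distance at which a sample still counts as known
    """
    centroids = {name: centroid(vectors) for name, vectors in library.items()}
    distances = {name: math.dist(x, c) for name, c in centroids.items()}
    nearest = min(distances, key=distances.get)
    # Stage 1 (detection): is the sample close enough to any known subclass?
    if distances[nearest] > threshold:
        return None, distances[nearest]  # unseen subclass: raise alert, save sample
    # Stage 2 (classification): assign the sample to the nearest known subclass.
    return nearest, distances[nearest]


# Hypothetical two-feature training library with two known subclasses.
library = {
    "A": [[0.0, 0.0], [0.0, 2.0]],
    "B": [[10.0, 10.0], [10.0, 12.0]],
}
known_label, _ = classify_two_stage([0.5, 1.0], library, threshold=3.0)
novel_label, _ = classify_two_stage([5.0, 6.0], library, threshold=3.0)
```

In this toy setup the first sample sits next to subclass "A" and passes to stage 2, while the second lies far from both centroids and is flagged (label None) for follow-up analysis. The threshold controls the trade-off between false alerts and missed new subclasses.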

Publications:

Ferit Akova, Yuan Qi, Bartek Rajwa, Murat Dundar, “Self-adjusting Models for Semi-supervised Learning in Partially-observed Settings,” In Proceedings of the IEEE International Conference on Data Mining (ICDM’12), Brussels, Belgium, December 10-13, 2012. (To appear as a full paper, acceptance rate: 11%)

Murat Dundar, Ferit Akova, Yuan Qi, Bartek Rajwa, “Bayesian Nonexhaustive Learning for Online Discovery and Modeling of Emerging Classes,” In John Langford and Joelle Pineau (Eds.), Proceedings of the 29th International Conference on Machine Learning (ICML’12), Edinburgh, Scotland, June 26-July 1, 2012 (pp. 113-120). Omnipress, 2012.

Bartek Rajwa, Murat Dundar, Ferit Akova, Valery Patsekin, Euiwon Bae, Yanjie Tang, J. Eric Dietz, E. Daniel Hirleman, J. Paul Robinson, Arun K. Bhunia, “Digital microbiology: detection and classification of unknown bacterial pathogens using a label-free laser light scatter-sensing system,” Proceedings of SPIE, 8029, May 2011.

Bartek Rajwa, Murat Dundar, Ferit Akova, Amanda Betasso, Valery Patsekin, E. Dan Hirleman, Arun K. Bhunia, J. Paul Robinson, “Discovering unknown: detection of emerging pathogens using label-free light scattering system,” Cytometry Part A, 77A(12):1103–1112, 2010 (PMCID: PMC3224816).

Ferit Akova, Murat Dundar, V. Jo Davisson, E. Daniel Hirleman, Arun K. Bhunia, J. Paul Robinson, Bartek Rajwa, “A Machine-learning Approach for Label-free Detection of Unmatched Bacterial Serovars,” Statistical Analysis and Data Mining Journal, 3(5):289-301, 2010 (PMCID: PMC3230886).