Dr. Murat Dundar
Associate Professor
Computer & Information Science Dept.

 

CAREER: Self-adjusting Models as a New Direction in Machine Learning

Award number: 1252648
Project period: 3/1/2013 - 2/28/2018
 


Abstract:

Machine learning algorithms are now routinely used to build predictive models from data in a wide range of applications. However, current approaches to machine learning have an important limitation: they assume that the set of classes observed in a training data set is exhaustive and that new data samples originate from one of the existing classes represented in the training data set. This assumption is unrealistic in many real-world applications in which previously unobserved classes of interest emerge.

This study explores a new class of machine learning algorithms that produce self-adjusting models that can accommodate new classes observed in data in offline as well as online learning scenarios. The project aims to (i) use non-parametric models to dynamically incorporate the changing number of classes; (ii) develop new online and offline inference techniques to accommodate new classes as they emerge; (iii) automatically associate newly discovered classes with higher-level groups of classes in an attempt to identify potentially interesting class formations; and (iv) develop partially-observed tree models containing observed and unobserved nodes, where observed nodes represent existing classes and unobserved nodes are introduced online to fill the gaps in the existing data hierarchy that become evident only with the arrival of new data.

The broader impacts of this work could extend to several real-world applications, including bio-security and bio-surveillance, information retrieval, and remote sensing, in settings where not all of the classes are known a priori. The educational plan includes outreach to K-12 students and enhanced research opportunities for undergraduate and graduate students in computer science as well as at the intersection of computational and life sciences. All the software, publications, and data sets resulting from the project will be freely disseminated to the larger research and educational community.

Publications:

  • Sarkhan Badirli, Zeynep Akata, George Mohler, Christine Picard, Murat Dundar, "Fine-Grained Zero-Shot Learning with DNA as Side Information," Conference on Neural Information Processing Systems (NeurIPS'21), 2021. Online

  • Sarkhan Badirli, Zeynep Akata, Murat Dundar, “Bayesian Zero-shot Learning,” ECCV Workshops, 2020.

  • Baichuan Zhang, Murat Dundar, Vachik Dave, Muhammad Al Hasan, “Dirichlet Process Gaussian Mixture for Active Online Name Disambiguation by Particle Filter,” in Proceedings of Joint Conference on Digital Library, 2019. PDF

  • Ellen Leask, Bethany Ehlmann, Murat Dundar, “Investigating Hydrated Mineral Deposits in Terra Sirenum Mars,” Lunar and Planetary Science Conference 50, 2019.

  • Murat Dundar, Bethany Ehlmann, Ellen Leask, “Rare Phase Detections in CRISM Data at Pixel-scale by Machine Learning Generate New Discoveries about Geology at Mars Rover Landing Areas: Jezero and NE Syrtis,” Lunar and Planetary Science Conference 50, 2019.

  • Yicheng Cheng, Bartek Rajwa, Murat Dundar, "Bayesian Nonparametrics for Non-exhaustive Learning," Advances in Neural Information Processing Systems (NIPS), Bayesian Nonparametrics Workshop, 2018. PDF

  • Ellen Leask, Bethany Ehlmann, Murat Dundar, Scott Murchie, Frank Seelos, “Challenges in the Search for Perchlorate and Other Hydrated Minerals with 2.1μm Absorptions on Mars,” Geophysical Research Letters, 45(22), 2018. PDF

  • Ellen Leask, Bethany Ehlmann, Murat Dundar, Scott Murchie, Frank Seelos, “New Possible CRISM Artifact at 2.1 Micrometers and Implications for Orbital Mineral Detections,” Lunar and Planetary Science Conference 49, 2018.

  • Yicheng Cheng*, Murat Dundar, George Mohler, “A Coupled ETAS-I2GMM Point Process with Applications to Seismic Fault Detection,” Annals of Applied Statistics, 12(3), pp. 1853-1870, 2018.

  • Halid Z. Yerebakan* and Murat Dundar, “Partially Collapsed Parallel Gibbs Sampler for Dirichlet Process Mixture Models,” Pattern Recognition Letters, 90, pp. 22-27, 2017.

  • Bethany Ehlmann and Murat Dundar, "Acidic Conditions During Open System Weathering on Late Noachian/Early Hesperian Mars? Newly Identified Outcrops of Alunite and Jarosite from Orbital CRISM Data," AAS/Division for Planetary Sciences Meeting Abstracts, 2016.

  • Baichuan Zhang, Murat Dundar, Muhammed Hasan, "Bayesian Non-Exhaustive Classification A Case Study: Online Name Disambiguation using Temporal Record Streams," in Proceedings of ACM CIKM, Indianapolis, US, Oct 2016. PDF

  • Murat Dundar and Bethany Ehlmann, "Rare Jarosite Detection in CRISM Imagery by Non-Parametric Bayesian Clustering," in Proceedings of IEEE WHISPERS'16, Los Angeles, US, Aug 2016. PDF

  • Bartek Rajwa, Paul Wallace, Elizabeth Griffiths, Murat Dundar, "Automated Assessment of Disease Progression in Acute Myeloid Leukemia by Probabilistic Analysis of Flow Cytometry Data," IEEE Transactions on Biomedical Engineering, 64(5), 2017. Online

  • Murat Dundar, Qiang Kou, Baichuan Zhang, Yicheng He, and Bartek Rajwa, “Simplicity of Kmeans versus Deepness of Deep Learning: A Case of Unsupervised Feature Learning with Limited Data,” In Proceedings of IEEE International Conference on Machine Learning Applications, Miami, FL, USA, December 11-13, 2015. PDF

  • Bethany Ehlmann and Murat Dundar, "Are Noachian/Hesperian Acidic Waters Key to Generating Mars' Regional Scale Aluminum Phyllosilicates? The Importance of Jarosite Co-occurrences with Al-Phyllosilicate Units," 46th Lunar and Planetary Science Conference, The Woodlands, TX, March 16-20, 2015 (oral presentation). PDF

  • Halid Z. Yerebakan, Bartek Rajwa, Murat Dundar, "The Infinite Mixture of Infinite Gaussian Mixtures," Advances in Neural Information Processing Systems (NIPS'14), Montreal, Canada, December 8-13, 2014. (acceptance rate: 24.6%) PDF

  • Murat Dundar, Ferit Akova, Halid Z. Yerebakan, Bartek Rajwa, "A Non-parametric Bayesian Model for Joint Cell Clustering and Cluster Matching: Identification of Anomalous Sample Phenotypes with Random Effects," BMC Bioinformatics 15 (1), 314, 2014. Online

  • Murat Dundar, Halid Z. Yerebakan, Bartek Rajwa, "Batch Discovery of Recurring Rare Classes toward Identifying Anomalous Samples," In Proceedings of the 20th Annual SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD'14), New York, USA, Aug 24-27 2014. (acceptance rate: 15%) PDF Video Lecture

  • Murat Dundar, Bartek Rajwa, Lin Li, “Partially-observed Models for Classifying Minerals on Mars,” In Proceedings of WHISPERS'13, Gainesville, FL, June 25-28, 2013. PDF

Presentations:

  • Title: Rare Phase Detections in CRISM Data at Pixel-scale by Machine Learning Generate New Discoveries about Geology at Mars Rover Landing Areas: Jezero and NE Syrtis
    Presenter: Murat Dundar
    Venue: Lunar and Planetary Science Conference 50, The Woodlands, TX
    Presentation Type: Oral

  • Title: Bayesian Nonparametrics for Non-exhaustive Learning
    Presenter: Murat Dundar
    Venue: NIPS' 18 Bayesian Nonparametrics Workshop, Montreal, Canada
    Presentation Type: Poster

  • Title: Bayesian Non-Exhaustive Classification A Case Study: Online Name Disambiguation using Temporal Record Streams
    Presenter: Baichuan Zhang
    Venue: ACM CIKM' 16, Indianapolis, IN
    Presentation Type: Oral

  • Title: Rare Jarosite Detection in CRISM Imagery by Non-Parametric Bayesian Clustering
    Presenter: Murat Dundar
    Venue: WHISPERS' 16, Los Angeles, CA
    Presentation Type: Oral

  • Title: Simplicity of Kmeans versus Deepness of Deep Learning: A Case of Unsupervised Feature Learning with Limited Data
    Presenter: Murat Dundar
    Venue: ICMLA' 15, Miami, FL
    Presentation Type: Oral

  • Title: The Infinite Mixture of Infinite Gaussian Mixtures
    Presenter: Halid Yerebakan
    Venue: NIPS' 14, Montreal, Canada
    Presentation Type: Poster

  • Title: A non-parametric Bayesian model for joint cell clustering and cluster matching under random effects
    Presenter: Murat Dundar
    Venue: GLIIFCA'14, Oconomowoc, WI
    Presentation Type: Invited

  • Title: Batch Discovery of Recurring Rare Classes toward Identifying Anomalous Samples
    Presenter: Murat Dundar
    Venue: KDD' 14, New York, NY
    Presentation Type: Oral (Video)

Other Products:

News Releases:

Software:

  • ASPIRE: A software package implemented in C++ that identifies, in a completely unsupervised way, recurring classes (both normal and rare) across a batch of samples that are significantly perturbed by random effects.

  • I2GMM: A software package implemented in C++ for clustering data sets with well-defined albeit skewed/multi-mode clusters. It uses a two-level non-parametric Bayesian hierarchy of Gaussian mixture models.

Invention Disclosures:

  • Methods for Discovering Rare Cell Populations and Anomalous Samples in Flow Cytometry.
    Inventors: Murat Dundar and Bartek Rajwa

In-class Kaggle Contests:

  • CSCI 590 Machine Learning 2014. Automated Indexing of PubMed Abstracts using MeSH Terms

  • CSCI 590 Machine Learning 2015. Bacteria Classification at the Genus Level

  • CSCI 590 Machine Learning 2016. MARS Mineral Discovery Challenge

  • CSCI 590 Machine Learning 2017. Authorship Attribution Challenge

  • CSCI 590 Machine Learning 2018. MARS Mineral Discovery Challenge 2

  • CSCI 573 Statistical Machine Learning 2019. Automated Indexing of PubMed Abstracts using MeSH Terms 2

Graduate Students:

Undergraduate Students:

Spring 2014:

  • Jordyn Kramer
    Preparing wrapper files for running ASPIRE software in R

  • Brandon Upp
    Implementing an XML parser for parsing clinical records collected as part of The Cancer Genome Atlas (TCGA) initiative

  • Nhan Do
    Modeling stripe noise in hyperspectral images acquired by CRISM (Compact Reconnaissance Imaging Spectrometer for Mars)

Fall 2014:

  • Nhan Do
    Preprocessing PubMed Abstracts for Automated MeSH Indexing

Spring 2015:

  • Kelly De Waal
    Literature Review on Large Scale Medical Informatics

  • Yicheng He
    Unsupervised Feature Learning using Optical Scatters of Bacterial Cultures

  • Nhan Do
    Automated MeSH Indexing for PubMed Abstracts
    The IUPUI Center for Research and Learning selected Nhan Do for a RISE award, a $1,500 grant to help him complete this research project with Dr. Dundar.

Fall 2015:

  • Yicheng He
    Testing Theano Deep Learning Algorithms on the Diabetic Retinopathy Data set

Fall 2016:

  • Andrew Swineheart
    Identifying Interest Groups using Movie Ratings

Spring 2017:

  • Uladzimir Kasacheuski

  • Blake Conrad
    Clustering of US Cities based on crime patterns

Fall 2017:

  • James Beem
    Java GUI for hyperspectral imaging

Spring 2018:

  • Philip Varble and Robert McKay
    Java GUI for hyperspectral imaging

Spring 2019:

  • Bryce Poepplel, Travis Renfroe, Zachary Wang
    Bayesian Python Toolkit for CRISM Image Ratio'ing and Classification

Project Outcomes Report:

Pattern classes that originate in natural settings exhibit a power-law property; hence, in a randomly sampled training set, no training examples are expected to be available for most of the pattern classes. As a result, contrary to common assumption, data distributions in modern machine learning problems are often not invariant between the training and test phases, violating the main premise on which traditional machine learning operates. The model is trained with a limited number of classes defined under ideal settings, yet classification is often rendered in an uncontrolled environment where new classes of patterns and anomalies emerge on a continuous basis. This ill-defined setting creates a two-sided problem. First, the unknown class, which could potentially represent a significant abnormality such as a residual population of cancer cells in bone marrow or an emerging pathogenic bacterial strain present in food products, cannot be appropriately detected, leading to potentially catastrophic consequences. Second, even if the unknown classes do not have any significance, misclassifying irrelevant samples into classes of practical importance raises serious doubts about the overall stability of machine-learning systems and exposes their vulnerability to adversarial attacks, as has recently been the case with some well-established deep learning models.
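As a rough illustration of the power-law argument above, the sketch below computes the expected fraction of classes that receive no training example when class frequencies follow a Zipf-like law. The number of classes, the exponent, and the sample sizes are hypothetical values chosen for illustration only; they are not taken from the project's data sets.

```python
def zipf_probabilities(num_classes, exponent=1.0):
    """Class probabilities proportional to 1 / rank**exponent (a Zipf-like law).

    Both arguments are illustrative assumptions, not project parameters.
    """
    weights = [1.0 / (rank ** exponent) for rank in range(1, num_classes + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_unseen_fraction(probabilities, sample_size):
    """Expected fraction of classes with zero examples in an i.i.d. sample.

    A class with probability p is missed by all sample_size draws with
    probability (1 - p) ** sample_size; averaging over classes gives the
    expected fraction of unobserved classes.
    """
    unseen = sum((1.0 - p) ** sample_size for p in probabilities)
    return unseen / len(probabilities)


if __name__ == "__main__":
    probs = zipf_probabilities(num_classes=10000, exponent=1.0)
    for n in (1000, 10000, 100000):
        fraction = expected_unseen_fraction(probs, n)
        print(f"sample size {n}: expected unseen fraction = {fraction:.3f}")
```

Under these assumed settings, most classes remain unobserved at the smaller sample sizes, and the unseen fraction shrinks only slowly as the sample grows.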

During the course of this project we studied this problem from a non-parametric Bayesian perspective by developing flexible models that can update on the fly to better accommodate the non-stationary nature of data distributions. More specifically, we have developed the infinite mixture of infinite Gaussian mixtures as a doubly non-parametric model to deal with multi-mode and skewed class distributions and extended this model to classification problems with a partially observed set of classes. We have implemented scalable MCMC inference algorithms that eliminate point dependency to allow for large-scale parallelization. We have proposed a new Bayesian approach to zero-shot learning that significantly improves accuracy on both observed and unobserved classes and achieves the best results among currently published state-of-the-art methods on a variety of benchmark datasets.
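The idea of a model that can open a new class on the fly can be sketched with a Dirichlet-process-style prior, in which a sample joins an existing class with weight proportional to that class's size and a previously unseen class with weight proportional to a concentration parameter. The code below is a deliberately simplified one-dimensional sketch and not the project's I2GMM or ASPIRE implementation: the function names, the fixed per-class Gaussians, the concentration value `alpha`, and the broad Gaussian used as the predictive density of a new class are all assumptions made for illustration.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def class_posteriors(x, classes, alpha=1.0, prior_mean=0.0, prior_std=10.0):
    """Posterior over existing classes plus a 'new' class for one sample.

    classes maps a class name to (count, mean, std). The key "new" is
    reserved for the not-yet-observed class (an assumption of this sketch).
    Existing classes get prior weight count / (n + alpha); the new class
    gets alpha / (n + alpha) and a broad Gaussian predictive density.
    """
    total = sum(count for count, _, _ in classes.values())
    scores = {}
    for name, (count, mean, std) in classes.items():
        scores[name] = count / (total + alpha) * gaussian_pdf(x, mean, std)
    scores["new"] = alpha / (total + alpha) * gaussian_pdf(x, prior_mean, prior_std)
    norm = sum(scores.values())
    return {name: score / norm for name, score in scores.items()}


if __name__ == "__main__":
    # Hypothetical training classes: (number of examples, mean, std).
    known = {"a": (50, -2.0, 0.5), "b": (50, 2.0, 0.5)}
    for sample in (2.1, 12.0):
        post = class_posteriors(sample, known)
        best = max(post, key=post.get)
        print(f"x = {sample}: most probable class = {best}")
```

In this sketch a sample close to a known class is assigned to it, while a sample far from every known class is routed to the new class instead of being forced into an existing one.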

We tested our algorithms on significant machine learning problems that arise in a wide range of disciplines. Using our algorithms we have detected several rare mineral phases on Mars, even in terrains that had already undergone extensive traditional spectral analysis. These small detections not only offer new cues toward a more accurate and complete geologic mapping of Mars, paving the way for future discoveries, but are also crucial for guiding the Mars rovers and for contextualizing their discoveries. We have reported results showing that our algorithms can diagnose blood cell population abnormalities with perfect accuracy and can detect change in disease progression in blood cancer with near-perfect accuracy. We have proposed a robust solution for the online name disambiguation problem and demonstrated that this framework can not only identify articles written by known authors but can also identify articles written by unknown authors sharing the same name. We also developed a new point process model for space-time-magnitude modeling of earthquake events.