CAREER: Self-adjusting Models as a New Direction in Machine Learning
Award number: 1252648
Project period: 3/1/2013 - 2/28/2018
Abstract:
Machine learning algorithms are now
routinely used to build predictive models from data in
a wide range of applications. However, current approaches
to machine learning have an important limitation: they
assume that the set of classes observed in a training
data set is exhaustive and that new data samples
originate from one of the classes represented
in that training data set. This assumption is unrealistic
in many real-world applications in which previously
unobserved classes of interest emerge.
This study explores a new class of machine learning
algorithms that produce self-adjusting models that can
accommodate new classes observed in data in offline as
well as online learning scenarios. The project aims to (i)
use non-parametric models to dynamically incorporate the
changing number of classes; (ii) develop new online and
offline inference techniques to accommodate new classes
as they emerge; (iii) automatically associate newly
discovered classes with higher-level groups of classes
in an attempt to identify potentially interesting class
formations; and (iv) develop partially-observed tree
models containing observed and unobserved nodes, where
observed nodes represent existing classes and unobserved
nodes are introduced online to fill the gaps in the
existing data hierarchy that become evident only with
the arrival of new data.
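As a rough illustration of how a non-parametric prior can leave room for classes that have not yet been observed, the sketch below uses the Chinese restaurant process, a standard construction associated with Dirichlet process models. It is an assumption chosen for illustration only; the abstract does not state which prior the project uses, and the function name and the concentration parameter `alpha` are hypothetical.

```python
def crp_class_probabilities(class_counts, alpha):
    """Prior class probabilities for the next sample under a Chinese restaurant process.

    class_counts: number of samples already assigned to each existing class.
    alpha: concentration parameter (assumed value; larger means new classes are more likely).
    Returns (probabilities of the existing classes, probability of a brand-new class).
    """
    total = sum(class_counts) + alpha
    # An existing class is chosen in proportion to how many samples it already holds.
    existing = [count / total for count in class_counts]
    # A previously unobserved class always keeps non-zero probability mass.
    new_class = alpha / total
    return existing, new_class


if __name__ == "__main__":
    existing, new_class = crp_class_probabilities([6, 3], alpha=1.0)
    print("existing classes:", existing)
    print("new class:", new_class)
```

Because the new-class probability never reaches zero, the number of classes can grow as data arrive, which is the behavior aim (i) describes.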
The broader impacts of this work could extend to several
real-world applications, including bio-security and
bio-surveillance, information retrieval, and remote
sensing, in settings where not all of the
classes are known a priori. The educational plan
includes outreach to K-12 students and enhanced research
opportunities for undergraduate and graduate students in
computer science as well as at the intersection of
computational and life sciences. All software,
publications, and data sets resulting from the project
will be freely disseminated to the larger research and
educational community.
Publications:
-
Sarkhan Badirli, Zeynep Akata, George Mohler, Christine Picard, Murat Dundar,
"Fine-Grained Zero-Shot Learning with DNA as Side Information," Conference on Neural Information Processing Systems (NeurIPS'21),
2021.
Online
-
Sarkhan Badirli, Zeynep Akata, Murat Dundar, “Bayesian Zero-shot Learning,”
ECCV Workshops, 2020.
-
Baichuan Zhang, Murat Dundar, Vachik Dave, Muhammad Al Hasan, “Dirichlet Process Gaussian Mixture for Active Online Name Disambiguation by Particle Filter,” in Proceedings of the Joint Conference on Digital Libraries, 2019.
PDF
-
Ellen Leask, Bethany Ehlmann, Murat Dundar, “Investigating Hydrated Mineral Deposits in Terra Sirenum, Mars,” Lunar and Planetary Science Conference 50, 2019.
-
Murat Dundar, Bethany Ehlmann, Ellen Leask, “Rare Phase Detections in CRISM Data at Pixel-scale by Machine Learning Generate New Discoveries about Geology at Mars Rover Landing Areas: Jezero and NE Syrtis,” Lunar and Planetary Science Conference 50, 2019.
-
Yicheng Cheng, Bartek Rajwa, Murat Dundar, "Bayesian Nonparametrics for Non-exhaustive Learning," Advances in Neural Information Processing Systems (NIPS), Bayesian Nonparametrics Workshop, 2018.
PDF
-
Ellen Leask, Bethany Ehlmann, Murat Dundar, Scott Murchie, Frank Seelos, “Challenges in the Search for Perchlorate
and Other Hydrated Minerals with 2.1μm Absorptions on Mars,” Geophysical Research Letters, 45(22), 2018.
PDF
-
Ellen Leask, Bethany Ehlmann, Murat Dundar, Scott Murchie, Frank Seelos, “New Possible CRISM Artifact at 2.1 Micrometers and Implications for Orbital Mineral Detections,” Lunar and Planetary Science Conference 49, 2018.
-
Yicheng Cheng*, Murat Dundar, George Mohler, “A Coupled ETAS-I2GMM Point Process with Applications to Seismic Fault Detection,” Annals of Applied Statistics, 12(3), pp. 1853-1870, 2018.
-
Halid Z. Yerebakan* and Murat Dundar, “Partially Collapsed Parallel Gibbs Sampler for Dirichlet Process Mixture Models,” Pattern Recognition Letters, 90, pp. 22-27, 2017.
-
Bethany Ehlmann and Murat Dundar, "Acidic Conditions During Open System Weathering on Late Noachian/Early Hesperian Mars? Newly Identified Outcrops of Alunite and Jarosite from Orbital CRISM Data," AAS/Division for Planetary Sciences Meeting Abstracts,
2016.
-
Baichuan Zhang, Murat Dundar, Muhammed Hasan, "Bayesian Non-Exhaustive Classification A Case Study: Online Name Disambiguation using Temporal Record Streams," in Proceedings of ACM CIKM, Indianapolis, US, Oct 2016.
PDF
-
Murat Dundar and Bethany Ehlmann, "Rare Jarosite Detection in CRISM Imagery by Non-Parametric Bayesian Clustering," in Proceedings of IEEE WHISPERS'16, Los Angeles, US, Aug 2016.
PDF
-
Bartek Rajwa, Paul Wallace, Elizabeth Griffiths, Murat Dundar,
"Automated Assessment of Disease Progression in Acute Myeloid Leukemia by Probabilistic Analysis of Flow Cytometry
Data," IEEE Transactions on Biomedical Engineering,
64(5),
2017.
Online
-
Murat Dundar, Qiang Kou, Baichuan Zhang, Yicheng He,
and Bartek Rajwa, “Simplicity of Kmeans versus Deepness of Deep Learning: A Case of Unsupervised Feature Learning with Limited Data,”
In Proceedings of IEEE International Conference on Machine Learning Applications,
Miami, FL, USA, December 11-13, 2015.
PDF
-
Bethany Ehlmann and Murat Dundar,
"Are Noachian/Hesperian Acidic Waters Key to Generating Mars' Regional Scale Aluminum Phyllosilicates? The Importance of Jarosite Co-occurrences with Al-Phyllosilicate Units,"
46th Lunar and Planetary Science Conference, The
Woodlands, TX, March 16-20, 2015 (oral presentation).
PDF
-
Halid Z. Yerebakan, Bartek Rajwa, Murat Dundar, "The
Infinite Mixture of Infinite Gaussian Mixtures,"
Advances in Neural Information Processing Systems
(NIPS'14), Montreal, Canada, December 8-13, 2014. (acceptance rate: 24.6%)
PDF
-
Murat Dundar, Ferit Akova, Halid Z. Yerebakan, Bartek
Rajwa, "A Non-parametric Bayesian Model for Joint Cell
Clustering and Cluster Matching: Identification of
Anomalous Sample Phenotypes with Random Effects," BMC
Bioinformatics 15 (1), 314, 2014.
Online
-
Murat Dundar, Halid Z. Yerebakan, Bartek Rajwa,
"Batch Discovery of Recurring Rare Classes toward
Identifying Anomalous Samples," In Proceedings of the 20th Annual SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD'14), New York, USA, Aug 24-27 2014. (acceptance rate: 15%)
PDF
Video Lecture
-
Murat Dundar, Bartek Rajwa, Lin Li, “Partially-observed Models for Classifying Minerals on Mars,”
In Proceedings of WHISPERS'13, Gainesville, FL, June
25-28, 2013. PDF
Presentations:
-
Title: Rare Phase Detections in CRISM Data at Pixel-scale by Machine Learning Generate New Discoveries about Geology at Mars Rover Landing Areas: Jezero and NE Syrtis
Presenter: Murat Dundar
Venue: Lunar and Planetary Science Conference 50,
The Woodlands, TX
Presentation Type: Oral
-
Title: Bayesian Nonparametrics
for Non-exhaustive Learning
Presenter: Murat Dundar
Venue: NIPS' 18 Bayesian Nonparametrics Workshop, Montreal, Canada
Presentation Type: Poster
-
Title: Bayesian Non-Exhaustive
Classification A Case Study: Online Name
Disambiguation using Temporal Record Streams
Presenter: Baichuan Zhang
Venue: ACM CIKM' 16, Indianapolis, IN
Presentation Type: Oral
-
Title: Rare Jarosite Detection in CRISM
Imagery by Non-Parametric Bayesian Clustering
Presenter: Murat Dundar
Venue: WHISPERS' 16, Los Angeles, CA
Presentation Type: Oral
-
Title: Simplicity of Kmeans
versus Deepness of Deep Learning: A Case of
Unsupervised Feature Learning with Limited Data
Presenter: Murat Dundar
Venue: ICMLA' 15, Miami, FL
Presentation Type: Oral
-
Title: The Infinite Mixture of
Infinite Gaussian Mixtures
Presenter: Halid Yerebakan
Venue: NIPS' 14, Montreal, Canada
Presentation Type: Poster
-
Title: A non-parametric Bayesian model for joint cell clustering and cluster matching under random effects
Presenter: Murat Dundar
Venue: GLIIFCA'14, Oconomowoc, WI
Presentation Type: Invited
-
Title: Batch Discovery of
Recurring Rare Classes toward Identifying Anomalous
Samples
Presenter: Murat Dundar
Venue: KDD' 14, New York, NY
Presentation Type: Oral (Video)
Other Products:
News Releases:
Software:
-
ASPIRE:
Software implemented in C++ that identifies, in a
completely unsupervised way, recurring classes (both
normal and rare) across a batch of samples that are
significantly perturbed by random effects.
-
I2GMM:
Software implemented in C++ for clustering
data sets with well-defined albeit skewed/multi-mode
clusters. It uses a two-level non-parametric
Bayesian hierarchy of Gaussian mixture models.
Invention Disclosures:
In-class Kaggle Contests:
-
CSCI 590 Machine Learning 2014.
Automated Indexing of PubMed Abstracts using MeSH
Terms
-
CSCI 590 Machine Learning 2015.
Bacteria Classification at the Genus Level
-
CSCI 590 Machine Learning 2016.
MARS Mineral Discovery Challenge
-
CSCI 590 Machine Learning 2017.
Authorship Attribution Challenge
-
CSCI 590 Machine Learning 2018.
MARS Mineral Discovery Challenge 2
-
CSCI 573 Statistical Machine Learning 2019.
Automated Indexing of PubMed Abstracts using MeSH
Terms 2
Graduate Students:
-
Halid Ziya Yerebakan, PhD student, 2011-2017
Thesis Topic: Non-parametric Bayesian Inference using Partially-observed Hierarchical Data Sets
-
Yicheng Cheng, PhD student, 2015-2021
-
Sarkhan Badirli, PhD student, 2016-
-
Abdulmecit Gungor, MS student, 2016-2018
Abdulmecit and Sarkhan were part of the team that took 2nd place in the Roche Global Code4Life University Challenge.
-
Hossein Karimy, MS student, 2014-2015
Thesis Topic: Automated Image Classification via Unsupervised Feature Learning by K-means
-
Nathan Hammes, MS student, Summer 2014
Special Study: Applied Machine Learning
Nathan Hammes ranked 3rd among 267 teams in the DecMeg2014: Decoding the Human Brain challenge.
Congratulations Nathan!
Undergraduate Students:
Spring 2014:
-
Jordyn Kramer: Preparing wrapper files for running ASPIRE software
in R
-
Brandon Upp: Implementing an XML parser for parsing clinical
records collected as part of The Cancer Genome
Atlas (TCGA) initiative
-
Nhan Do: Modeling stripe noise in hyperspectral images
acquired by CRISM (Compact Reconnaissance Imaging
Spectrometer for Mars)
Fall 2014:
Spring 2015:
-
Kelly De Waal: Literature Review on Large Scale Medical Informatics
-
Yicheng He: Unsupervised Feature Learning using Optical Scatters
of Bacterial Cultures
-
Nhan Do: Automated MeSH Indexing for PubMed Abstracts
The IUPUI Center for Research and Learning has chosen
Nhan Do to receive a $1,500 RISE award grant to help
him complete this research project with Dr. Dundar.
Fall 2015:
Fall 2016:
Spring 2017:
Fall 2017:
Spring 2018:
Spring 2019:
Project Outcomes Report:
Pattern classes that originate in natural settings exhibit a power-law property; hence, in a randomly sampled training set, no training examples are expected to be available for most of the pattern classes. As a result, contrary to the common assumption, data distributions in modern machine learning problems are often not invariant between the training and test phases, violating the main premise on which traditional machine learning operates. The model is trained with a limited number of classes defined under ideal settings, yet classification is often performed in an uncontrolled environment where new classes of patterns and anomalies emerge on a continuous basis. This ill-defined setting creates a two-sided problem. First, an unknown class, which could represent a significant abnormality such as a residual population of cancer cells in bone marrow or an emerging pathogenic bacterial strain in food products, cannot be appropriately detected, leading to potentially catastrophic consequences. Second, even if the unknown classes have no significance, misclassifying irrelevant samples into classes of practical importance raises serious doubts about the overall stability of machine learning systems and exposes their vulnerability to adversarial attacks, as has recently been the case with some well-established deep learning models.
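The power-law argument can be checked with a small calculation. The sketch below assumes class frequencies follow a Zipf-like law, with the probability of the class at rank k proportional to 1/k^s, and computes the expected number of classes that receive no examples in a random sample. The Zipf form, the function name, and the parameter values are illustrative assumptions, not figures from the report.

```python
def expected_unseen_classes(num_classes, sample_size, exponent=1.0):
    """Expected number of classes with zero examples in a random sample.

    Assumes class k (k = 1..num_classes) occurs with probability proportional
    to 1 / k**exponent. A class with probability p is missed by all
    sample_size independent draws with probability (1 - p)**sample_size.
    """
    weights = [1.0 / (rank ** exponent) for rank in range(1, num_classes + 1)]
    norm = sum(weights)
    return sum((1.0 - weight / norm) ** sample_size for weight in weights)


if __name__ == "__main__":
    # Hypothetical setting: 1000 classes and a training set of 1000 samples.
    unseen = expected_unseen_classes(num_classes=1000, sample_size=1000)
    print(f"expected classes without any training example: {unseen:.1f}")
```

Under these assumed settings more than half of the classes are expected to be missing from the training set, in line with the statement above.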
During the course of this project we studied this problem from a non-parametric Bayesian perspective by developing flexible models that can update on the fly to better accommodate the non-stationary nature of data distributions. More specifically, we developed the infinite mixture of infinite Gaussian mixtures as a doubly non-parametric model to deal with multi-mode and skewed class distributions, and extended this model to classification problems with a partially observed set of classes. We implemented scalable MCMC inference algorithms that eliminate point dependency to allow for large-scale parallelization. We proposed a new Bayesian approach to zero-shot learning that significantly improves both observed and unobserved class accuracies, achieving the best results among currently published state-of-the-art methods on a variety of benchmark datasets.
We tested our algorithms on significant machine learning problems that arise in a wide range of disciplines. Using our algorithms we detected several rare mineral phases on Mars, even in terrains that had already undergone extensive traditional spectral analysis. These small detections not only offer new cues toward a more accurate and complete geologic mapping of Mars, paving the way for future discoveries, but are also crucial for guiding the Mars rovers and for contextualizing their discoveries. We reported results showing that our algorithms can diagnose blood cell population abnormalities with perfect accuracy and can detect change in disease progression in blood cancer with near-perfect accuracy. We proposed a robust solution for the online name disambiguation problem and demonstrated that this framework can identify not only articles written by known authors but also articles written by unknown authors sharing the same name. We also developed a new point process model for space-time-magnitude modeling of earthquake events.