Database, Data Mining & Machine Learning (DDML) Research Group


We develop algorithms for machine learning, i.e., for automatically learning models, patterns, and skills from potentially big data. We then use these algorithms to discover new approaches to designing artificially intelligent systems, as well as to find novel computational solutions to biological, medical, and other complex real-world scientific and technological problems.


Geology │ Medical data

Current Projects


The Data Mining Group has a number of projects in areas such as interactive pattern mining, subgraph mining algorithms, and graph classification. Details of their projects can be found here.

Fast Reinforcement Learning Using Multiple Models and State Decomposition: This project is a collaboration with Yale University and is funded by a grant from the National Science Foundation (NSF). Intelligent behavior in both natural and man-made systems consists of being repeatedly successful in achieving desired goals in diverse, observably different situations on the basis of past experience. Learning is central to such behavior, since in both cases mechanisms must exist that yield rapid improvement with minimal a priori information. In fact, organizing, coordinating, and executing the diverse tasks common to both classes of systems, such as manipulation of effectors, obstacle avoidance, path planning, scene analysis, and tracking, all involve learning. The principal objective of this project is to address the problem of rapid learning using two different methods: (a) the use of multiple identification models, and (b) decomposition of high-dimensional state and action spaces. The project investigates the different ways in which (a) can be used to improve convergence. In (b), multiple agents with lower-dimensional state spaces are used in place of high-dimensional state and action spaces to overcome "the curse of dimensionality". This research will find application in situations where rapid learning is mandatory. One such area is the control of a fleet of plug-in hybrid electric vehicles (PHEVs). Given a fleet of vehicles, the objective reduces to a complex optimization problem of orchestrating switching between internal combustion engines and electric motors under a variety of constraints.
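As a rough illustration of method (a), the sketch below keeps several candidate identification models running in parallel and selects the one whose one-step predictions best match the observed data. It is a minimal example under stated assumptions, not the project's algorithm: the scalar linear dynamics, the function name `select_model`, and all numeric values are hypothetical.

```python
import random

def select_model(models, states, actions):
    """Pick the candidate model with the smallest accumulated squared
    one-step prediction error.

    Each model is a pair (a, b) for the assumed scalar dynamics
    x[t+1] = a * x[t] + b * u[t].
    """
    errors = [0.0] * len(models)
    for t, u in enumerate(actions):
        for i, (a, b) in enumerate(models):
            predicted = a * states[t] + b * u
            errors[i] += (states[t + 1] - predicted) ** 2
    best = min(range(len(models)), key=errors.__getitem__)
    return best, errors

# Hypothetical plant used to generate data: x[t+1] = 0.9 x[t] + 0.5 u[t]
random.seed(0)
actions = [random.gauss(0.0, 1.0) for _ in range(50)]
states = [1.0]
for u in actions:
    states.append(0.9 * states[-1] + 0.5 * u)

# Hypothetical candidate models; the second one matches the plant.
candidate_models = [(0.5, 1.0), (0.9, 0.5), (1.1, 0.2)]
best_index, errors = select_model(candidate_models, states, actions)
```

In a learning loop, the selected model could then be used to initialize or guide the controller, which is one way having several models available from the start may shorten the time needed to find a good one.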

WRESTORE (A secure decision support system for coordination of adaptation planning among Food, Energy, and Water actors in the Pacific Northwest): This project is a collaboration with Oregon State University and is funded by a grant from the National Science Foundation (NSF)/U.S. Department of Agriculture (USDA). Given the increasingly strong evidence for emerging climate change and economic trends, coordination of adaptation decisions for managing limited natural resources, such as water and arable land, in the food, energy, and water (FEW) sectors is expected to become increasingly critical. The goal of this project is to establish a novel, intelligent, secure, and human computation-based decision support system that will enable local and regional community actors to coordinate and co-identify robust adaptation decisions for natural resources management in FEW systems when chronic and/or acute physical and socio-economic perturbations occur.

Computational modeling of grievances and political instability through global media: This project is funded by an INSPIRE grant from the National Science Foundation (NSF). It focuses on developing point process models for measuring the level of cross-excitation among social media activity, web content, and events of conflict and political instability. A specific application involves predicting instability around elections in Nigeria using election-related Twitter posts.
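One standard point process model with cross-excitation is the multivariate Hawkes process, in which each event in one stream temporarily raises the event rate of the others. The sketch below evaluates its conditional intensity with an exponential kernel. The choice of a Hawkes model, the function name `intensity`, the two streams, and every parameter value are assumptions for illustration; the project description does not specify them.

```python
import math

def intensity(t, target, events, mu, alpha, beta):
    """Conditional intensity of stream `target` at time t.

    mu[i]       : background rate of stream i
    alpha[i][j] : expected number of stream-i events triggered by
                  one stream-j event (the cross-excitation strength)
    beta        : decay rate of the exponential kernel
    """
    rate = mu[target]
    for source, times in enumerate(events):
        for s in times:
            if s < t:  # only past events contribute
                rate += alpha[target][source] * beta * math.exp(-beta * (t - s))
    return rate

# Hypothetical setup: stream 0 = social media posts, stream 1 = incidents.
mu = [0.2, 0.1]
alpha = [[0.3, 0.0],   # posts are excited by earlier posts only
         [0.5, 0.2]]   # incidents are excited by posts and by incidents
beta = 1.0
events = [[1.0, 2.5], [2.0]]

rate_incidents = intensity(3.0, 1, events, mu, alpha, beta)
```

Under this formulation, fitting the model to data and inspecting the off-diagonal entries of `alpha` gives a direct estimate of how strongly one stream excites another.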

Algorithms for crime prediction: There are several ongoing projects aimed at designing algorithms for crime prediction and testing predictive policing software in randomized controlled field trials. This is joint work with the Los Angeles Police Department and collaborators at UCLA.

CAREER: Self-adjusting Models as a New Direction in Machine Learning: This study explores a new class of machine learning algorithms that produce self-adjusting models that can accommodate new classes observed in data in offline as well as online learning scenarios. The project aims to (i) use non-parametric models to dynamically incorporate the changing number of classes; (ii) develop new online and offline inference techniques to accommodate new classes as they emerge; (iii) automatically associate newly discovered classes with higher-level groups of classes in an attempt to identify potentially interesting class formations; and (iv) develop partially-observed tree models containing observed and unobserved nodes, where observed nodes represent existing classes and unobserved nodes are introduced online to fill the gaps in the existing data hierarchy that become evident only with the arrival of new data.
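To make the idea of a model whose number of classes grows with the data concrete, the sketch below uses a simple distance-threshold rule, in the spirit of DP-means, to open a new class whenever an incoming point is far from every existing class mean. This is a deliberately simplified stand-in and not the non-parametric models developed in the project; the function name `assign_online`, the one-dimensional data, and the threshold are hypothetical.

```python
def assign_online(points, threshold):
    """Assign 1-D points to classes one at a time, creating a new class
    when a point lies farther than `threshold` from all class means."""
    means, counts, labels = [], [], []
    for x in points:
        if means:
            k = min(range(len(means)), key=lambda i: abs(x - means[i]))
            dist = abs(x - means[k])
        else:
            k, dist = -1, float("inf")
        if dist > threshold:
            # No existing class explains this point: open a new one.
            means.append(x)
            counts.append(1)
            k = len(means) - 1
        else:
            # Update the running mean of the matched class.
            counts[k] += 1
            means[k] += (x - means[k]) / counts[k]
        labels.append(k)
    return labels, means

# Hypothetical stream in which two new classes emerge over time.
labels, means = assign_online([0.0, 0.2, 5.0, 5.1, 0.1, 10.0], threshold=1.0)
```

A Bayesian non-parametric model would replace the hard threshold with a probabilistic trade-off between joining an existing class and creating a new one.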

Computational Methods to Explore Big Bioassay Data for Better Compound Prioritization: Bioassay data are an extremely valuable source of experimental Big Data. They are rich in content and are produced in large quantities in the early stages of drug discovery to test the bioactivity of chemical compounds and identify promising drug candidates. However, the power of such data has not been fully unleashed, particularly for discovering novel knowledge and improving drug development. This is largely because the exploration of a much larger space of bioassays has been fundamentally hindered by the limited ability to identify and utilize the relations across bioassays. In this project, the PI and her team will develop novel computational methods and tools that can effectively explore a wide range of heterogeneous bioassays, identify experimentally unrevealed relations among them, and utilize the knowledge derived from them to improve compound prioritization. The research is expected to shed light on how to fully utilize the existing wealth of Big Data, stimulate new ways of distilling knowledge, establish new conceptual hypotheses, and develop corresponding analytical techniques. It aims to solve critical problems in drug discovery through Big Data means and has great potential to improve drug candidate identification through accurate compound prioritization, with far-reaching economic and societal impacts.

Mining Drug-Drug Interaction Induced Adverse Effects from Health Record Databases: Recent advances in large-scale electronic health record database techniques provide exciting new opportunities for the study of drug safety. Drug-drug interactions (DDIs), a major cause of adverse drug events (ADEs), are a serious global health concern and a severe detriment to public health. The sheer number of DDIs involving three or more drugs (also called high-order DDIs) poses a prohibitive challenge for molecular pharmacology and clinical research, which motivates alternative strategies such as mining health record data. This project aims to develop large-scale computational strategies and effective software tools for mining high-order DDI effects from health record databases, in order to yield novel discoveries in drug safety and ultimately to benefit national health and well-being.
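A first step in this kind of mining can be sketched as counting, for every combination of three drugs that co-occurs in a record, how often an adverse event was also recorded. The example below is a minimal sketch under stated assumptions, not the project's method: the record format, the function name `high_order_ade_rates`, and the toy records are all hypothetical, and the raw rate it reports is not a causal signal.

```python
from collections import Counter
from itertools import combinations

def high_order_ade_rates(records, order=3, min_support=2):
    """Return the observed adverse-event rate for each drug combination
    of size `order` that appears in at least `min_support` records.

    Each record is a pair (set_of_drugs, had_adverse_event).
    """
    exposed = Counter()
    with_event = Counter()
    for drugs, had_event in records:
        for combo in combinations(sorted(set(drugs)), order):
            exposed[combo] += 1
            if had_event:
                with_event[combo] += 1
    return {combo: with_event[combo] / n
            for combo, n in exposed.items() if n >= min_support}

# Toy, made-up records for illustration only.
records = [
    ({"A", "B", "C"}, True),
    ({"A", "B", "C"}, True),
    ({"A", "B", "C", "D"}, False),
    ({"A", "B"}, False),
    ({"B", "C", "D"}, False),
]
rates = high_order_ade_rates(records)
```

The number of combinations grows combinatorially with the number of drugs per record, which is why large-scale strategies are needed. A realistic analysis would also compare each high-order rate against the rates for the constituent single drugs and pairs, and adjust for confounding.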