Computational Methods to Explore Big Bioassay Data for Better Compound Prioritization

Bioassay data are an extremely valuable source of experimental Big Data. Rich in content, they are produced in large volumes in the early stages of drug discovery to test the bioactivities of chemical compounds and to identify promising drug candidates. However, the power of such Big bioassay data has not been fully unleashed, particularly for discovering novel knowledge and improving drug development. This is largely because exploration of the much larger space of bioassays has been hindered by a limited ability to identify and utilize the relations across bioassays. In this project, the PI and her team will develop novel computational methods and tools that can effectively explore a wide range of heterogeneous bioassays, identify experimentally unrevealed relations among them, and utilize the knowledge derived from those relations to improve compound prioritization. The research will shed light on how to fully utilize the existing wealth of Big Data, distill knowledge from it in innovative ways, establish new conceptual hypotheses, and develop the corresponding analytical techniques. It aims to solve critical problems in drug discovery through Big Data means and has great potential to improve drug candidate identification through accurate compound prioritization, with far-reaching economic and societal impacts.

The PI and her team will develop a computational framework to produce better compound rankings for each bioassay. The framework will consist of a local structure learning component, which discovers and leverages the compound ranking within a bioassay, and a global structure learning component, which discovers and leverages ranking relations across bioassays. They will also develop new methods to rank compounds under a combination of criteria. In particular, they will rank compounds by activity and selectivity simultaneously by leveraging ranking differences across bioassays. The research will be innovative both in applying original computational models and methods to important problems in drug discovery and in developing unique methodologies and computational techniques for core Computer Science research. For drug discovery, the research will provide novel perspectives and methodologies for using large-scale experimental data to solve important problems. For core Computer Science, the research will contribute a new solution framework and methods spanning data mining and machine learning. Specifically, the research will lead to novel methods for boosting ranking performance by actively including additional data, incorporating relevant information within a regularized optimization framework, and deploying iterative procedures and greedy strategies for large-scale problems with multiple simultaneous tasks. All these methods are generalizable to a variety of other Computer Science applications.
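
As a rough illustration of what learning a compound ranking within a single bioassay under a regularized objective can look like, the sketch below fits a linear scoring function with a pairwise logistic loss and an L2 penalty. It is not the project's method: the function names (fit_pairwise_ranker, concordance), the linear model, the loss, the hyperparameters, and the synthetic data are all assumptions chosen for brevity, and the sketch covers only the within-bioassay ranking, not the relations across bioassays.

```python
import numpy as np


def fit_pairwise_ranker(X, y, l2=0.1, lr=0.1, epochs=200):
    """Fit a linear scorer w so that X[i] @ w > X[j] @ w when y[i] > y[j].

    X: (n_compounds, n_features) hypothetical compound descriptors.
    y: (n_compounds,) measured activities within one bioassay.
    Assumes at least one pair of compounds has different activities.
    """
    # All ordered pairs (i, j) in which compound i is more active than j.
    i_idx, j_idx = np.where(y[:, None] > y[None, :])
    diffs = X[i_idx] - X[j_idx]
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        margins = diffs @ w
        # Gradient of mean(log(1 + exp(-margin))) + (l2 / 2) * ||w||^2.
        pair_weights = 1.0 / (1.0 + np.exp(margins))
        grad = -(diffs * pair_weights[:, None]).mean(axis=0) + l2 * w
        w -= lr * grad
    return w


def concordance(scores, y):
    """Fraction of activity-ordered pairs that the scores order correctly."""
    i_idx, j_idx = np.where(y[:, None] > y[None, :])
    return float(np.mean(scores[i_idx] > scores[j_idx]))


# Synthetic stand-in for one bioassay: 30 compounds, 5 descriptors.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 5))
y_demo = X_demo @ np.array([1.5, -1.0, 0.5, 0.0, 2.0])

w_demo = fit_pairwise_ranker(X_demo, y_demo)
print("pairwise concordance:", concordance(X_demo @ w_demo, y_demo))
```

The L2 term stands in for the "regularized optimization framework" only in the loosest sense; information from other bioassays would enter through additional terms or additional training pairs, which this sketch does not attempt.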

Research Products

  • Multi-assay-based compound prioritization

    Junfeng Liu and Xia Ning*. Multi-assay-based compound prioritization via assistance utilization: A machine learning framework. Journal of Chemical Information and Modeling, 57(3):484--498, 2017. PMID: 28234477. [ bib | DOI | www ]

    Abstract: Effective prioritization of chemical compounds that show promising bioactivities from compound screenings represents a first critical step toward identifying successful drug candidates. Current development in computational approaches for compound prioritization is largely focused on devising advanced ranking algorithms that better learn the ordering among compounds. However, such methodologies are fundamentally limited by the scarcity of available data, particularly when the screenings are conducted at a relatively small scale over known promising compounds. Instead, in this work, we explore the structures of bioassay space and leverage such structures to improve ranking performance of an existing strong ranking algorithm. This is done by identifying assistance bioassays and assistance compounds intelligently and leveraging such assistance within the existing ranking algorithm. By leveraging the assistance bioassays and assistance compounds, the data scarcity can be properly mitigated. Along this line, we develop a suite of assistance bioassay selection methods and assistance compound selection methods. Our experiments demonstrate an overall 8.34% improvement on the ranking performance over the state of the art.

    Code and data are available here.

  • Differential compound prioritization

    Junfeng Liu and Xia Ning*. Differential compound prioritization via bi-directional selectivity push with power. In The 8th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, 2017. accepted. [ bib ]

    Abstract: Effective in silico compound prioritization is critical to identify promising candidates in the early stages of drug discovery. Current methods typically focus on compound ranking based on one single property, for example, activity, against a single target. However, compound selectivity is also a key property that should be deliberated simultaneously so as to reduce the likelihood of undesired side effects of future drugs. In this paper, we present a novel machine-learning-based differential compound prioritization method, dCPPP. This dCPPP method learns compound prioritization models that rank active compounds well, and meanwhile, preferably rank selective compounds higher via a bi-directional push strategy. The bi-directional push is enhanced with push powers that are determined by ranking difference of selective compounds over multiple bioassays. Our experiments demonstrate that the dCPPP achieves an overall 19.22% improvement on prioritizing selective compounds over baseline models.

    Code and data are available here.

This material is based upon work supported by the National Science Foundation under Grant Number IIS-1566219. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.