This software implements Anomalous Sample Phenotype Identification with Random Effects (ASPIRE) for joint clustering and cluster matching across a batch of samples. We model the data in each sample by a mixture of potentially infinitely many Dirichlet process Gaussian mixture models (DPMs), with each DPM modeling the local distribution of a single class or meta-cluster. Each DPM is itself a mixture of potentially infinitely many Gaussian distributions, with the actual number of mixture components determined directly from the data during inference. Modeling local class distributions by DPMs therefore offers the flexibility needed to accommodate data distributions that are skewed and/or multi-modal.
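To give some intuition for how the number of mixture components can be "determined directly from the data", the sketch below draws cluster labels from a Chinese restaurant process, the partition prior induced by a Dirichlet process. It is only an illustration of the prior, not ASPIRE's inference code; the function name and the concentration parameter alpha are assumptions made for this example.

```python
import random

def crp_assignments(n_points, alpha, seed=0):
    """Draw cluster labels for n_points from a Chinese restaurant process.

    alpha is the (hypothetical) concentration parameter: larger values
    make it more likely that a point opens a new cluster.
    """
    rng = random.Random(seed)
    labels = []
    counts = []  # counts[k] = number of points already assigned to cluster k
    for _ in range(n_points):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is opened with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        labels.append(k)
    return labels, counts

labels, counts = crp_assignments(n_points=200, alpha=1.0)
print("number of clusters drawn:", len(counts))
```

The number of clusters is not fixed in advance; it grows slowly with the number of points, and during inference the data decide which clusters are kept.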
As local distributions of a given class are noisy realizations of the true class distribution, we introduce a sharing mechanism to create dependencies across DPMs associated with the same class. This is achieved by centering the base distributions of DPMs associated with the same class on a unique global parameter, which is itself distributed according to a higher-level DPM. This global DPM not only associates local distributions of a given class with one another but also models the number and proportions of classes in each sample.
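The sharing mechanism can be pictured with a heavily simplified, one-dimensional simulation. In this sketch each class has one global center, and every sample sees a noisy local copy of it (the random effect). It assumes a single Gaussian per class and that every class appears in every sample, which is much simpler than the actual model; all names and parameter values are hypothetical.

```python
import random

def simulate_samples(global_means, n_samples, tau, sigma, n_per_class, seed=0):
    """Simulate samples whose class centers are perturbed by random effects.

    global_means: one global center per class (hypothetical values)
    tau:   standard deviation of the local center around the global center
    sigma: standard deviation of data points around the local center
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        sample = []
        for class_id, mu in enumerate(global_means):
            # Random effect: the local center is a noisy copy of the global one.
            local_mu = rng.gauss(mu, tau)
            for _point in range(n_per_class):
                sample.append((class_id, rng.gauss(local_mu, sigma)))
        samples.append(sample)
    return samples

samples = simulate_samples(global_means=[0.0, 10.0], n_samples=3,
                           tau=1.0, sigma=0.5, n_per_class=50)
print("points per sample:", [len(s) for s in samples])
```

Because the local centers differ from sample to sample, clustering each sample independently and then matching clusters is hard; tying the local centers to a shared global parameter is what lets clusters be matched across samples.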
We use a collapsed Gibbs sampler to perform inference. Model learning, which is performed in a single unified process, involves three main tasks: recovering DPMs in each sample, finding class associations of DPMs, and identifying the total number of classes and their proportions in each sample.
ASPIRE can identify, in a completely unsupervised way, recurring classes (both normal and rare) across a batch of samples that are significantly perturbed by random effects. It can characterize normal as well as anomalous states given only very weak assumptions regarding sample characteristics and origin.
The algorithm is developed in C++. The executables were built in a 64-bit Windows 7 environment using 32-bit and 64-bit release configurations of the compiler. Multi-threading is provided by C++11 features. An example Matlab wrapper function (demo) is provided in the package, but the software can also run as a stand-alone executable if the required input files are provided in the same format.
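Input files are in a binary matrix file format. In this format, two leading integers define the number of rows and columns of the matrix, and the doubles that follow are the matrix entries. For example, a 2 x 3 matrix such as the one below would have the leading integers 2 and 3.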
1.0 1.0 1.0
2.0 2.0 2.0
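The binary matrix format described above can be produced and read back with a few lines of code. The following is a minimal sketch, not the package's own writer. It assumes 32-bit little-endian integers, 64-bit doubles, and row-by-row storage of the entries; none of these details are stated in this document, so check them against the Matlab demo or the C++ source before relying on them.

```python
import os
import struct
import tempfile

def write_matrix(path, rows):
    """Write a list of equal-length rows as: n_rows, n_cols, then the entries.

    ASSUMPTIONS (not confirmed by this README): 32-bit little-endian
    integers, 64-bit doubles, entries stored row by row.
    """
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    with open(path, "wb") as f:
        f.write(struct.pack("<ii", n_rows, n_cols))
        for row in rows:
            if len(row) != n_cols:
                raise ValueError("all rows must have the same length")
            f.write(struct.pack("<%dd" % n_cols, *row))

def read_matrix(path):
    """Read a matrix written by write_matrix (same assumptions)."""
    with open(path, "rb") as f:
        n_rows, n_cols = struct.unpack("<ii", f.read(8))
        count = n_rows * n_cols
        flat = struct.unpack("<%dd" % count, f.read(8 * count))
    return [list(flat[r * n_cols:(r + 1) * n_cols]) for r in range(n_rows)]

# Round trip of the 2 x 3 example matrix through a temporary file.
example = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]
with tempfile.TemporaryDirectory() as tmp:
    example_path = os.path.join(tmp, "example.matrix")  # hypothetical file name
    write_matrix(example_path, example)
    example_size = os.path.getsize(example_path)
    restored = read_matrix(example_path)
print(restored)
```

If the files are written from Matlab instead, keep in mind that Matlab stores matrices column by column, so the entry order may differ from the row-by-row order assumed here.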
Copy everything in the compressed file into a new directory and add the location of that directory to the Matlab path. Run aspire_demo.m in the command window. If the demo runs without any problems, the executable is compatible with your system and you are ready to use the software.
If you receive an error, try compiling the source files yourself. Create an empty project, add the source files to it, and compile in release mode with a C++11-compatible compiler. Once you have the executable, use it to replace the existing aspiremXX.exe in the folder. To run from your IDE, provide the paths of the input files as command-line arguments. The command-line usage of ASPIRE is as follows:
aspire.exe datafile groupidsfile priorfile parametersfile #sweeps #burnin #threads
The matrix sizes of the four files should be n x d, n x 1, (d+1) x d, and 1 x 6, respectively.
datafile: a data matrix with each row hosting an individual data point in a d-dimensional space.
groupidsfile: a vector of group ids.
priorfile: file containing the prior covariance and prior mean.
parametersfile: file containing the hyperparameters.
The last three arguments are optional; their default values are 1000, 100, and 1, respectively.
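The argument order and expected matrix sizes above can be checked before launching a run. The sketch below builds the command line and validates the four matrix shapes. The helper names and file names are hypothetical, and all three optional arguments are always passed explicitly (using the documented defaults) so that no assumption is needed about how the executable handles partially omitted arguments.

```python
def check_shapes(data_shape, groupids_shape, prior_shape, params_shape):
    """Raise ValueError unless the shapes are n x d, n x 1, (d+1) x d, 1 x 6."""
    n, d = data_shape
    expected = {
        "groupidsfile": (n, 1),
        "priorfile": (d + 1, d),
        "parametersfile": (1, 6),
    }
    actual = {
        "groupidsfile": tuple(groupids_shape),
        "priorfile": tuple(prior_shape),
        "parametersfile": tuple(params_shape),
    }
    for name, shape in expected.items():
        if actual[name] != shape:
            raise ValueError("%s should be %s, got %s" % (name, shape, actual[name]))

def build_command(exe, datafile, groupidsfile, priorfile, parametersfile,
                  sweeps=1000, burnin=100, threads=1):
    """Return the argument list in the documented positional order."""
    return [exe, datafile, groupidsfile, priorfile, parametersfile,
            str(sweeps), str(burnin), str(threads)]

# Hypothetical file names, for illustration only.
check_shapes((500, 3), (500, 1), (4, 3), (1, 6))
cmd = build_command("aspire.exe", "data.matrix", "groupids.matrix",
                    "prior.matrix", "params.matrix", threads=4)
print(" ".join(cmd))
# To actually launch the run (requires the executable and the input files):
# import subprocess; subprocess.run(cmd, check=True)
```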
The source code also includes the following third-party file:
gammaln from the lightspeed package. Please see the comment lines in that file for more information.
ASPIRE is released under the terms of the GNU General Public License as published by the Free Software Foundation, without any warranty.