DOWNLOAD

Download the data and source code from our previously published work.

Positive Synergy Index

In the past decades, a few synergistic feature selection algorithms have been published, including Cooperative Index (CI) and k-Top Scoring Pairs (k-TSP). These algorithms consider the synergistic behavior of features when they are included in a feature panel.
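To make the idea of a pairwise, synergy-aware score concrete, the following is a minimal sketch of the top scoring pair criterion as it is commonly described: a feature pair (i, j) is scored by how much the frequency of the ordering x_i < x_j differs between the two classes. The function names, the toy data, and the greedy selection of disjoint pairs are illustrative assumptions, not the code distributed on this page.

```python
from itertools import combinations


def tsp_score(samples, labels, i, j):
    """Return |P(x_i < x_j | class 0) - P(x_i < x_j | class 1)| for one pair."""
    freq = {}
    for cls in (0, 1):
        rows = [s for s, y in zip(samples, labels) if y == cls]
        freq[cls] = sum(1 for s in rows if s[i] < s[j]) / len(rows)
    return abs(freq[0] - freq[1])


def top_scoring_pairs(samples, labels, k=1):
    """Greedily pick up to k feature pairs with the highest scores.

    Assumption for this sketch: selected pairs must not share a feature.
    """
    n_features = len(samples[0])
    scored = sorted(
        ((tsp_score(samples, labels, i, j), i, j)
         for i, j in combinations(range(n_features), 2)),
        key=lambda item: -item[0],
    )
    chosen, used = [], set()
    for score, i, j in scored:
        if i in used or j in used:
            continue
        chosen.append((score, i, j))
        used.update((i, j))
        if len(chosen) == k:
            break
    return chosen


# Hypothetical toy data: 6 samples, 3 features, binary labels.
samples = [[1, 5, 3], [2, 6, 1], [1, 4, 2], [5, 1, 3], [6, 2, 4], [7, 3, 1]]
labels = [0, 0, 0, 1, 1, 1]
print(top_scoring_pairs(samples, labels, k=1))  # [(1.0, 0, 1)]
```

Scoring every pair takes a number of evaluations that grows quadratically with the number of features, which illustrates why computational cost is a concern for pairwise methods on microarray data with thousands of genes.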

Although these algorithms have shown promising results, a comprehensive and fair comparison with other feature selection algorithms across a large number of microarray datasets, in terms of classification accuracy and computational complexity, has been lacking.

There is a need to evaluate the performance of such algorithms and to reduce their complexity. We compared the performance of synergistic feature selection algorithms with 11 other commonly used algorithms on 22 binary-class microarray gene expression datasets. The evaluation confirms that synergistic algorithms such as CI and k-TSP gradually increase classification performance as more features are used in the classifiers.

In addition, to cut down computational cost, we proposed a new feature ranking score called the Positive Synergy Index (PSI). Test results show that features selected using PSI, like those selected by other synergistic feature selection algorithms, provide better performance than all other methods, while PSI has a significantly lower computational complexity than the other synergistic algorithms.

Bari, M. G., Salekin, S. and Zhang, J. (2016), A Robust and Efficient Feature Selection Algorithm for Microarray Data. Mol. Inf. doi:10.1002/minf.201600099.

Early Response Index

Identifying disease-correlated features early, before disease progression causes significant abundance changes in large numbers of molecules, is highly advantageous to biologists developing biomarkers for early disease diagnosis. Disease-correlated features show relatively small abundance changes at early stages. Finding them in high-throughput data with existing bioinformatics tools is challenging because the technology suffers from limited dynamic range and significant noise. Most existing biomarker discovery algorithms can only detect molecules with large abundance changes and therefore frequently miss early diagnostic markers.

In this work, we present a new statistic called the early response index (ERI) to prioritize disease-correlated molecules as potential early biomarkers. Rather than the classification accuracy of a feature on its own, ERI measures the average improvement in classification accuracy attainable when the feature is combined with other features for classification. ERI is more sensitive to abundance changes than other ranking statistics. We have shown that ERI significantly outperforms SAM and Localfdr in detecting early-responding molecules in a proteomics study of a mouse model of multiple sclerosis.

Importantly, ERI detected many disease-relevant proteins at an earlier time point than those algorithms did. The ERI method is more sensitive in detecting significant features during the early stages of disease development. It potentially has higher specificity for biomarker discovery and can be used to identify the critical time frame for disease intervention.
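As a rough illustration of the idea behind ERI described above (average accuracy improvement when a feature is combined with others), the sketch below computes, for one feature, the mean gain in leave-one-out accuracy obtained by adding it to each other single feature. This is only one possible reading of that description and is not the published ERI formula: the nearest-centroid classifier, the pairing with single features, the function names, and the toy data are all assumptions made for this example.

```python
def nearest_centroid_accuracy(samples, labels, features):
    """Leave-one-out accuracy of a nearest-centroid classifier on a feature subset.

    Assumes every class has at least two samples.
    """
    correct = 0
    for held in range(len(samples)):
        centroids = {}
        for cls in set(labels):
            rows = [s for idx, (s, y) in enumerate(zip(samples, labels))
                    if y == cls and idx != held]
            centroids[cls] = [sum(r[f] for r in rows) / len(rows) for f in features]
        # Squared Euclidean distance from the held-out sample to a class centroid.
        predicted = min(
            sorted(centroids),
            key=lambda cls: sum((samples[held][f] - c) ** 2
                                for f, c in zip(features, centroids[cls])),
        )
        correct += predicted == labels[held]
    return correct / len(samples)


def mean_accuracy_gain(samples, labels, feature):
    """Average accuracy gain from adding `feature` to each other single feature."""
    others = [g for g in range(len(samples[0])) if g != feature]
    gains = [
        nearest_centroid_accuracy(samples, labels, [feature, g])
        - nearest_centroid_accuracy(samples, labels, [g])
        for g in others
    ]
    return sum(gains) / len(gains)


# Hypothetical toy data: feature 0 separates the classes, feature 1 does not.
samples = [[0.0, 1.0], [0.1, 0.0], [0.2, 1.0], [1.0, 0.0], [1.1, 1.0], [1.2, 0.0]]
labels = [0, 0, 0, 1, 1, 1]
for f in range(2):
    print(f, round(mean_accuracy_gain(samples, labels, f), 3))
```

On this toy data, feature 0 improves the accuracy of feature 1 by one third, while feature 1 adds nothing to feature 0, so a ranking by mean gain places feature 0 first.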

Source code

Download the PSI algorithm source code from the GitHub repository. The code is written in MATLAB and ranks the features in gene expression microarray data.

Datasets

In the PSI study, 22 microarray datasets collected from different types of cancer are used to evaluate various feature selection algorithms. MAT-file versions of the datasets are provided in Table 1.

Table 2 and Table 3 provide the datasets used in the ERI study.

Table 1: PSI - Datasets

# Data Name Features Class-sizes Download
1 All 12625 95/33
2 Brain 12626 15/14
3 Breast1 4948 44/34
4 Breast2 22284 138/71
5 Carcinoma 7457 18/18
6 CNS 7129 39/21
7 Colon 2000 40/22
8 DLBCL1 7129 58/19
9 DLBCL2 4026 24/23
10 GCM 16063 190/90
11 GLI-85 22283 59/26
12 GSE14333 54675 138/91
13 GSE24514 22215 34/15
14 GSE27854 54675 58/57
15 Leukemia 7129 47/25
16 Lung 12533 150/31
17 Ovarian 15154 162/91
18 Prostate1 10509 52/50
19 Prostate2 12626 50/38
20 Prostate3 12600 78/58
21 SMK 19993 97/90
22 SRBCT 2308 29/25

Table 2: ERI - EAE Dataset

# Description Download
1 EAE dataset between day 0 and 5
2 EAE dataset between day 0 and 25
3 All other days of EAE dataset

Table 3: ERI - Clinical Datasets

Dataset Genes Sample class size Download
GSE14333 54675 138/91
GSE27854 54675 57/58
Breast Cancer 22284 138/71
CNS 7129 12/39
Colon Cancer 2000 40/22
GLI-85 22283 26/59
Lung Cancer 7129 24/62
Prostate Cancer 10509 50/52
SMK-CAN-187 19993 90/97