SCFIA: Statistical Corresponding Features Identification Algorithm for LC/MS

Jian Cui1, Xuepo Ma1, Long Chen1, Ashoka Polpitiya2 and Jianqiu Zhang1*


1Department of Electrical and Computer Engineering, The University of Texas at San Antonio, One UTSA Circle, San Antonio, TX 78249

2Center for Proteomics, Translational Genomics Research Institute, 445 N. 5th St., 4th Floor, Phoenix, AZ 85004


Email addresses:
Jianqiu Zhang:

Jian Cui:


Long Chen:



Identifying corresponding features (LC peaks registered by the same peptide) across multiple Liquid Chromatography/Mass Spectrometry (LC/MS) datasets plays a crucial role in the analysis of complex peptide or protein mixtures. Warping functions are commonly used to correct elution time shifts between two LC/MS datasets so that corresponding features can be identified. Although a warping function can correct the mean elution time shift, it cannot by itself resolve the ambiguity completely, because elution time shifts are random. Instead, we propose a Statistical Corresponding Features Identification Algorithm (SCFIA) based on both the time shift and the similarity of LC peak shapes between corresponding feature pairs. SCFIA first trains statistical models of corresponding features; all candidate corresponding features are then scored by these models to find the maximum likelihood match. We test our algorithm on publicly available datasets and compare its performance with that of warping function based methods. Our method significantly improves both the accuracy and the number of aligned features.
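The train-then-score idea described above can be illustrated with a minimal Python sketch. It is not the SCFIA implementation: it assumes that the time shift and the peak shape similarity of corresponding pairs are independent and each follow a Gaussian distribution, and that a set of known corresponding pairs is available for training. The function names (train_model, score_pair, best_match) and the numbers are hypothetical.

```python
import math
from statistics import mean, stdev


def gaussian_logpdf(x, mu, sigma):
    """Log density of a normal distribution N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def train_model(training_pairs):
    """Fit one Gaussian per attribute from known corresponding pairs.

    training_pairs: list of (time_shift, shape_similarity) tuples.
    Assumption: both attributes are Gaussian and independent.
    """
    shifts = [shift for shift, _ in training_pairs]
    sims = [sim for _, sim in training_pairs]
    return {
        "shift": (mean(shifts), stdev(shifts)),
        "similarity": (mean(sims), stdev(sims)),
    }


def score_pair(model, time_shift, similarity):
    """Log-likelihood that a candidate pair is a true correspondence."""
    return (gaussian_logpdf(time_shift, *model["shift"])
            + gaussian_logpdf(similarity, *model["similarity"]))


def best_match(model, candidates):
    """Return the id of the candidate with the highest log-likelihood.

    candidates: list of (candidate_id, time_shift, shape_similarity).
    """
    return max(candidates, key=lambda c: score_pair(model, c[1], c[2]))[0]


# Hypothetical training data: (time shift in minutes, peak shape correlation).
model = train_model([(1.0, 0.95), (1.2, 0.90), (0.8, 0.92), (1.1, 0.97), (0.9, 0.91)])

# Three candidate features in the second dataset for one feature in the first.
candidates = [("A", 1.05, 0.93), ("B", 3.00, 0.94), ("C", 1.00, 0.50)]
print(best_match(model, candidates))
```

In this toy example, candidate B is penalized for an atypical time shift and candidate C for a dissimilar peak shape, which shows why scoring both attributes together can resolve cases that a time correction alone leaves ambiguous.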



In this paper, we propose a new method, the Statistical Corresponding Features Identification Algorithm (SCFIA), to identify corresponding features in different datasets. We verify the algorithm on two Super-SILAC datasets, where it performs better than the warping function and OpenMS. SCFIA proves stable when different prophet scores are chosen. We then apply SCFIA to three datasets from each of two data groups: the first group is fraction data and the second group is replicate data. The results show that many more peptides can be identified in the three datasets than are found in their intersection. The algorithm determines the intervals of the peptides in their union. In the future, we plan to focus on peptide identification in multiple LC/MS datasets without LC-MS/MS information.

Data, figures, results and source code

·         Data is available at

·         The first group is

20090608_Orbi6_TaGe_SA_TUMOR_5mix1_01.raw (dataset Q1)

20090608_Orbi6_TaGe_SA_TUMOR_5mix1_02.raw (dataset Q2)

20090608_Orbi6_TaGe_SA_TUMOR_5mix1_03.raw (dataset Q3)

·         The second group is

200090815_Velos5_TaGe_SA_Silacmix_TOP15_01.raw (dataset Q1)

200090815_Velos5_TaGe_SA_Silacmix_TOP15_01.raw (dataset Q2)

200090815_Velos5_TaGe_SA_Silacmix_TOP15_01.raw (dataset Q3)

·         The demo code file:

·         Algorithm verification demo

·         Group1 Data X!tandem verification demo

·         Group1 Data MaxQuant verification demo

·         Group2 Data X!tandem verification demo

·         Group2 Data MaxQuant verification demo