Predicting Protein Abundance based on Mass Spectrometry using Machine Learning (#137)
Motivation:
Protein quantification has long been a central problem in proteomics. MaxQuant, currently one of the most commonly used software packages for protein quantification, employs state-of-the-art methods based on peptide intensities. With state-of-the-art protein quantification methods such as iBAQ, it has been shown that the accuracy of protein quantification depends on protein abundance: highly abundant proteins can be quantified much more accurately than low-abundance proteins. Other quantification methods based on peptide ion intensities have the same problem (Wilhelm, et al. 2014).
The mass spectrometry intensities of a protein's peptides have been shown to correlate with the protein's abundance and with the peptides' chemical and biophysical features (Scherbart, et al. 2009). In this study, we develop machine learning models of the relationship between the chemical and biophysical features of peptides and the abundance of their proteins, in an attempt to improve the accuracy of MS-based protein quantification.
Results:
Machine learning models, including Support Vector Machines and Artificial Neural Networks (including Deep Learning models) with various topologies, were developed and used to model a set of 690 chemical and biophysical features, including AAindex features (Kawashima, et al. 2008), associated with each peptide.
First, our machine learning models were trained on AQUA data from U2OS (4490 proteins). The Pearson correlation coefficient R between our cross-validated predicted protein abundance and the AQUA abundance is substantially higher than that between the iBAQ and AQUA abundance (Pearson R increases from 0.806 to 0.90578). This indicates that our protein and peptide feature set can help improve the accuracy of protein abundance calculation via machine learning models.
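The evaluation described above, scoring out-of-fold predictions against a reference abundance with the Pearson correlation, can be sketched as follows. This is a minimal illustration under stated assumptions and not the models used in the study: a one-feature least-squares line stands in for the SVM and neural network models, the data are synthetic, and all names (`cross_val_predict`, `log_intensity`, `reference`) are hypothetical.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (stand-in model)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def cross_val_predict(xs, ys, k=5, seed=0):
    """Return out-of-fold predictions: each sample is predicted by a model
    that was fitted without it."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    preds = [0.0] * len(xs)
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in held_out:
            preds[i] = slope * xs[i] + intercept
    return preds


# Synthetic data only: a hypothetical log-intensity feature and a noisy
# "reference" log abundance. No values here come from the study.
rng = random.Random(42)
log_intensity = [rng.uniform(4.0, 10.0) for _ in range(200)]
reference = [0.9 * x + 1.0 + rng.gauss(0.0, 0.5) for x in log_intensity]

predicted = cross_val_predict(log_intensity, reference, k=5)
r_cv = pearson_r(predicted, reference)
print(f"cross-validated Pearson r = {r_cv:.3f}")
```

Computing R on out-of-fold predictions matters because a flexible model scored on its own training data would give an optimistic correlation; the same protocol applies whatever regressor replaces `fit_line`.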
Second, we compared our peptide feature-based protein abundance prediction model with MaxLFQ (Cox, et al. 2014) using a dataset consisting of proteins spiked in at known ratios of SILAC-labelled protein. Our model was trained on the set of proteins whose SILAC ratio accurately corresponded to the expected ratio. The model was then applied to predict the abundance of all proteins (4446 proteins) in the dataset. To date, the predicted protein abundances show the potential to yield a better protein abundance ratio distribution than MaxLFQ, and our research is continuing to further optimize the performance of the models.
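One way to read the training-set selection above is as a filter on how far each measured SILAC ratio lies from the expected spike-in ratio. The sketch below is illustrative only: the abstract does not state the selection criterion, so the log2 tolerance (`max_log2_error`), the function name, and the protein IDs and ratios are all assumptions.

```python
import math


def select_training_proteins(measured_ratios, expected_ratio, max_log2_error=0.25):
    """Keep protein IDs whose measured ratio lies within max_log2_error
    (in log2 units) of the expected spike-in ratio."""
    selected = []
    for protein_id, ratio in measured_ratios.items():
        if ratio is None or ratio <= 0:
            continue  # skip missing or non-positive ratios
        if abs(math.log2(ratio / expected_ratio)) <= max_log2_error:
            selected.append(protein_id)
    return selected


# Hypothetical protein IDs and ratios, for illustration only.
measured = {"P1": 2.1, "P2": 3.5, "P3": 1.9, "P4": None, "P5": 0.5}
training_ids = select_training_proteins(measured, expected_ratio=2.0)
print(training_ids)  # ['P1', 'P3']
```

Comparing ratios on a log2 scale treats a twofold overestimate and a twofold underestimate symmetrically, which is the usual convention for fold changes.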
Conclusion:
Our study shows that considering peptide features when estimating protein abundance from MS-based data can improve the accuracy of the measurement. This has important implications, particularly for label-free protein quantification and for studies that correlate gene expression with protein abundance.
- Kawashima, S., Ogata, H., and Kanehisa, M. (1999). "AAindex: amino acid index database." Nucleic Acids Res. 27: 368-369.
- Cox, J., et al. (2014). "MaxLFQ allows accurate proteome-wide label-free quantification by delayed normalization and maximal peptide ratio extraction." Molecular & Cellular Proteomics: mcp.M113.031591.
- Scherbart, A., et al. (2009). "Improved mass spectrometry peak intensity prediction by adaptive feature weighting." Advances in Neuro-Information Processing, Springer: 513-520.
- Wilhelm, M., et al. (2014). "Mass-spectrometry-based draft of the human proteome." Nature 509(7502): 582-587.