Development of bioinformatics platform for analyzing MS-based protein identification and quantification

Authors
 조진영 
Issue Date
2017
Description
Integrated OMICS for Biomedical Science World Class University/Doctoral degree
Abstract
The human reference genome, approximately 2.9 billion base pairs long, is known to encode some 20,000 representative proteins. However, about 3,000 of these proteins, i.e., about 15% of all proteins, have no or very weak proteomic evidence and are termed missing proteins. Missing proteins may be present only in rare samples, at very low abundance, or with only transient expression, which makes them difficult to detect in protein profiling. Technical limitations also leave these proteins unassigned. For example, current mass spectrometry (MS) techniques have detection limits and high error rates for complex biological samples. Insufficient proteome coverage of the reference sequence database (DB) and of spectral libraries is also a major issue. Thus, a better search strategy with greater sensitivity and accuracy is needed to find missing proteins. To this end, we used a new strategy that combines reference spectral library searching with simulated spectral library (simSPL) searching to identify missing proteins. We built the human iRefSPL, which contains the original human reference spectral library and additional peptide sequence-spectrum match entries from other species. We also built the human simSPL, which contains spectra of 173,907 human tryptic peptides simulated with MassAnalyzer (version 2.3.1).
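The tryptic peptides that feed a simulated spectral library can be enumerated by in-silico digestion. The sketch below is illustrative only and is not taken from the dissertation: it assumes the common trypsin rule (cleave after K or R unless the next residue is P), and the function name, the `min_len` default, and the `max_missed` parameter are hypothetical choices. Spectrum simulation itself (done with MassAnalyzer in this work) is not shown.

```python
def tryptic_digest(protein, min_len=7, max_missed=0):
    """Return tryptic peptides of a protein sequence.

    Assumed rule: cleave after K or R unless followed by P.
    min_len and max_missed are illustrative defaults, not values
    reported in the abstract.
    """
    # Cut the sequence into fully cleaved pieces.
    pieces, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            pieces.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        pieces.append(protein[start:])  # C-terminal remainder

    # Join adjacent pieces to allow up to max_missed missed cleavages.
    peptides = []
    for i in range(len(pieces)):
        last = min(i + max_missed, len(pieces) - 1)
        for j in range(i, last + 1):
            pep = "".join(pieces[i:j + 1])
            if len(pep) >= min_len:
                peptides.append(pep)
    return peptides
```

For a toy sequence such as `"AAAAAAKPGGGGGRCCCCCCCK"`, the K followed by P is not cleaved, so the digest yields two peptides rather than three.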
To demonstrate the enhanced analytical performance of combining the human iRefSPL and simSPL, called the “Combo-Spec Search method”, for the identification of missing proteins, we re-analyzed a placental tissue dataset (PXD000754). The data from each experiment were analyzed with PeptideProphet, and the results were combined with iProphet. For quality control, we applied a class-specific false-discovery rate (FDR) filtering method. All results were filtered at less than 1% FDR at the peptide and protein levels. The quality-controlled results were cross-checked against the neXtProt DB (2014-09-19 release). The two spectral libraries, iRefSPL and simSPL, were designed to have no overlapping proteome coverage. They proved complementary in spectral library searching and significantly increased the number of matches. In this trial, 12 missing proteins were newly identified that passed the criterion: at least two peptides of 7 or more amino acids, or one peptide of 9 or more amino acids, with one or more unique sequences. Thus, the combination of iRefSPL and simSPL can help identify peptides that had not been detected by conventional sequence DB searches, with improved sensitivity and a low error rate.
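The peptide-evidence criterion stated above can be written as a small check. This is a minimal sketch under one reading of the abstract, namely that at least one of the supporting peptides must map uniquely to the protein; the function name and the `(sequence, is_unique)` input format are hypothetical and do not come from the dissertation.

```python
def passes_missing_protein_criterion(peptides):
    """Check the peptide evidence for one protein.

    peptides: iterable of (sequence, is_unique) tuples, where is_unique
    marks a peptide that maps to this protein only (assumed input format).
    """
    # Collapse repeated identifications of the same sequence.
    distinct = {}
    for seq, is_unique in peptides:
        distinct[seq] = distinct.get(seq, False) or is_unique

    # Assumed reading: at least one supporting peptide must be unique.
    if not any(distinct.values()):
        return False

    n_len7 = sum(1 for s in distinct if len(s) >= 7)
    n_len9 = sum(1 for s in distinct if len(s) >= 9)
    # Two or more peptides of >= 7 residues, or one peptide of >= 9 residues.
    return n_len7 >= 2 or n_len9 >= 1
```

Under this reading, a protein supported only by a single unique 8-residue peptide fails, while one supported by a single unique 11-residue peptide passes.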
We developed a new analytical software package, called Epsilon-Q, designed to support the Combo-Spec Search and label-free quantification methods. Epsilon-Q supports standard MS data formats and connects to SpectraST for spectrum-to-spectrum matching. Epsilon-Q automatically performs three operations on user-supplied datasets: raw MS data indexing, multiple spectral library searching, and calculation of the sum of precursor ion peak intensities. Using multi-threading, Epsilon-Q can perform multiple spectral library searches and parse the results concurrently. With its user-friendly graphical interface, Epsilon-Q showed good performance in identifying and quantifying proteins. In particular, for low-abundance proteins in biological samples, Epsilon-Q outperformed sequence DB search engines. Thus, we anticipate that Epsilon-Q will help users improve detectability in protein identification and perform comparative analyses of biological samples.
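The workflow described above (searching several spectral libraries in parallel, then summing precursor ion intensities) might be sketched as follows. This is not Epsilon-Q's actual code: `search_all_libraries`, `sum_precursor_intensities`, and the caller-supplied `search_fn` are hypothetical names, and the real tool delegates the spectrum-to-spectrum matching to SpectraST. Matches are assumed to be `(protein, precursor_intensity)` pairs.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def search_all_libraries(search_fn, libraries, spectra, max_workers=2):
    """Run search_fn(library, spectra) for each library in its own thread.

    search_fn is a caller-supplied stand-in for the real search step and
    must return a list of (protein, precursor_intensity) pairs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_library = pool.map(lambda lib: search_fn(lib, spectra), libraries)
        # Merge the per-library match lists, keeping library order.
        return [match for matches in per_library for match in matches]


def sum_precursor_intensities(matches):
    """Label-free quantity per protein: sum of matched precursor intensities."""
    totals = defaultdict(float)
    for protein, intensity in matches:
        totals[protein] += intensity
    return dict(totals)
```

Threads suit this sketch because each search is assumed to wait on an external process or file I/O; a CPU-bound pure-Python search would need processes instead.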

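The human genome consists of a sequence approximately 2.9 billion base pairs long, which is known to carry the expression information for some 20,000 representative proteins. However, about 15% of these proteins lack sufficient experimental evidence of their existence and are called “missing proteins”. These proteins are presumed to be hard to detect because they are expressed in trace amounts in highly localized regions, and the technical limitations of protein analysis also contribute. For example, current mass spectrometry techniques are limited in their ability to analyze complex biological samples completely, and the accuracy of sequence DB searches and the limited proteome coverage of peptide libraries are also challenges to be resolved. To solve these problems...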
Files in This Item:
T014410.pdf Download
Appears in Collections:
1. College of Medicine (의과대학) > Others (기타) > 3. Dissertation
URI
https://ir.ymlib.yonsei.ac.kr/handle/22282913/154947

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.