ABSTRACT
The rapid development of proteomics studies has resulted in large volumes of experimental data. The emergence of big data platforms provides the opportunity to handle these large amounts of data. The integrated proteome resource, iProX (https://www.iprox.cn), which was initiated in 2017, has been greatly improved with an up-to-date big data platform implemented in 2021. Here, we describe the main iProX developments since its first publication in Nucleic Acids Research in 2019. First, a hyper-converged architecture with high scalability supports the submission process. A Hadoop cluster can store large amounts of proteomics datasets, and a distributed, RESTful Elasticsearch engine can query millions of records within one second. In addition, several new features, including the Universal Spectrum Identifier (USI) mechanism proposed by ProteomeXchange, a RESTful Web Service API, and a high-efficiency reanalysis pipeline, have been added to iProX for better open data sharing. By the end of August 2021, 1526 datasets had been submitted to iProX, reaching a total data volume of 92.42 TB. With the implementation of the big data platform, iProX can support PB-level data storage, hundreds of billions of spectra records, and second-level latency service capabilities that meet the requirements of the fast-growing field of proteomics.
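As a rough illustration of the USI mechanism mentioned above, the sketch below splits a USI string into its colon-separated fields. It assumes the published layout `mzspec:<collection>:<run>:<index type>:<index>[:<interpretation>]`; the `USI` tuple and `parse_usi` helper are hypothetical names, the example identifier is purely illustrative, and nothing here reflects how iProX itself resolves identifiers.

```python
from typing import NamedTuple, Optional


class USI(NamedTuple):
    """Fields of a Universal Spectrum Identifier (hypothetical container)."""
    collection: str                 # dataset accession, e.g. a PXD identifier
    ms_run: str                     # name of the MS run inside the dataset
    index_type: str                 # "scan", "index", or "nativeId"
    index: str                      # spectrum index within the run
    interpretation: Optional[str]   # optional peptide/charge annotation


def parse_usi(usi: str) -> USI:
    """Split a USI into its fields.

    Simplification (assumption): run names that themselves contain ':'
    are not handled by this sketch.
    """
    parts = usi.split(":", 5)
    if len(parts) < 5 or parts[0] != "mzspec":
        raise ValueError(f"not a USI: {usi!r}")
    collection, ms_run, index_type, index = parts[1:5]
    if index_type not in {"scan", "index", "nativeId"}:
        raise ValueError(f"unknown index type: {index_type!r}")
    interpretation = parts[5] if len(parts) == 6 else None
    return USI(collection, ms_run, index_type, index, interpretation)


# Illustrative identifier only; it is not taken from the abstract.
example = parse_usi(
    "mzspec:PXD000561:Adult_Frontalcortex_bRP_Elite_85_f09:scan:17555:VLHPLEGAVVIIFK/2"
)
```

A parsed identifier exposes the dataset accession and spectrum index separately, which is the information a repository would need to locate a single spectrum.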
Subjects
Protein Databases , Proteome/genetics , Proteomics , Software , Big Data , Computational Biology/standards , Information Dissemination
ABSTRACT
Lysine succinylation (Ksucc) is involved in many core energy metabolism pathways and affects the metabolic process in mitochondria, making this modification highly valuable for studying diseases related to mitochondrial disorders. In this paper, we used liquid chromatography with tandem mass spectrometry (LC-MS/MS) to perform the first global profiling of succinylation in human lungs under normal physiological conditions. Using an MS-based platform, we identified 1485 Ksucc sites in 568 proteins. We then compared these sites with those previously identified in human succinylome studies to investigate specific succinylated proteins and identify their possible functions in the lung, and to explore the substrate preferences of succinylation modifiers in different cell lines and at different subcellular localizations. Our work expands the succinylation database and supplementary materials on the human succinylome and will thus help in further study of the function of Ksucc and its regulation under related physiological and pathological conditions.
Subjects
Lysine , Tandem Mass Spectrometry , Liquid Chromatography , Humans , Lung/metabolism , Lysine/metabolism , Post-Translational Protein Processing , Proteome/metabolism
ABSTRACT
BACKGROUND: Aging is a complex biological process accompanied by a time-dependent functional decline that affects most living organisms. Omics studies help to comprehensively understand the mechanism of aging and discover potential intervention methods. Old mice are frequently obese with a fatty liver. METHODS: We applied mass spectrometry-based phosphoproteomics to obtain a global phosphorylation profile of the liver in mice aged 2 or 18 months. MaxQuant was used for quantitative analysis and PCA was used for unsupervised clustering. RESULTS: Through phosphoproteome analysis, a total of 5,685 phosphosites in 2,335 proteins were filtered for quantitative analysis. PCA of both the phosphoproteome and transcriptome data could distinguish young and old mice. However, from kinase prediction, kinase-substrate interaction analysis, and KEGG functional enrichment analysis done with phosphoproteome data, we observed high phosphorylation of fatty acid biosynthesis, β-oxidation, and potential secretory processes, together with low phosphorylation of the Egfr-Sos1-Araf/Braf-Map2k1-Mapk1 pathway and Ctnnb1 during aging. Proteins with differentially expressed phosphosites seemed more directly related to the aging-associated fatty liver phenotype than the differentially expressed transcripts. The phosphoproteome may reveal distinctive biological functions that are lost in the transcriptome. CONCLUSIONS: In summary, we constructed a phosphorylation-associated network in the mouse liver during normal aging, which may help to discover novel antiaging strategies.
ABSTRACT
Sharing of research data in public repositories has become best practice in academia. With the accumulation of massive data, network bandwidth and storage requirements are rapidly increasing. The ProteomeXchange (PX) consortium implements a mode of centralized metadata and distributed raw data management, which promotes effective data sharing. To facilitate open access of proteome data worldwide, we have developed the integrated proteome resource iProX (http://www.iprox.org) as a public platform for collecting and sharing raw data, analysis results and metadata obtained from proteomics experiments. The iProX repository employs a web-based proteome data submission process and open sharing of mass spectrometry-based proteomics datasets. Also, it deploys extensive controlled vocabularies and ontologies to annotate proteomics datasets. Users can use a GUI to provide and access data through a fast Aspera-based transfer tool. iProX is a full member of the PX consortium; all released datasets are freely accessible to the public. iProX is based on a high availability architecture and has been deployed as part of the proteomics infrastructure of China, ensuring long-term and stable resource support. iProX will facilitate worldwide data analysis and sharing of proteomics experiments.
Subjects
Computational Biology/methods , Protein Databases , Proteome/metabolism , Proteomics/methods , Animals , Humans , Information Storage and Retrieval/methods , Internet , Metadata/statistics & numerical data , User-Computer Interface
ABSTRACT
Lysine crotonylation (Kcr) is a recently discovered post-translational modification that potentially regulates multiple biological processes. With the objective of expanding the available crotonylation datasets, LC-MS/MS is performed using mouse liver samples under normal physiological conditions to obtain an in vivo crotonylome. A label-free strategy is used and 10 034 Class I (localization probabilities > 0.75) crotonylated sites are identified in 2245 proteins. The KcrE, KcrD, and EKcr motifs are significantly enriched in the crotonylated peptides. The identified crotonylated proteins are mostly enzymes and primarily located in the cytoplasm and nucleus. Functional enrichment analysis based on Gene Ontology and Kyoto Encyclopedia of Genes and Genomes shows that the crotonylated proteins are closely related to the purine-containing compound metabolic process, ribose phosphate metabolic process, carbon metabolism pathway, ribosome pathway, and a series of metabolism-associated biological processes. To the best of the authors' knowledge, this research provides the first report on the mouse liver crotonylome. Furthermore, it offers additional evidence that crotonylation exists in non-histone proteins and is likely involved in various biological processes. The mass spectrometry proteomics data have been deposited in the ProteomeXchange Consortium with the dataset identifier PXD019145.
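The Class I criterion quoted above (localization probability > 0.75) amounts to a simple threshold filter. The sketch below shows that filter on a few made-up records; the `Site` record, its field names, and the example values are hypothetical and do not come from the deposited dataset.

```python
from dataclasses import dataclass
from typing import Iterable, List

CLASS_I_THRESHOLD = 0.75  # cutoff stated in the abstract


@dataclass(frozen=True)
class Site:
    """One candidate modification site (hypothetical record layout)."""
    protein: str
    position: int
    localization_prob: float


def class_i_sites(sites: Iterable[Site]) -> List[Site]:
    """Keep only sites whose localization probability exceeds the cutoff."""
    return [s for s in sites if s.localization_prob > CLASS_I_THRESHOLD]


# Invented example values, for illustration only.
candidates = [
    Site("P1", 12, 0.99),
    Site("P1", 40, 0.75),   # exactly at the cutoff, so not strictly greater
    Site("P2", 7, 0.80),
    Site("P3", 3, 0.40),
]
kept = class_i_sites(candidates)
proteins_with_class_i = {s.protein for s in kept}
```

Counting `kept` and `proteins_with_class_i` on real data would give the two figures the abstract reports, the number of sites and the number of proteins carrying them.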
Subjects
Lysine , Proteome , Animals , Liquid Chromatography , Liver/metabolism , Lysine/metabolism , Mice , Post-Translational Protein Processing , Proteome/metabolism , Tandem Mass Spectrometry
ABSTRACT
Mass spectrometry (MS) has become a predominant choice for large-scale absolute protein quantification, but its quantification accuracy still has substantial room for improvement. A crucial issue is the bias between the peptide MS intensity and the actual peptide abundance, i.e., the fact that peptides with equal abundance may have different MS intensities. This bias is mainly caused by the diverse physicochemical properties of peptides. Here, we propose an algorithm for label-free absolute protein quantification, LFAQ, which can correct the biased MS intensities by using the predicted peptide quantitative factors for all identified peptides. When validated on data sets produced by different MS instruments and data acquisition modes, LFAQ presented accuracy and precision superior to those of existing methods. In particular, it reduced the quantification error by an average of 46% for low-abundance proteins. The advantages of LFAQ were further confirmed using the data from published papers.
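To make the idea of intensity correction concrete, here is a minimal sketch. It assumes that correction means dividing each peptide intensity by its predicted quantitative factor and that a protein value is the mean of its corrected peptide intensities; the abstract states neither detail, the prediction model for the factors is not shown, and the function names and numbers are hypothetical.

```python
from typing import List, Tuple


def correct_intensity(intensity: float, q_factor: float) -> float:
    """Rescale a raw peptide intensity by its predicted quantitative factor.

    Assumption: the correction is a plain division.
    """
    if q_factor <= 0:
        raise ValueError("quantitative factor must be positive")
    return intensity / q_factor


def protein_abundance(peptides: List[Tuple[float, float]]) -> float:
    """Average the corrected intensities of a protein's peptides.

    `peptides` holds (raw intensity, predicted factor) pairs.
    Assumption: the mean is used for aggregation.
    """
    if not peptides:
        raise ValueError("no peptides given")
    corrected = [correct_intensity(i, q) for i, q in peptides]
    return sum(corrected) / len(corrected)


# Two peptides of one protein: equal abundance, different raw intensity.
# After correction with the (invented) factors they agree closely.
peptides = [(8.0e6, 0.8), (2.0e6, 0.2)]
abundance = protein_abundance(peptides)
```

The example mirrors the bias described in the abstract: the two raw intensities differ fourfold, yet both peptides map to nearly the same corrected value.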
Subjects
Algorithms , Peptides/analysis , Saccharomyces cerevisiae Proteins/analysis , Animals , Liquid Chromatography/methods , HEK293 Cells , Humans , Mice , RAW 264.7 Cells , Saccharomyces cerevisiae/chemistry , Tandem Mass Spectrometry/methods , Tandem Mass Spectrometry/statistics & numerical data
ABSTRACT
Although the "missing protein" is a temporary concept in C-HPP, the biological reasons why these proteins are "missing" could be an important clue in evolutionary studies. Here we classified missing-protein-encoding genes into two groups, the genes encoding PE2 proteins (with transcript evidence) and the genes encoding PE3/4 proteins (with no transcript evidence). These missing-protein-encoding genes are distributed unevenly among different chromosomes, chromosomal regions, or gene clusters. In terms of evolutionary features, PE3/4 genes tend to be young, located in nonhomologous chromosomal regions, and evolving at higher rates. Interestingly, the proportion of singletons in PE3/4 genes is higher than that in all genes (background) and in OTCSGs (organ, tissue, cell type-specific genes). More importantly, most of the paralogous PE3/4 genes belong to the newly duplicated members of the paralogous gene groups, which mainly contribute to special biological functions, such as "smell perception". These functions are heavily restricted to specific types of cells, tissues, or developmental stages, acting as the new functional requirements that facilitated the emergence of the missing-protein-encoding genes during evolution. In addition, criteria for proteins with extremely special physicochemical properties were first set up based on the properties of PE2 proteins, and the evolutionary characteristics of those proteins were explored. Overall, the evolutionary analyses of missing-protein-encoding genes are expected to be highly instructive for proteomics and functional studies in the future.
Subjects
Human Chromosomes , Proteins/physiology , Molecular Evolution , Gene Duplication , Humans , Proteins/chemistry , Proteins/genetics
ABSTRACT
As part of the Chromosome-Centric Human Proteome Project (C-HPP) mission, laboratories all over the world have tried to map all missing proteins (MPs) since 2012. On the basis of the first and second Chinese Chromosome Proteome Database (CCPD 1.0 and 2.0) studies, we developed systematic enrichment strategies to identify MPs that fell into four classes: (1) low molecular weight (LMW) proteins, (2) membrane proteins, (3) proteins that contained various post-translational modifications (PTMs), and (4) nucleic acid-associated proteins. Of 8845 proteins identified in 7 data sets, 79 proteins were classified as MPs. Among data sets derived from different enrichment strategies, data sets for LMW and PTM yielded the most novel MPs. In addition, we found that some MPs were identified in multiple data sets, which implied that tandem enrichment methods might improve the ability to identify MPs. Moreover, low expression at the transcription level was the major cause of the "missing" of these MPs; however, MPs with higher expression levels also evaded identification, most likely due to other characteristics such as LMW, high hydrophobicity and PTM. By combining a stringent manual check of the MS2 spectra with peptide synthesis verification, we confirmed 30 MPs (neXtProt PE2 ∼ PE4) and 6 potential MPs (neXtProt PE5) with authentic MS evidence. By integrating our large-scale data sets of CCPD 2.0, the number of identified proteins has increased considerably beyond simulation saturation. Here, we show that special enrichment strategies can break through the data saturation bottleneck, which could increase the efficiency of MP identification in future C-HPP studies. All 7 data sets have been uploaded to ProteomeXchange with the identifier PXD002255.
Subjects
Proteins/chemistry , Proteome , Adult , Aged , Aged 80 and over , Cell Line , Female , Humans , Male , Middle Aged , Tandem Mass Spectrometry
ABSTRACT
Investigations of missing proteins (MPs) are being supported by many bioanalytical strategies. We proposed that proteogenomics of testis tissue was a feasible approach to identify more MPs because testis tissues have higher gene expression levels. Here we combined proteomics and transcriptomics to survey gene expression in human testis tissues from three post-mortem individuals. Proteins were extracted and separated with glycine- and tricine-SDS-PAGE. A total of 9597 protein groups were identified; of these, 166 protein groups were listed as MPs, including 138 groups (83.1%) with transcriptional evidence. A total of 2948 proteins are designated as MPs, and 5.6% of these were identified in this study. The high incidence of MPs in testis tissue indicates that this is a rich resource for MPs. Functional category analysis revealed that the biological processes that testis MPs are mainly involved in are sexual reproduction and spermatogenesis. Some of the MPs are potentially involved in tumorigenesis in other tissues. Therefore, this proteogenomics analysis of individual testis tissues provides convincing evidence of the discovery of MPs. All mass spectrometry data from this study have been deposited in the ProteomeXchange (data set identifier PXD002179).
Subjects
Genomics , Proteins/metabolism , Proteomics , Testis/metabolism , Liquid Chromatography , Polyacrylamide Gel Electrophoresis , Humans , Male , Proteins/isolation & purification , RNA Sequence Analysis , Tandem Mass Spectrometry , Transcriptome
ABSTRACT
SUMMARY: With the advance of experimental technologies, different stable isotope labeling methods have been widely applied to quantitative proteomics. Here, we present an efficient tool named SILVER for processing stable isotope labeling mass spectrometry data. SILVER implements novel methods for quality control of quantification at the spectrum, peptide and protein levels. Several new quantification confidence filters and indices are used to improve the accuracy of quantification results. The performance of SILVER was verified and compared with MaxQuant and Proteome Discoverer using a large-scale dataset and two standard datasets. The results suggest that SILVER shows high accuracy and robustness while consuming much less processing time. Additionally, SILVER provides user-friendly interfaces for parameter setting, result visualization, manual validation and some useful statistical analyses. AVAILABILITY AND IMPLEMENTATION: SILVER and its source code are freely available under the GNU General Public License v3.0 at http://bioinfo.hupo.org.cn/silver.
Subjects
Liquid Chromatography/methods , Mass Spectrometry/methods , Peptides/analysis , Proteome/analysis , Proteomics/methods , Software , Isotope Labeling , Peptides/chemistry , Quality Control
ABSTRACT
To estimate the potential of state-of-the-art proteomics technologies for full coverage of the encoding gene products, the Chinese Human Chromosome Proteome Consortium (CCPC) applied a multiomics strategy to systematically analyze the transcriptome, translatome, and proteome of the same cultured hepatoma cells with varied metastatic potential, both qualitatively and quantitatively. The results provide a global view of gene expression profiles. The 9064 identified high-confidence proteins covered 50.2% of all gene products in the translatome. Proteins with functions in adhesion, development, reproduction, and so on are of low abundance in the transcriptome and translatome but absent from the proteome. Taking the translatome as the background of protein expression, we found that protein abundance plays a decisive role and that hydrophobicity has a greater influence than molecular weight and isoelectric point on protein detectability. Thus, the enrichment strategy used for low-abundance transcription factors helped to identify missing proteins. In addition, peptides with single amino acid polymorphisms played a significant role in disease research, although they might contribute negligibly to new protein identification. The raw proteome data and metadata were collected using the iProX submission system and submitted to ProteomeXchange (PXD000529, PXD000533, and PXD000535). All detailed information in this study can be accessed from the Chinese Chromosome-Centric Human Proteome Database.
Subjects
Protein Biosynthesis , Proteome , Transcriptome , Tumor Cell Line , Gene Expression Profiling , Humans , Mass Spectrometry
ABSTRACT
Our first proteomic exploration of human chromosome 1 began in 2012 (CCPD 1.0), and the genome-wide characterization of the human proteome through public resources revealed that 32-39% of proteins on chromosome 1 remain unidentified. To characterize all of the missing proteins, we applied an OMICS-integrated analysis of three human liver cell lines (Hep3B, MHCC97H, and HCCLM3) using mRNA and ribosome nascent-chain complex-bound mRNA deep sequencing and proteome profiling, contributing mass spectrometric evidence of 60 additional chromosome 1 gene products. Integration of the annotation information from public databases revealed that 84.6% of genes on chromosome 1 had high-confidence protein evidence. Hierarchical analysis demonstrated that the remaining 320 missing genes were either experimentally or biologically explainable; 128 genes were found to be tissue-specific or rarely expressed in some tissues, whereas 91 proteins were uncharacterized mainly due to database annotation diversity, 89 were genes with low mRNA abundance or unsuitable protein properties, and 12 genes were identifiable theoretically because of a high abundance of mRNAs/RNC-mRNAs and the existence of proteotypic peptides. The relatively large contribution made by the identification of enriched transcription factors suggested specific enrichment of low-abundance protein classes, and SRM/MRM could capture high-priority missing proteins. Detailed analyses of the differentially expressed genes indicated that several gene families located on chromosome 1 may play critical roles in mediating hepatocellular carcinoma invasion and metastasis. All mass spectrometry proteomics data corresponding to our study were deposited in the ProteomeXchange under the identifiers PXD000529, PXD000533, and PXD000535.
Subjects
Human Chromosome Pair 1 , Proteins/genetics , Tumor Cell Line , Humans , Proteomics
ABSTRACT
BioLadder (https://www.bioladder.cn/) is an online data analysis platform designed for proteomics research, which includes three classes of experimental data analysis modules and four classes of common data analysis modules. It allows for a variety of proteomics analyses to be conducted easily and efficiently. Additionally, most modules can also be utilized for the analysis of other omics data. To facilitate user experience, we have carefully designed four different kinds of functions for customers to quickly and accurately utilize the relevant analysis modules.
ABSTRACT
In this study, we examined the effect of multiple proteases (trypsin, LysC, tandem LysC/trypsin) on both protein identification and quantification in the Lys-labeled SILAC mouse liver. Our results show that trypsin and tandem LysC/trypsin digestion are superior to LysC in peptide and protein identification, while LysC shows advantages in quantification of Lys-labeled proteins. Combining experimental results from different proteases (LysC and trypsin) enabled a significant increase in the numbers of identified and quantifiable proteins. Thus, taking advantage of the complementarity of different proteases should be a good strategy to improve both qualitative and quantitative proteomics research.
Subjects
Isotope Labeling/methods , Liver/chemistry , Metalloendopeptidases/metabolism , Peptide Fragments/analysis , Proteome/analysis , Trypsin/metabolism , Animals , Liver/metabolism , Mice , Peptide Fragments/chemistry , Peptide Fragments/metabolism , Proteins/analysis , Proteins/chemistry , Proteins/metabolism , Proteome/chemistry , Proteome/metabolism , Proteomics/methods , Tandem Mass Spectrometry
ABSTRACT
High-throughput mass spectrometry and antibody-based experiments have begun to produce a large number of proteomic data sets. Chromosome-based visualization of these data sets and their annotations can help effectively integrate, organize, and analyze them. Therefore, we developed a web-based, user-friendly Chromosome-Assembled human Proteome browsER (CAPER). To display proteomic data sets and related annotations comprehensively, CAPER employs two distinct visualization strategies: track-view for the sequence/site information and the correspondence between proteome, transcriptome, genome, and chromosome, and heatmap-view for the qualitative and quantitative functional annotations. CAPER supports data browsing at multiple scales through Google Map-like smooth navigation, zooming, and positioning with chromosomes as the reference coordinate. Users can switch between track-view and heatmap-view, providing a high-quality user interface. Taken together, CAPER will greatly facilitate the complete annotation and functional interpretation of the human genome by proteomic approaches, thereby making a significant contribution to the Chromosome-Centric Human Proteome Project and even to human physiology/pathology research. CAPER can be accessed at http://www.bprc.ac.cn/CAPE .
Subjects
Protein Databases , Internet , Proteome , Antibodies/genetics , Antibodies/metabolism , Human Genome , Humans , Information Storage and Retrieval , Molecular Sequence Annotation , Proteome/genetics , Proteome/metabolism , Software , User-Computer Interface
ABSTRACT
Chromosome 8, a medium-length euchromatic unit in humans, has an extraordinarily high mutation rate, which can be detected not only in evolution but also in multiple mutation-related diseases, such as tumorigenesis and further invasion/metastasis. The Chromosome-Centric Human Proteome Project of China systematically profiles the proteomes of three digestive organs (i.e., stomach, colon, and liver) and their corresponding carcinoma tissues/cell lines according to a chromosome organizational roadmap. By rigorous standards, we have identified 271 (38.7%), 330 (47.1%), and 325 (46.4%) of 701 chromosome 8-coded proteins from stomach, colon, and liver samples, respectively, in Swiss-Prot and observed a total coverage rate of up to 58.9% by 413 identified proteins. Using large-scale label-free proteome quantitation, we also found some 8p deficiencies, such as the presence of 8p21-p23 in tumorigenesis of the above-described digestive organs, which is in good agreement with previous reports. To the best of our knowledge, this is the first study to have verified these 8p deficiencies at the proteome level, complementing genome and transcriptome data.
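The coverage figures quoted above follow directly from the reported counts. The short sketch below recomputes them, assuming each percentage is the identified count divided by the 701 chromosome 8-coded Swiss-Prot proteins and rounded to one decimal; the variable and function names are illustrative.

```python
TOTAL_CHR8_PROTEINS = 701  # chromosome 8-coded Swiss-Prot proteins (from the abstract)

# Identified protein counts reported in the abstract.
identified = {
    "stomach": 271,
    "colon": 330,
    "liver": 325,
    "combined": 413,
}


def coverage_percent(count: int, total: int = TOTAL_CHR8_PROTEINS) -> float:
    """Return the share of `total` covered by `count`, in percent (1 decimal)."""
    return round(100 * count / total, 1)


coverage = {sample: coverage_percent(n) for sample, n in identified.items()}
# coverage == {"stomach": 38.7, "colon": 47.1, "liver": 46.4, "combined": 58.9}
```

The combined figure is lower than the sum of the three organ figures because many proteins are identified in more than one organ.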
Subjects
Neoplastic Cell Transformation , Human Chromosome Pair 8 , Proteins , Proteome , Chromosome Deletion , Human Chromosome Pair 8/genetics , Human Chromosome Pair 8/metabolism , Colon/metabolism , Colon/pathology , Protein Databases , Gastric Mucosa/metabolism , Human Genome , Human Genome Project , Humans , Liver/metabolism , Liver/pathology , Proteins/classification , Proteins/genetics , Proteins/metabolism , Stomach/pathology
ABSTRACT
The launch of the Chromosome-Centric Human Proteome Project provides an opportunity to gain insight into the human proteome. The Chinese Human Chromosome Proteome Consortium has initiated proteomic exploration of protein-encoding genes on human chromosomes 1, 8, and 20. Collaboration within the consortium has generated a comprehensive proteome data set using normal and carcinomatous tissues from human liver, stomach, and colon and 13 cell lines originating in these organs. We identified 12,101 proteins (59.8% coverage against Swiss-Prot human entries) with a protein false discovery rate of less than 1%. On chromosome 1, 1,252 proteins mapping to 1,227 genes, representing 60.9% of Swiss-Prot entries, were identified; however, 805 proteins remain unidentified, suggesting that analysis of more diverse samples using more advanced proteomic technologies is required. Genes encoding the unidentified proteins were concentrated in seven blocks, located at p36, q12-21, and q42-44, partly consistent with correlation of these blocks with cancers of the liver, stomach, and colon. Combined transcriptome, proteome, and cofunctionality analyses confirmed 23 coexpression clusters containing 165 genes. Biological information, including chromosome structure, GC content, and protein coexpression pattern was analyzed using multilayered, circular visualization and tabular visualization. Details of data analysis and updates are available in the Chinese Chromosome-Centric Human Proteome Database ( http://proteomeview.hupo.org.cn/chromosome/ ).
Subjects
Human Chromosome Pair 1 , Proteins , Proteome , Human Chromosome Pair 1/genetics , Human Chromosome Pair 1/metabolism , Colon/metabolism , Factual Databases , Protein Databases , Gastric Mucosa/metabolism , Gene Expression , Human Genome , Human Genome Project , Humans , Liver/metabolism , Proteins/classification , Proteins/genetics , Proteins/metabolism
ABSTRACT
BACKGROUND: A large amount of liver-related physiological and pathological data exist in publicly available biological and bibliographic databases, which are usually far from comprehensive or integrated. Data collection, integration and mining processes pose a great challenge to scientific researchers and clinicians interested in the liver. METHOD: To address these problems, we constructed LiverAtlas (http://liveratlas.hupo.org.cn), a comprehensive resource of biomedical knowledge related to the liver and various hepatic diseases by incorporating 53 databases. RESULTS: In the present version, LiverAtlas covers data on liver-related genomics, transcriptomics, proteomics, metabolomics and hepatic diseases. Additionally, LiverAtlas provides a wealth of manually curated information, relevant literature citations and cross-references to other databases. Importantly, an expert-confirmed Human Liver Disease Ontology, including relevant information for 227 types of hepatic disease, has been constructed and is used to annotate LiverAtlas data. Furthermore, we have demonstrated two examples of applying LiverAtlas data to identify candidate markers for hepatocellular carcinoma (HCC) at the systems level and to develop a systems biology-based classifier by combining the differential gene expression with topological features of human protein interaction networks to enhance the ability of HCC differential diagnosis. CONCLUSION: LiverAtlas is the most comprehensive liver and hepatic disease resource, which helps biologists and clinicians to analyse their data at the systems level and will contribute much to the biomarker discovery and diagnostic performance enhancement for liver diseases.
Subjects
Factual Databases , Knowledge Bases , Liver Diseases , Liver , Systems Biology , Systems Integration , Tumor Biomarkers/genetics , Tumor Biomarkers/metabolism , Hepatocellular Carcinoma/diagnosis , Hepatocellular Carcinoma/genetics , Hepatocellular Carcinoma/metabolism , Data Mining , Genetic Databases , Differential Diagnosis , Neoplastic Gene Expression Regulation , Gene Ontology , Gene Regulatory Networks , Genetic Testing , Humans , Liver/metabolism , Liver/pathology , Liver/physiopathology , Liver Diseases/diagnosis , Liver Diseases/genetics , Liver Diseases/metabolism , Liver Diseases/physiopathology , Liver Diseases/therapy , Liver Neoplasms/diagnosis , Liver Neoplasms/genetics , Liver Neoplasms/metabolism , Predictive Value of Tests , Prognosis , Protein Interaction Maps
ABSTRACT
Research on plasma proteomics has received extensive attention, because human plasma is an important sample for disease biomarker research due to its easy clinical accessibility and richness in biological information. Plasma samples contain a large number of leaked proteins from different tissues in the body, immune proteins and communication signal proteins. However, MS signal suppression from high-abundance proteins results in a large number of proteins that are present in low abundance in plasma not being detected by the LC-MS method. This situation makes it more difficult to study neurological diseases, where tissue sampling is difficult and body fluid samples such as plasma or cerebrospinal fluid are both affected by signal suppression. A large number of methods have been developed to deeply mine plasma proteomics information; however, their application limitations remain to some extent. Traditional immuno- or affinity-based depletion, fractionation and subproteome enrichment methods cannot meet the challenges of large clinical cohort applications due to limited time efficiency. In this study, a deep mining strategy of plasma proteomics was established by combining the protein corona formed by deep mining beads (DMB beads, hereafter referred to as magnetic covalent organic frameworks Fe3O4@TpPa-1), DIA-MS detection and the DIA-NN library searching method. By optimizing the enrichment step, mass spectrometry acquisition and data processing, the evaluation results of the deep mining strategy showed the following: depth, the strategy identified and quantified 2000+ proteins per plasma sample; stability, more than 87% of the enriched low-abundance proteins had CV < 20%; accuracy, good agreement between measured and theoretical values (1.81/2, 8.68/10, 38.36/50) for the gradient addition of E. coli proteins to a plasma sample; time efficiency, the processing time was reduced from >12 h in the traditional method to <5 h (incubation 30 min, washing 15 min, reduction/alkylation/digestion/desalting 4 h), and more importantly, 96 samples can be processed simultaneously in combination with the magnetic module of the automated device. The optimal strategy enables greater enrichment of neurological disease-related proteins, including SNCA and BDNF. Finally, the deep mining strategy was applied in a pilot study of multiple system atrophy (MSA) for biomarker discovery. The results showed that a total of 215 proteins were upregulated and 184 proteins were downregulated (p < 0.05) in the MSA group compared with the healthy control group. Eighteen of these differentially expressed proteins were reported to be associated with neurological diseases or expressed specifically in brain tissue, 8 and 4 of which have reference concentrations of µg/L and ng/L, respectively. The alterations of ENPP2 and SLC2A1/Glut1 were reanalyzed by ELISA, further supporting the results of mass spectrometry. In conclusion, the results of the evaluation and application of the deep mining strategy showed promise for clinical research applications.
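The stability and accuracy figures above rest on two simple calculations, sketched here under stated assumptions: the coefficient of variation (CV) is taken as the sample standard deviation divided by the mean, and accuracy is read as the ratio of measured to theoretical value for each spike-in level. The replicate intensities are invented for illustration; only the three measured/theoretical pairs come from the abstract.

```python
from statistics import mean, stdev
from typing import Sequence

CV_LIMIT_PERCENT = 20.0  # stability criterion quoted in the abstract


def cv_percent(values: Sequence[float]) -> float:
    """Coefficient of variation in percent (sample standard deviation / mean)."""
    return 100 * stdev(values) / mean(values)


# Invented replicate intensities for one protein, for illustration only.
replicates = [100.0, 110.0, 90.0]
protein_cv = cv_percent(replicates)          # close to 10 %
is_stable = protein_cv < CV_LIMIT_PERCENT

# (measured, theoretical) pairs reported for the E. coli spike-in gradient.
spike_in = [(1.81, 2), (8.68, 10), (38.36, 50)]
recovery = [round(measured / theoretical, 3) for measured, theoretical in spike_in]
# recovery == [0.905, 0.868, 0.767]
```

Read this way, the measured values recover roughly 77 to 91% of the theoretical values across the gradient.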
Subjects
Nanostructures , Protein Corona , Humans , Proteomics/methods , Escherichia coli , Pilot Projects , Proteome/analysis , Biomarkers
ABSTRACT
Lysine crotonylation (Kcr) is an evolutionarily conserved protein post-translational modification that plays an important role in cellular physiology and pathology, such as chromatin remodeling, gene transcription regulation, telomere maintenance, inflammation, and cancer. Liquid chromatography-tandem mass spectrometry (LC-MS/MS) has been used for global Kcr profiling in humans; at the same time, many computational methods have been developed to predict Kcr sites without the high cost of experiments. Deep learning networks solve the problem of manual feature design and selection in traditional machine learning; in particular, natural language processing (NLP) algorithms, which treat peptides as sentences, can extract more in-depth information and achieve higher accuracy. In this work, we establish a Kcr prediction model named ATCLSTM-Kcr, which uses a self-attention mechanism combined with an NLP method to highlight the important features and further capture the internal correlation of the features, thereby realizing the feature enhancement and noise reduction modules of the model. Independent tests have shown that ATCLSTM-Kcr has better accuracy and robustness than similar prediction tools. Then, we design a pipeline to generate an MS-based benchmark dataset to avoid the false negatives caused by MS-detectability and improve the sensitivity of Kcr prediction. Finally, we develop a Human Lysine Crotonylation Database (HLCD), which uses ATCLSTM-Kcr and two representative deep learning models to score all lysine sites of the human proteome and annotates all Kcr sites identified by MS in the currently published literature. HLCD provides an integrated platform for human Kcr site prediction and screening through multiple prediction scores and conditions, and can be accessed on the website: www.urimarker.com/HLCD/. SIGNIFICANCE: Lysine crotonylation (Kcr) plays an important role in cellular physiology and pathology, such as chromatin remodeling, gene transcription regulation and cancer. To better elucidate the molecular mechanisms of crotonylation and reduce the high experimental cost, we establish a deep learning Kcr prediction model and solve the problem of false negatives caused by the detectability of mass spectrometry (MS). Finally, we develop a Human Lysine Crotonylation Database to score all lysine sites of the human proteome and annotate all Kcr sites identified by MS in the currently published literature. Our work provides a convenient platform for human Kcr site prediction and screening through multiple prediction scores and conditions.