Results 1 - 20 of 22
1.
Sci Rep ; 14(1): 14387, 2024 06 22.
Article in English | MEDLINE | ID: mdl-38909149

ABSTRACT

Angiogenesis is a key process for the proliferation and metastatic spread of cancer cells. Anti-angiogenic peptides (AAPs), with the capability of inhibiting angiogenesis, are promising candidates in cancer treatment. We propose AAPL, a sequence-based predictor that identifies AAPs with machine learning models of improved prediction accuracy. Each peptide sequence was transformed into a vector of 4335 numeric values according to 58 different feature types, followed by a heuristic algorithm for feature selection. Next, the hyperparameters of six machine learning models were optimized with respect to the selected feature subset. We considered two datasets, one with entire peptide sequences and the other with 15 amino acids from peptide N-termini. AAPL achieved Matthews correlation coefficients of 0.671 and 0.756 for independent tests based on the two datasets, respectively, outperforming existing predictors by a range of 5.3% to 24.6%. Further analyses show that AAPL yields higher prediction accuracy for peptides with more hydrophobic residues and fewer hydrophilic and charged residues. The source code of AAPL is available at https://github.com/yunzheng2002/Anti-angiogenic .
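As an illustration of the sequence-to-feature-vector step described above, the following is a minimal Python sketch, not the AAPL implementation: it encodes peptides with a single illustrative feature type (amino acid composition) and fits a generic classifier on made-up toy labels; the 58 feature types, heuristic feature selection, and hyperparameter optimization are omitted.

```python
# Minimal sketch (not the AAPL code): one illustrative feature type
# (amino acid composition) plus a generic classifier on toy data.
from collections import Counter

from sklearn.ensemble import RandomForestClassifier

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac_features(sequence: str) -> list[float]:
    """Amino acid composition: fraction of each residue type in the peptide."""
    counts = Counter(sequence)
    return [counts.get(aa, 0) / len(sequence) for aa in AMINO_ACIDS]

# Hypothetical toy data: (peptide, label), where 1 = anti-angiogenic.
peptides = [("KRKKSRKKG", 1), ("ALLLAIFFF", 0), ("GDEESSEED", 0), ("RRWWCRKKW", 1)]
X = [aac_features(seq) for seq, _ in peptides]
y = [label for _, label in peptides]

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(model.predict([aac_features("KKRWWRRKC")]))
```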


Subjects
Angiogenesis Inhibitors, Machine Learning, Peptides, Angiogenesis Inhibitors/chemistry, Angiogenesis Inhibitors/pharmacology, Peptides/chemistry, Peptides/pharmacology, Algorithms, Amino Acid Sequence, Humans
2.
J Med Internet Res ; 26: e48443, 2024 Jan 25.
Article in English | MEDLINE | ID: mdl-38271060

ABSTRACT

BACKGROUND: The widespread use of electronic health records in the clinical and biomedical fields makes the removal of protected health information (PHI) essential to maintain privacy. However, a significant portion of information is recorded in unstructured textual forms, posing a challenge for deidentification. In multilingual countries, medical records may be written in a mixture of more than one language, referred to as code mixing. Most current clinical natural language processing techniques are designed for monolingual text, and there is a need to address the deidentification of code-mixed text. OBJECTIVE: The aim of this study was to investigate the effectiveness and underlying mechanism of fine-tuned pretrained language models (PLMs) in identifying PHI in the code-mixed context. Additionally, we aimed to evaluate the potential of prompting large language models (LLMs) to recognize PHI in a zero-shot manner. METHODS: We compiled the first clinical code-mixed deidentification data set, consisting of text written in Chinese and English. We explored the effectiveness of fine-tuned PLMs for recognizing PHI in code-mixed content, with a focus on whether PLMs exploit naming regularity and mention coverage to achieve superior performance, by probing the developed models' outputs to examine their decision-making process. Furthermore, we investigated the potential of prompt-based in-context learning of LLMs for recognizing PHI in code-mixed text. RESULTS: The developed methods were evaluated on a code-mixed deidentification corpus of 1700 discharge summaries. We observed that different PHI types tended to occur in particular types of language-mixed sentences, and PLMs could effectively recognize PHI by exploiting the learned naming regularity. However, the models may exhibit suboptimal results when regularity is weak or mentions contain unknown words for which good representations cannot be generated. We also found that the availability of code-mixed training instances is essential for the model's performance. Furthermore, the LLM-based deidentification method was a feasible and appealing approach that can be controlled and enhanced through natural language prompts. CONCLUSIONS: This study contributes to understanding the underlying mechanism of PLMs in addressing deidentification in the code-mixed context and highlights the significance of incorporating code-mixed training instances into the model training phase. To support the advancement of research, we created a manipulated subset of the resynthesized data set and made it available for research purposes. Based on the compiled data set, we found that the LLM-based deidentification method is a feasible approach, but carefully crafted prompts are essential to avoid unwanted output. However, the use of such methods in the hospital setting requires careful consideration of data security and privacy concerns. Further research could explore the augmentation of PLMs and LLMs with external knowledge to improve their ability to recognize rare PHI.
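To make the zero-shot prompting idea concrete, here is a minimal sketch under stated assumptions: the prompt wording, the PHI categories, and the `call_llm` callable are all hypothetical placeholders, not the prompts or models used in the study.

```python
# Minimal sketch of zero-shot PHI extraction via prompting (not the study's
# prompt or model). `call_llm` is a hypothetical stand-in for any LLM client.
import json

PROMPT_TEMPLATE = """You are a clinical de-identification assistant.
Identify all protected health information (PHI) in the note below, which may
mix Chinese and English. Return a JSON list of objects with "text" and "type"
(one of: NAME, DATE, ID, LOCATION, CONTACT, AGE). Output JSON only.

Note:
{note}
"""

def extract_phi(note: str, call_llm) -> list[dict]:
    """Prompt an LLM and parse its JSON answer; return [] on malformed output."""
    raw = call_llm(PROMPT_TEMPLATE.format(note=note))
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # carefully crafted prompts reduce, but do not eliminate, unwanted output
        return []
```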


Subjects
Artificial Intelligence, Electronic Health Records, Humans, Natural Language Processing, Privacy, China
3.
Sci Rep ; 13(1): 14119, 2023 08 29.
Article in English | MEDLINE | ID: mdl-37644119

ABSTRACT

Isobaric labeling relative quantitation is one of the dominant proteomic quantitation technologies. Traditional quantitation pipelines for isobaric-labeled mass spectrometry data are based on sequence database searching. In this study, we present a novel quantitation pipeline that integrates sequence database searching, spectral library searching, and a feature-based peptide-spectrum-match (PSM) filter that uses various spectral features for filtering. The combined database and spectral library searching results in larger quantitation coverage, and the filter removes PSMs with larger quantitation errors, retaining those with higher quantitation accuracy. Quantitation results show that the proposed pipeline can improve the overall quantitation accuracy at both the PSM and protein levels. To our knowledge, this is the first study that utilizes spectral library searching to improve isobaric labeling-based quantitation. For users to conveniently run the proposed pipeline, we have implemented the feature-based filter as an executable for both Windows and Linux; the executable files, user manual, and sample data sets are freely available at https://ms.iis.sinica.edu.tw/comics/Software_FPF.html . Furthermore, with the developed filter, the proposed pipeline is fully compatible with the Trans-Proteomic Pipeline.
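The filtering idea can be sketched as a simple rule over per-PSM spectral features; the feature names and thresholds below are illustrative assumptions, not the published filter.

```python
# Minimal sketch of a feature-based PSM filter (illustrative features and
# thresholds only; not the published tool's feature set or cutoffs).
from dataclasses import dataclass

@dataclass
class PSM:
    peptide: str
    search_score: float        # identification score from database/library search
    isolation_purity: float    # fraction of precursor signal in the isolation window
    reporter_intensity: float  # summed reporter-ion intensity

def keep_psm(psm: PSM) -> bool:
    """Retain PSMs likely to give accurate reporter-ion ratios."""
    return (psm.search_score >= 0.9
            and psm.isolation_purity >= 0.75
            and psm.reporter_intensity >= 1e4)

psms = [PSM("PEPTIDEK", 0.95, 0.82, 5.2e4), PSM("LLSVAYK", 0.91, 0.40, 9.1e3)]
quantifiable = [p for p in psms if keep_psm(p)]
print(quantifiable)
```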


Subjects
Nucleic Acid Databases, Proteomics, Gene Library, Mass Spectrometry, Peptides
4.
Sci Rep ; 12(1): 2045, 2022 02 07.
Article in English | MEDLINE | ID: mdl-35132134

ABSTRACT

For identifying peptides and proteins from mass spectrometry (MS) data, spectral library searching has emerged as a complementary approach to conventional database searching. However, for the spectrum-centric analysis of data-independent acquisition (DIA) data, spectral library searching has not been widely exploited because existing spectral library search tools are mainly designed and optimized for the analysis of data-dependent acquisition (DDA) data. We present Calibr, a spectral library search tool for spectrum-centric DIA data analysis. Calibr optimizes spectrum preprocessing for pseudo MS2 spectra, generating an 8.11% increase in the number of spectrum-spectrum matches (SSMs) and a 7.49% increase in the number of peptides over the traditional preprocessing approach. When searching against the DDA-based spectral library, Calibr improves SSM numbers by 17.6-26.65% and peptide numbers by 18.45-37.31% over two state-of-the-art tools on three different data sets. Searching against the public spectral library from MassIVE, Calibr improves upon state-of-the-art tools in SSM and peptide numbers by more than 31.49% and 25.24%, respectively, for two data sets. Our analyses indicate that Calibr's higher sensitivity results from the use of various spectral similarity measures and statistical scores, coupled with machine learning-based statistical validation for FDR control. Calibr executable files, including a graphical user-interface application, are available at https://ms.iis.sinica.edu.tw/COmics/Software_CalibrWizard.html and https://sourceforge.net/projects/comics-calibr .
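One of the spectral similarity measures mentioned above can be illustrated with a normalized dot product between binned spectra; this is a generic sketch, not Calibr's scoring code, and the bin width is an arbitrary assumption.

```python
# Minimal sketch of one spectral similarity measure (cosine / normalized dot
# product) between a query pseudo-MS2 spectrum and a library spectrum.
import numpy as np

def bin_spectrum(peaks, bin_width=0.02, max_mz=2000.0):
    """peaks: iterable of (mz, intensity). Returns a fixed-length intensity vector."""
    vec = np.zeros(int(max_mz / bin_width) + 1)
    for mz, intensity in peaks:
        if mz < max_mz:
            vec[int(mz / bin_width)] += intensity
    return vec

def cosine_score(query_peaks, library_peaks) -> float:
    q, l = bin_spectrum(query_peaks), bin_spectrum(library_peaks)
    denom = np.linalg.norm(q) * np.linalg.norm(l)
    return float(q @ l / denom) if denom else 0.0

print(cosine_score([(300.1, 10.0), (500.2, 40.0)], [(300.11, 12.0), (500.2, 35.0)]))
```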


Subjects
Mass Spectrometry/methods, Peptide Library, Peptides/chemistry, Peptides/genetics, Proteins/chemistry, Proteins/genetics, Proteomics/methods, Databases as Topic, Datasets as Topic
5.
Article in English | MEDLINE | ID: mdl-36612710

ABSTRACT

The aim of the current study was to investigate the relationship between nasopharyngeal carcinoma (NPC) and dry eye disease (DED) using the National Health Insurance Research Database (NHIRD) of Taiwan. A retrospective cohort study was conducted, and patients with an NPC diagnosis were included. Next, each NPC patient was matched to four non-NPC participants by demographic data and systemic comorbidities. In total, 4184 and 16,736 participants were enrolled in the NPC and non-NPC groups, respectively. The primary outcome was the development of DED one year after the diagnosis of NPC. Cox proportional hazard regression was applied to estimate the adjusted hazard ratios (aHRs) with 95% confidence intervals (CIs) of DED. In this study, 717 and 2225 DED cases were found in the NPC and non-NPC groups, respectively, and the NPC group showed a significantly higher incidence of DED development compared to the non-NPC group (aHR: 1.45, 95% CI: 1.33-1.58, p < 0.0001) in the multivariable analysis. The other covariates positively correlated with DED development included age over 40 years, an education level higher than senior high school, hypertension, diabetes mellitus (DM), allergic pulmonary diseases, allergic otolaryngologic diseases, and allergic dermatological diseases (all p < 0.05). In conclusion, the presence of NPC is an independent risk factor for subsequent DED.
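The adjusted hazard ratios above come from Cox proportional hazards regression; the following is a toy sketch using the lifelines package, with made-up data and only two of the many covariates adjusted for in the actual NHIRD analysis.

```python
# Toy sketch of Cox proportional hazards regression (not the NHIRD analysis;
# data and covariates are illustrative only).
import pandas as pd
from lifelines import CoxPHFitter

df = pd.DataFrame({
    "follow_up_years": [3.2, 5.0, 1.4, 4.1, 2.7, 3.9, 0.8, 4.6],
    "dry_eye_event":   [1, 0, 1, 0, 0, 1, 1, 0],  # 1 = DED diagnosed during follow-up
    "npc":             [1, 0, 1, 0, 1, 0, 1, 0],  # exposure of interest
    "age_over_40":     [1, 1, 0, 1, 0, 1, 1, 0],  # one of many adjustment covariates
})

cph = CoxPHFitter()
cph.fit(df, duration_col="follow_up_years", event_col="dry_eye_event")
cph.print_summary()  # adjusted hazard ratios are exp(coef), with 95% CIs
```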


Subjects
Dry Eye Syndromes, Hypersensitivity, Nasopharyngeal Neoplasms, Humans, Adult, Nasopharyngeal Carcinoma/epidemiology, Nasopharyngeal Carcinoma/complications, Cohort Studies, Retrospective Studies, Risk Factors, Hypersensitivity/complications, Dry Eye Syndromes/epidemiology, Dry Eye Syndromes/etiology, Taiwan/epidemiology, Nasopharyngeal Neoplasms/epidemiology, Nasopharyngeal Neoplasms/complications
6.
Sci Rep ; 11(1): 2233, 2021 01 26.
Article in English | MEDLINE | ID: mdl-33500498

ABSTRACT

Mass spectrometry-based proteomics using isobaric labeling for multiplex quantitation has become a popular approach for proteomic studies. We present Multi-Q 2, an isobaric-labeling quantitation tool that yields the largest quantitation coverage and improved quantitation accuracy compared to three state-of-the-art methods. Multi-Q 2 supports identification results from several popular proteomic data analysis platforms for quantitation, offering up to 12% improvement in quantitation coverage when accepting identification results from multiple search engines, compared with MaxQuant and PatternLab. It is equipped with various quantitation algorithms, including a ratio compression correction algorithm, resulting in up to 336 algorithmic combinations. Systematic evaluation shows that different algorithmic combinations have different strengths and are suitable for different situations. We also demonstrate that the flexibility of Multi-Q 2 in customizing algorithmic combinations can lead to improved quantitation accuracy over existing tools. Moreover, the use of complementary algorithmic combinations can be an effective strategy to enhance sensitivity when searching for biomarkers among differentially expressed proteins in proteomic experiments. Multi-Q 2 provides interactive graphical interfaces to process quantitation and to display ratios at the protein, peptide, and spectrum levels. It also supports a heatmap module, enabling users to cluster proteins based on their abundance ratios and to visualize the clustering results. Multi-Q 2 executable files, sample data sets, and the user manual are freely available at http://ms.iis.sinica.edu.tw/COmics/Software_Multi-Q2.html .
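One building block of isobaric quantitation, rolling reporter-ion ratios up from PSMs to proteins, can be sketched as follows; this is a generic median roll-up, not one of Multi-Q 2's algorithmic combinations, and it omits normalization and ratio compression correction.

```python
# Minimal sketch: roll reporter-ion ratios up from PSMs to proteins by taking
# the median PSM ratio per protein (generic example, not Multi-Q 2's algorithms).
from statistics import median

def protein_ratios(psm_table):
    """psm_table: list of (protein, reporter_channel_a, reporter_channel_b) intensities."""
    by_protein = {}
    for protein, chan_a, chan_b in psm_table:
        if chan_a > 0 and chan_b > 0:
            by_protein.setdefault(protein, []).append(chan_b / chan_a)
    return {prot: median(ratios) for prot, ratios in by_protein.items()}

print(protein_ratios([("P1", 1e5, 2.1e5), ("P1", 8e4, 1.9e5), ("P2", 5e4, 4.8e4)]))
```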

7.
Cell ; 182(1): 226-244.e17, 2020 07 09.
Article in English | MEDLINE | ID: mdl-32649875

ABSTRACT

Lung cancer in East Asia is characterized by a high percentage of never-smokers, early onset and predominant EGFR mutations. To illuminate the molecular phenotype of this demographically distinct disease, we performed a deep comprehensive proteogenomic study on a prospectively collected cohort in Taiwan, representing early stage, predominantly female, non-smoking lung adenocarcinoma. Integrated genomic, proteomic, and phosphoproteomic analysis delineated the demographically distinct molecular attributes and hallmarks of tumor progression. Mutational signature analysis revealed age- and gender-related mutagenesis mechanisms, characterized by high prevalence of APOBEC mutational signature in younger females and over-representation of environmental carcinogen-like mutational signatures in older females. A proteomics-informed classification distinguished the clinical characteristics of early stage patients with EGFR mutations. Furthermore, integrated protein network analysis revealed the cellular remodeling underpinning clinical trajectories and nominated candidate biomarkers for patient stratification and therapeutic intervention. This multi-omic molecular architecture may help develop strategies for management of early stage never-smoker lung adenocarcinoma.


Subjects
Disease Progression, Lung Neoplasms/genetics, Lung Neoplasms/pathology, Proteogenomics, Smoking/genetics, Lung Adenocarcinoma/genetics, Lung Adenocarcinoma/pathology, Tumor Biomarkers/genetics, Tumor Biomarkers/metabolism, Carcinogens/toxicity, Cohort Studies, Cytosine Deaminase/metabolism, East Asia, Neoplastic Gene Expression Regulation, Gene Regulatory Networks, Human Genome, Humans, Matrix Metalloproteinases/metabolism, Mutation/genetics, Principal Component Analysis
8.
BMC Genomics ; 20(Suppl 9): 906, 2019 Dec 24.
Article in English | MEDLINE | ID: mdl-31874640

ABSTRACT

BACKGROUND: Tandem mass spectrometry allows biologists to identify and quantify protein samples in the form of digested peptide sequences. When performing peptide identification, spectral library search is more sensitive than traditional database search but is limited to peptides that have been previously identified. An accurate tandem mass spectrum prediction tool is thus crucial in expanding the peptide space and increasing the coverage of spectral library search. RESULTS: We propose MS2CNN, a non-linear regression model based on deep convolutional neural networks, a deep learning algorithm. The features for our model are amino acid composition, predicted secondary structure, and physical-chemical features such as isoelectric point, aromaticity, helicity, hydrophobicity, and basicity. MS2CNN was trained with five-fold cross validation on a three-way data split of the large-scale human HCD MS2 dataset of Orbitrap LC-MS/MS downloaded from the National Institute of Standards and Technology. It was then evaluated on a publicly available independent test dataset of human HeLa cell lysate from LC-MS experiments. On average, our model shows better cosine similarity and Pearson correlation coefficient (0.690 and 0.632) than MS2PIP (0.647 and 0.601) and is comparable with pDeep (0.692 and 0.642). Notably, for the more complex MS2 spectra of 3+ peptides, MS2CNN is significantly better than both MS2PIP and pDeep. CONCLUSIONS: We showed that MS2CNN outperforms MS2PIP for 2+ and 3+ peptides and pDeep for 3+ peptides. This implies that MS2CNN, the proposed convolutional neural network model, generates highly accurate MS2 spectra for LC-MS/MS experiments using Orbitrap machines, which can be of great help in protein and peptide identification. The results suggest that incorporating more data into the deep learning model may improve performance.
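The two agreement metrics reported above, cosine similarity and Pearson correlation between predicted and observed fragment-ion intensities, can be computed as in this small sketch (the intensity vectors are made up).

```python
# Minimal sketch of the spectrum-agreement metrics: cosine similarity and
# Pearson correlation between predicted and observed intensity vectors.
import numpy as np
from scipy.stats import pearsonr

def spectrum_agreement(predicted: np.ndarray, observed: np.ndarray):
    cos = float(predicted @ observed /
                (np.linalg.norm(predicted) * np.linalg.norm(observed)))
    r, _ = pearsonr(predicted, observed)
    return cos, float(r)

pred = np.array([0.10, 0.80, 0.05, 0.40, 0.00, 0.20])
obs = np.array([0.12, 0.75, 0.00, 0.45, 0.05, 0.18])
print(spectrum_agreement(pred, obs))
```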


Subjects
Neural Networks (Computer), Tandem Mass Spectrometry/methods, Deep Learning, HeLa Cells, Humans, Peptides/chemistry, Proteins/chemistry, Protein Sequence Analysis
9.
Sci Rep ; 9(1): 15975, 2019 11 04.
Article in English | MEDLINE | ID: mdl-31685900

ABSTRACT

N-linked glycosylation is one of the predominant post-translational modifications involved in a number of biological functions. Since experimental characterization of glycosites is challenging, glycosite prediction is crucial. Several predictors have been made available and report high performance. Most of them evaluate their performance at every asparagine in protein sequences, not confined to asparagines in the N-X-S/T sequon. In this paper, we present N-GlyDE, a two-stage prediction tool trained on rigorously constructed non-redundant datasets to predict N-linked glycosites in the human proteome. The first stage uses a protein similarity voting algorithm trained on both glycoproteins and non-glycoproteins to predict a score for a protein to improve glycosite prediction. The second stage uses a support vector machine to predict N-linked glycosites by utilizing features of gapped dipeptides, pattern-based predicted surface accessibility, and predicted secondary structure. N-GlyDE's final predictions are derived from a weight adjustment of the second-stage prediction results based on the first-stage prediction score. Evaluated on the N-X-S/T sequons of an independent dataset comprising 53 glycoproteins and 33 non-glycoproteins, N-GlyDE achieves an accuracy and MCC of 0.740 and 0.499, respectively, outperforming the compared tools. The N-GlyDE web server is available at http://bioapp.iis.sinica.edu.tw/N-GlyDE/ .
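The candidate sites that N-GlyDE scores are asparagines in N-X-S/T sequons; a minimal sketch of enumerating them is shown below (X is commonly also required not to be proline, which this regex assumes; the two-stage scoring itself is not reproduced).

```python
# Minimal sketch: enumerate candidate N-linked glycosites by scanning for the
# N-X-S/T sequon (assuming X != P); scoring of each site is not shown.
import re

SEQUON = re.compile(r"N[^P][ST]")

def candidate_glycosites(protein_seq: str) -> list[int]:
    """Return 1-based positions of asparagines in N-X-S/T sequons."""
    return [m.start() + 1 for m in SEQUON.finditer(protein_seq)]

print(candidate_glycosites("MKNCSAAANPTGGNLSQ"))
```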

10.
J Proteome Res ; 18(12): 4124-4132, 2019 12 06.
Article in English | MEDLINE | ID: mdl-31429573

ABSTRACT

When conducting proteomics experiments to detect missing proteins and protein isoforms in the human proteome, it is desirable to use a protease that can yield more unique peptides with properties amenable for mass spectrometry analysis. Though trypsin is currently the most widely used protease, some proteins can yield only a limited number of unique peptides by trypsin digestion. Other proteases and multiple proteases have been applied in reported studies to increase the number of identified proteins and protein sequence coverage. To facilitate the selection of proteases, we developed a web-based resource, called in silico Human Proteome Digestion Map (iHPDM), which contains a comprehensive proteolytic peptide database constructed from human proteins, including isoforms, in neXtProt digested by 15 protease combinations of one or two proteases. iHPDM provides convenient functions and graphical visualizations for users to examine and compare the digestion results of different proteases. Notably, it also supports users to input filtering criteria on digested peptides, e.g., peptide length and uniqueness, to select suitable proteases. iHPDM can facilitate protease selection for shotgun proteomics experiments to identify missing proteins, protein isoforms, and single amino acid variant peptides.
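As a concrete example of the in silico digestion that iHPDM precomputes, here is a minimal sketch of tryptic digestion with a length filter; missed cleavages, the other protease rules, and uniqueness checks are omitted, and the length cutoffs are illustrative.

```python
# Minimal sketch of in silico tryptic digestion: cleave C-terminal to K/R,
# but not before P; keep peptides in an MS-friendly length range (illustrative).
import re

def trypsin_digest(protein_seq: str, min_len: int = 7, max_len: int = 40) -> list[str]:
    peptides = re.split(r"(?<=[KR])(?!P)", protein_seq)
    return [p for p in peptides if min_len <= len(p) <= max_len]

print(trypsin_digest("MKWVTFISLLLLFSSAYSRGVFRRDTHKSEIAHRFK"))
```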


Subjects
Peptide Hydrolases/metabolism, Peptide Mapping/methods, Proteome/metabolism, Computer Graphics, Computer Simulation, Data Visualization, Factual Databases, ErbB Receptors/metabolism, Humans, Internet, MAP Kinase Kinase 1/metabolism, N-Acetylhexosaminyltransferases/metabolism, Protein Isoforms/metabolism, Proteomics/methods, Odorant Receptors/metabolism, User-Computer Interface, gamma-Glutamyltransferase/metabolism
11.
Anal Chem ; 91(15): 9403-9406, 2019 08 06.
Article in English | MEDLINE | ID: mdl-31305071

ABSTRACT

Protein and peptide identification and quantitation are essential tasks in proteomics research and involve a series of steps in analyzing mass spectrometry data. Trans-Proteomic Pipeline (TPP) provides a wide range of useful tools through its web interfaces for analyses such as sequence database search, statistical validation, and quantitation. To utilize the powerful functionality of TPP without the need for manual intervention to launch each step, we developed a software tool, called WinProphet, to create and automatically execute a pipeline for proteomic analyses. It seamlessly integrates with TPP and other external command-line programs, supporting various functionalities, including database search for protein and peptide identification, spectral library construction and search, data-independent acquisition (DIA) data analysis, and isobaric labeling and label-free quantitation. WinProphet is a standalone, installation-free tool with graphical interfaces for users to configure, manage, and automatically execute pipelines. The constructed pipelines can be exported as XML files with all of the parameter settings for reusability and portability. The executable files, user manual, and sample data sets of WinProphet are freely available at http://ms.iis.sinica.edu.tw/COmics/Software_WinProphet.html .
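The general idea of chaining analysis steps without manual intervention can be sketched as a simple runner over external commands; the command names below are placeholders, not WinProphet's configuration format or the actual TPP command lines.

```python
# Minimal sketch of automated pipeline execution: run external steps in order
# and stop on the first failure. Command names are placeholders only.
import subprocess

PIPELINE = [
    ["db_search_tool", "--config", "search.params", "sample.mzML"],  # placeholder
    ["validation_tool", "--input", "search_results.pep.xml"],        # placeholder
    ["quantitation_tool", "--input", "validated.pep.xml"],           # placeholder
]

def run_pipeline(steps):
    for cmd in steps:
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=True)  # raises CalledProcessError if a step fails

# run_pipeline(PIPELINE)  # uncomment once the placeholders point to real tools
```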


Subjects
Data Analysis, Proteomics/methods, Software, User-Computer Interface, Workflow
12.
Anal Chem ; 89(24): 13128-13136, 2017 12 19.
Article in English | MEDLINE | ID: mdl-29165996

ABSTRACT

Top-down proteomics using liquid chromatography coupled with mass spectrometry has been increasingly applied to analyze intact proteins for studying genetic variation, alternative splicing, and post-translational modifications (PTMs) of proteins (proteoforms). However, only a few tools have been developed for charge state deconvolution, monoisotopic/average molecular weight determination, and quantitation of proteoforms from LC-MS1 spectra. Though Decon2LS and MASH Suite Pro provide intraspectrum charge state deconvolution and quantitation, manual processing is still required to quantify proteoforms across multiple MS1 spectra. An automated tool for interspectrum quantitation is a pressing need. Thus, in this paper, we present a user-friendly tool, called iTop-Q (intelligent Top-down Proteomics Quantitation), that automatically performs large-scale proteoform quantitation based on interspectrum abundance in top-down proteomics. Instead of utilizing a single spectrum for proteoform quantitation, iTop-Q constructs extracted ion chromatograms (XICs) of possible proteoform peaks across adjacent MS1 spectra to calculate abundances for accurate quantitation. Notably, iTop-Q implements a newly proposed algorithm, called DYAMOND, that uses dynamic programming for charge state deconvolution. In addition, iTop-Q performs proteoform alignment to support quantitation analysis across replicates/samples. Performance evaluations on an in-house standard data set and a public large-scale yeast lysate data set show that iTop-Q achieves highly accurate quantitation and more consistent results than intraspectrum quantitation. Furthermore, the DYAMOND algorithm is suitable for high charge state deconvolution and can distinguish shared peaks in coeluting proteoforms. iTop-Q is publicly available for download at http://ms.iis.sinica.edu.tw/COmics/Software_iTop-Q .
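The arithmetic behind charge-state deconvolution, mapping peaks observed at different charge states to a common neutral mass, is sketched below; this simple mass-tolerance grouping is not the DYAMOND dynamic-programming algorithm, and the tolerance is an arbitrary assumption.

```python
# Minimal sketch: peaks at different charge states that map to (nearly) the
# same neutral mass are grouped as one proteoform (not the DYAMOND algorithm).
PROTON = 1.007276

def neutral_mass(mz: float, z: int) -> float:
    return mz * z - z * PROTON

def group_by_mass(peaks, tol_da=0.05):
    """peaks: list of (mz, z). Group peaks whose neutral masses agree within tol_da."""
    groups = []
    for mz, z in sorted(peaks, key=lambda p: neutral_mass(*p)):
        if groups and abs(neutral_mass(*groups[-1][-1]) - neutral_mass(mz, z)) <= tol_da:
            groups[-1].append((mz, z))
        else:
            groups.append([(mz, z)])
    return groups  # each group approximates one proteoform seen at several charges

print(group_by_mass([(801.007, 15), (858.150, 14), (751.007, 16)]))
```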


Subjects
Algorithms, Proteins/analysis, Proteomics, Liquid Chromatography, Mass Spectrometry
13.
Genome Biol ; 17(1): 184, 2016 09 07.
Article in English | MEDLINE | ID: mdl-27604469

ABSTRACT

BACKGROUND: A major bottleneck in our understanding of the molecular underpinnings of life is the assignment of function to proteins. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted purview have led to an increasing role for computational function prediction. However, assessing methods for protein function prediction and tracking progress in the field remain challenging. RESULTS: We conducted the second critical assessment of functional annotation (CAFA), a timed challenge to assess computational methods that automatically assign protein function. We evaluated 126 methods from 56 research groups for their ability to predict biological functions using the Gene Ontology and gene-disease associations using the Human Phenotype Ontology on a set of 3681 proteins from 18 species. CAFA2 featured expanded analysis compared with CAFA1 with regard to data set size, variety, and assessment metrics. To review progress in the field, the analysis compared the best methods from CAFA1 to those of CAFA2. CONCLUSIONS: The top-performing methods in CAFA2 outperformed those from CAFA1. This increased accuracy can be attributed to a combination of the growing number of experimental annotations and improved methods for function prediction. The assessment also revealed that the definition of top-performing algorithms is ontology specific, that different performance metrics can be used to probe the nature of accurate predictions, and that predictions are relatively diverse in the biological process and human phenotype ontologies. While there was methodological improvement between CAFA1 and CAFA2, the interpretation of results and usefulness of individual methods remain context-dependent.


Subjects
Computational Biology, Proteins/chemistry, Software, Structure-Activity Relationship, Algorithms, Protein Databases, Gene Ontology, Humans, Molecular Sequence Annotation, Proteins/genetics
14.
Nucleic Acids Res ; 44(W1): W575-80, 2016 Jul 08.
Article in English | MEDLINE | ID: mdl-27084943

ABSTRACT

MAGIC-web is, to the best of our knowledge, the first web server that performs both untargeted and targeted analyses of mass spectrometry-based glycoproteomics data for site-specific N-linked glycoprotein identification. The first two modules, MAGIC and MAGIC+, are designed for untargeted and targeted analysis, respectively. MAGIC implements our previously proposed novel Y1-ion pattern matching method, which detects Y1- and Y0-ions without prior information about proteins and glycans, and then generates in silico MS(2) spectra that serve as input to a database search engine (e.g., Mascot) to search against a large-scale protein sequence database. On top of that, the newly implemented MAGIC+ allows users to determine glycopeptide sequences using their own protein sequence file. The third module, Reports Integrator, combines protein identification results from Mascot and glycan-related information from MAGIC-web to generate a complete site-specific protein-glycan summary report. The last module, Glycan Search, is designed for users who are interested in finding possible glycan structures with specific numbers and types of monosaccharides. The results from MAGIC, MAGIC+, and Reports Integrator can be downloaded via the provided links, whereas the annotated spectra and glycan structures can be visualized in the browser. MAGIC-web is accessible at http://ms.iis.sinica.edu.tw/MAGIC-web/index.html.


Subjects
Glycoproteins/analysis, Glycoproteins/chemistry, Internet, Polysaccharides/analysis, Polysaccharides/chemistry, Software, Computer Simulation, Protein Databases, Glycopeptides/analysis, Glycopeptides/chemistry, Humans, Mass Spectrometry, Proteomics, Search Engine, User-Computer Interface, Web Browser
15.
PLoS One ; 11(1): e0146112, 2016.
Article in English | MEDLINE | ID: mdl-26784691

ABSTRACT

Efficient and accurate quantitation of metabolites from LC-MS data has become an important topic. Here we present an automated tool, called iMet-Q (intelligent Metabolomic Quantitation), for label-free metabolomics quantitation from high-throughput MS1 data. By performing peak detection and peak alignment, iMet-Q provides a summary of quantitation results and reports ion abundance at both replicate level and sample level. Furthermore, it gives the charge states and isotope ratios of detected metabolite peaks to facilitate metabolite identification. An in-house standard mixture and a public Arabidopsis metabolome data set were analyzed by iMet-Q. Three public quantitation tools, including XCMS, MetAlign, and MZmine 2, were used for performance comparison. From the mixture data set, seven standard metabolites were detected by the four quantitation tools, for which iMet-Q had a smaller quantitation error of 12% in both profile and centroid data sets. Our tool also correctly determined the charge states of seven standard metabolites. By searching the mass values for those standard metabolites against Human Metabolome Database, we obtained a total of 183 metabolite candidates. With the isotope ratios calculated by iMet-Q, 49% (89 out of 183) metabolite candidates were filtered out. From the public Arabidopsis data set reported with two internal standards and 167 elucidated metabolites, iMet-Q detected all of the peaks corresponding to the internal standards and 167 metabolites. Meanwhile, our tool had small abundance variation (≤ 0.19) when quantifying the two internal standards and had higher abundance correlation (≥ 0.92) when quantifying the 167 metabolites. iMet-Q provides user-friendly interfaces and is publicly available for download at http://ms.iis.sinica.edu.tw/comics/Software_iMet-Q.html.
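The way an isotope ratio can prune metabolite candidates, as described above, can be illustrated with the approximate relationship between the M+1/M ratio and carbon count (about 1.07% per carbon from 13C natural abundance); the tolerance and candidate formulas below are illustrative, not iMet-Q's implementation.

```python
# Minimal sketch: use the observed M+1/M isotope ratio to keep only metabolite
# candidates with a consistent carbon count (~1.07% per carbon from 13C).
def plausible_carbon_counts(observed_m1_ratio: float, tol: float = 0.01) -> range:
    per_carbon = 0.0107
    low = max(1, int((observed_m1_ratio - tol) / per_carbon))
    high = int((observed_m1_ratio + tol) / per_carbon) + 1
    return range(low, high)

candidates = {"C6H12O6": 6, "C9H11NO2": 9, "C21H27N7O14P2": 21}  # formula -> carbons
observed_ratio = 0.065  # M+1 intensity is 6.5% of the monoisotopic peak
kept = [formula for formula, n_c in candidates.items()
        if n_c in plausible_carbon_counts(observed_ratio)]
print(kept)
```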


Subjects
Metabolome, Metabolomics/methods, Software, Arabidopsis/metabolism, Humans
16.
J Proteome Res ; 14(12): 5396-407, 2015 Dec 04.
Article in English | MEDLINE | ID: mdl-26549055

ABSTRACT

Experimental evidence at the protein level from mass spectrometry and antibody experiments is essential to characterize the human proteome. neXtProt (2014-09 release) reported 20 055 human proteins, including 16 491 proteins identified at the protein level and 3564 unidentified proteins. Excluding 616 proteins at the uncertain level, 2948 proteins were regarded as missing proteins. Missing proteins remain unidentified partly due to MS limitations and partly due to intrinsic properties of the proteins, for example, expression only in specific diseases or tissues. Despite such reasons, it is desirable to explore the issues affecting validation of missing proteins in an "ideal" shotgun analysis of the human proteome. We thus performed in silico digestions on the human proteins to generate all in silico fully digested peptides. With these presumed peptides, we investigated the identification of proteins without any unique peptide, the effect of sequence variants on protein identification, difficulties in identifying olfactory receptors, and highly similar proteins. Among all proteins with evidence at the transcript level, G protein-coupled receptors and olfactory receptors, based on InterPro classification, were the largest protein families and exhibited more frequent variants. To identify missing proteins, the above analyses suggested including sequence variants in the protein FASTA file for database searching. Furthermore, evidence of unique peptides identified from MS experiments would be crucial for experimentally validating missing proteins.
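The uniqueness check central to this analysis, whether an in silico peptide maps to exactly one protein entry, can be sketched as follows (the digestion itself is as in the trypsin sketch earlier; the toy peptides here are made up and sequence variants are not handled).

```python
# Minimal sketch: a peptide is "unique" if it occurs in exactly one protein
# entry of the in silico digested proteome (toy input; variants not handled).
from collections import defaultdict

def unique_peptides(digested: dict) -> dict:
    """digested: protein accession -> set of its in silico peptides.
    Returns accession -> peptides found in no other protein."""
    peptide_owners = defaultdict(set)
    for accession, peptides in digested.items():
        for pep in peptides:
            peptide_owners[pep].add(accession)
    return {acc: [p for p in peps if len(peptide_owners[p]) == 1]
            for acc, peps in digested.items()}

print(unique_peptides({"P1": {"AAK", "LLGK", "TTR"}, "P2": {"AAK", "QQWK"}}))
```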


Subjects
Proteomics/methods, Amino Acid Sequence, Annexins/chemistry, Annexins/genetics, Computational Biology/methods, Computer Simulation, Protein Databases, Genetic Variation, Humans, Hydrophobic and Hydrophilic Interactions, Mass Spectrometry, Molecular Sequence Annotation, Molecular Sequence Data, Peptide Fragments/chemistry, Peptide Fragments/genetics, Peptide Fragments/isolation & purification, Proteolysis, Proteome/chemistry, Proteome/genetics, Proteome/isolation & purification, Proteomics/statistics & numerical data, Odorant Receptors/chemistry, Odorant Receptors/genetics, Odorant Receptors/isolation & purification
17.
PLoS One ; 7(7): e40846, 2012.
Article in English | MEDLINE | ID: mdl-22848404

ABSTRACT

Non-covalent protein-carbohydrate interactions mediate molecular targeting in many biological processes. Prediction of non-covalent carbohydrate binding sites on protein surfaces not only provides insights into the functions of the query proteins; information on key carbohydrate-binding residues could also suggest site-directed mutagenesis experiments, support the design of therapeutics targeting carbohydrate-binding proteins, and provide guidance in engineering protein-carbohydrate interactions. In this work, we show that non-covalent carbohydrate binding sites on protein surfaces can be predicted with relatively high accuracy when the query protein structures are known. The prediction capabilities were based on a novel encoding scheme of three-dimensional probability density maps describing the distributions of 36 non-covalent interacting atom types around protein surfaces. One machine learning model was trained for each of the 30 protein atom types. The machine learning algorithms predicted tentative carbohydrate binding sites on query proteins by recognizing the characteristic interacting atom distribution patterns specific to carbohydrate binding sites in known protein structures. The prediction results for all protein atom types were integrated into surface patches as tentative carbohydrate binding sites based on normalized prediction confidence levels. The prediction capabilities of the predictors were benchmarked by a 10-fold cross validation on 497 non-redundant proteins with known carbohydrate binding sites. The predictors were further tested on an independent test set with 108 proteins. The residue-based Matthews correlation coefficient (MCC) for the independent test was 0.45, with prediction precision and sensitivity (or recall) of 0.45 and 0.49, respectively. In addition, 111 unbound carbohydrate-binding protein structures, for which the structures were determined in the absence of the carbohydrate ligands, were predicted with the trained predictors. The overall prediction MCC was 0.49. Independent tests on anti-carbohydrate antibodies showed that the carbohydrate antigen binding sites were predicted with comparable accuracy. These results demonstrate that the predictors are among the best in carbohydrate binding site prediction to date.
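The residue-level evaluation quoted above (MCC, precision, sensitivity) follows the standard confusion-matrix formulas, sketched here on generic binary labels; this is the metric computation only, not the binding-site predictor.

```python
# Minimal sketch of residue-level binary metrics: Matthews correlation
# coefficient, precision, and sensitivity (recall) from 0/1 labels.
import math

def binary_metrics(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) or 1.0
    mcc = (tp * tn - fp * fn) / denom
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    return mcc, precision, sensitivity

print(binary_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0]))
```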


Subjects
Artificial Intelligence, Carbohydrates/chemistry, Protein Databases, Molecular Models, Proteins/chemistry, Protein Sequence Analysis, Binding Sites, Proteins/genetics
18.
PLoS One ; 7(6): e37706, 2012.
Article in English | MEDLINE | ID: mdl-22701576

ABSTRACT

Protein-protein interactions are key to many biological processes. Computational methodologies devised to predict protein-protein interaction (PPI) sites on protein surfaces are important tools in providing insights into the biological functions of proteins and in developing therapeutics targeting the protein-protein interaction sites. One of the general features of PPI sites is that the core regions from the two interacting protein surfaces are complementary to each other, similar to the interior of proteins in packing density and in the physicochemical nature of the amino acid composition. In this work, we simulated the physicochemical complementarities by constructing three-dimensional probability density maps of non-covalent interacting atoms on the protein surfaces. The interacting probabilities were derived from the interior of known structures. Machine learning algorithms were applied to learn the characteristic patterns of the probability density maps specific to the PPI sites. The trained predictors for PPI sites were cross-validated with the training cases (consisting of 432 proteins) and were tested on an independent dataset (consisting of 142 proteins). The residue-based Matthews correlation coefficient for the independent test set was 0.423; the accuracy, precision, sensitivity, specificity were 0.753, 0.519, 0.677, and 0.779 respectively. The benchmark results indicate that the optimized machine learning models are among the best predictors in identifying PPI sites on protein surfaces. In particular, the PPI site prediction accuracy increases with increasing size of the PPI site and with increasing hydrophobicity in amino acid composition of the PPI interface; the core interface regions are more likely to be recognized with high prediction confidence. The results indicate that the physicochemical complementarity patterns on protein surfaces are important determinants in PPIs, and a substantial portion of the PPI sites can be predicted correctly with the physicochemical complementarity features based on the non-covalent interaction data derived from protein interiors.


Subjects
Amino Acids/chemistry, Computational Biology/methods, Chemical Models, Molecular Models, Protein Interaction Mapping/methods, Proteins/chemistry, Algorithms, Artificial Intelligence, Computer Simulation, Neural Networks (Computer), Probability, Statistical Distributions, Nonparametric Statistics
19.
PLoS One ; 7(3): e33340, 2012.
Article in English | MEDLINE | ID: mdl-22457753

ABSTRACT

Protein-protein interactions are critical determinants in biological systems. Engineered proteins binding to specific areas on protein surfaces could lead to therapeutics or diagnostics for treating diseases in humans. But designing epitope-specific protein-protein interactions with computational atomistic interaction free energies remains a difficult challenge. Here we show that, with the antibody-VEGF (vascular endothelial growth factor) interaction as a model system, the experimentally observed amino acid preferences in the antibody-antigen interface can be rationalized with 3-dimensional distributions of interacting atoms derived from the database of protein structures. Machine learning models established on this rationalization can be generalized to design amino acid preferences in antibody-antigen interfaces, for which experimental validation is tractable with current high-throughput synthetic antibody display technologies. Leave-one-out cross validation on the benchmark system yielded accuracy, precision, recall (sensitivity), and specificity of 0.69, 0.45, 0.63, and 0.71, respectively, for the overall binary predictions, and the overall Matthews correlation coefficient across the 20 amino acid types in the 24 interface CDR positions was 0.312. The structure-based computational antibody design methodology was further tested with other antibodies binding to VEGF. The results indicate that the methodology could provide alternatives to current antibody technologies based on animal immune systems in engineering therapeutic and diagnostic antibodies against predetermined antigen epitopes.


Subjects
Antigen-Antibody Reactions, Complementarity Determining Regions, Artificial Intelligence, Antibody Binding Sites, X-Ray Crystallography, Humans, Molecular Models, Reproducibility of Results, Single-Chain Antibodies/chemistry, Single-Chain Antibodies/immunology, Vascular Endothelial Growth Factor A/immunology
20.
BMC Bioinformatics ; 10 Suppl 15: S8, 2009 Dec 03.
Article in English | MEDLINE | ID: mdl-19958518

ABSTRACT

BACKGROUND: The study of protein subcellular localization (PSL) is important for elucidating protein functions involved in various cellular processes. However, determining the localization sites of a protein through wet-lab experiments can be time-consuming and labor-intensive. Thus, computational approaches become highly desirable. Most PSL prediction systems are established for single-localized proteins. However, a significant number of eukaryotic proteins are known to be localized to multiple subcellular organelles. Many studies have shown that proteins may simultaneously locate in or move between different cellular compartments and be involved in different biological processes with different roles. RESULTS: In this study, we propose a knowledge-based method, called KnowPredsite, to predict the localization site(s) of both single-localized and multi-localized proteins. Based on local similarity, we can identify "related sequences" for prediction. We construct a knowledge base to record the possible sequence variations for protein sequences. When predicting the localization annotation of a query protein, we search against the knowledge base and use a scoring mechanism to determine the predicted sites. We downloaded the dataset from ngLOC, which consists of ten distinct subcellular organelles from 1923 species, and performed ten-fold cross validation experiments to evaluate KnowPredsite's performance. The experiment results show that KnowPredsite achieves higher prediction accuracy than ngLOC and the Blast-hit method. For single-localized proteins, the overall accuracy of KnowPredsite is 91.7%. For multi-localized proteins, the overall accuracy of KnowPredsite is 72.1%, which is significantly higher than that of ngLOC by 12.4%. Notably, half of the proteins in the dataset that cannot find any Blast hit sequence above a specified threshold can still be correctly predicted by KnowPredsite. CONCLUSION: KnowPredsite demonstrates the power of identifying related sequences in the knowledge base. The experiment results show that even when sequence similarity is low, local similarity is effective for prediction. The results also show that KnowPredsite is a highly accurate prediction method for both single- and multi-localized proteins. It is worth mentioning that the prediction process of KnowPredsite is transparent and biologically interpretable, showing the set of template sequences used to generate the prediction result. The KnowPredsite prediction server is available at http://bio-cluster.iis.sinica.edu.tw/kbloc/.
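The knowledge-based idea, scoring candidate localization sites by a weighted vote over the annotations of related sequences retrieved through local similarity, can be sketched as below; the retrieval step and KnowPredsite's actual scoring mechanism are not reproduced, and the threshold is an illustrative assumption.

```python
# Minimal sketch of knowledge-based localization prediction: weighted voting
# over the annotated sites of similarity-retrieved related sequences.
from collections import defaultdict

def predict_sites(related_hits, threshold=0.3):
    """related_hits: list of (similarity_score, [annotated_sites])."""
    votes = defaultdict(float)
    total = sum(score for score, _ in related_hits) or 1.0
    for score, sites in related_hits:
        for site in sites:
            votes[site] += score / total
    # keep every site above threshold, allowing multi-localized predictions
    return sorted((s for s, v in votes.items() if v >= threshold),
                  key=lambda s: -votes[s])

hits = [(0.9, ["nucleus"]), (0.7, ["nucleus", "cytoplasm"]), (0.2, ["mitochondrion"])]
print(predict_sites(hits))
```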


Subjects
Computational Biology/methods, Proteins/analysis, Proteins/chemistry, Protein Databases, Eukaryota, Knowledge Bases, Protein Sequence Analysis/methods, Software