Results 1 - 20 of 23
1.
Cell ; 182(1): 226-244.e17, 2020 07 09.
Article in English | MEDLINE | ID: mdl-32649875

ABSTRACT

Lung cancer in East Asia is characterized by a high percentage of never-smokers, early onset and predominant EGFR mutations. To illuminate the molecular phenotype of this demographically distinct disease, we performed a deep comprehensive proteogenomic study on a prospectively collected cohort in Taiwan, representing early stage, predominantly female, non-smoking lung adenocarcinoma. Integrated genomic, proteomic, and phosphoproteomic analysis delineated the demographically distinct molecular attributes and hallmarks of tumor progression. Mutational signature analysis revealed age- and gender-related mutagenesis mechanisms, characterized by high prevalence of APOBEC mutational signature in younger females and over-representation of environmental carcinogen-like mutational signatures in older females. A proteomics-informed classification distinguished the clinical characteristics of early stage patients with EGFR mutations. Furthermore, integrated protein network analysis revealed the cellular remodeling underpinning clinical trajectories and nominated candidate biomarkers for patient stratification and therapeutic intervention. This multi-omic molecular architecture may help develop strategies for management of early stage never-smoker lung adenocarcinoma.


Subject(s)
Disease Progression; Lung Neoplasms/genetics; Lung Neoplasms/pathology; Proteogenomics; Smoking/genetics; Adenocarcinoma of Lung/genetics; Adenocarcinoma of Lung/pathology; Biomarkers, Tumor/genetics; Biomarkers, Tumor/metabolism; Carcinogens/toxicity; Cohort Studies; Cytosine Deaminase/metabolism; East Asia; Gene Expression Regulation, Neoplastic; Gene Regulatory Networks; Genome, Human; Humans; Matrix Metalloproteinases/metabolism; Mutation/genetics; Principal Component Analysis
2.
J Med Internet Res ; 26: e48443, 2024 Jan 25.
Article in English | MEDLINE | ID: mdl-38271060

ABSTRACT

BACKGROUND: The widespread use of electronic health records in the clinical and biomedical fields makes the removal of protected health information (PHI) essential to maintain privacy. However, a significant portion of information is recorded in unstructured textual forms, posing a challenge for deidentification. In multilingual countries, medical records could be written in a mixture of more than one language, referred to as code mixing. Most current clinical natural language processing techniques are designed for monolingual text, and there is a need to address the deidentification of code-mixed text. OBJECTIVE: The aim of this study was to investigate the effectiveness and underlying mechanism of fine-tuned pretrained language models (PLMs) in identifying PHI in the code-mixed context. Additionally, we aimed to evaluate the potential of prompting large language models (LLMs) for recognizing PHI in a zero-shot manner. METHODS: We compiled the first clinical code-mixed deidentification data set consisting of text written in Chinese and English. We explored the effectiveness of fine-tuned PLMs for recognizing PHI in code-mixed content, with a focus on whether PLMs exploit naming regularity and mention coverage to achieve superior performance, by probing the developed models' outputs to examine their decision-making process. Furthermore, we investigated the potential of prompt-based in-context learning of LLMs for recognizing PHI in code-mixed text. RESULTS: The developed methods were evaluated on a code-mixed deidentification corpus of 1700 discharge summaries. We observed that different PHI types had preferences in their occurrences within the different types of language-mixed sentences, and PLMs could effectively recognize PHI by exploiting the learned name regularity. However, the models may exhibit suboptimal results when regularity is weak or mentions contain unknown words that the representations cannot generate well. We also found that the availability of code-mixed training instances is essential for the model's performance. Furthermore, the LLM-based deidentification method was a feasible and appealing approach that can be controlled and enhanced through natural language prompts. CONCLUSIONS: The study contributes to understanding the underlying mechanism of PLMs in addressing the deidentification process in the code-mixed context and highlights the significance of incorporating code-mixed training instances into the model training phase. To support the advancement of research, we created a manipulated subset of the resynthesized data set available for research purposes. Based on the compiled data set, we found that the LLM-based deidentification method is a feasible approach, but carefully crafted prompts are essential to avoid unwanted output. However, the use of such methods in the hospital setting requires careful consideration of data security and privacy concerns. Further research could explore the augmentation of PLMs and LLMs with external knowledge to improve their strength in recognizing rare PHI.


Subject(s)
Artificial Intelligence; Electronic Health Records; Humans; Natural Language Processing; Privacy; China
3.
J Proteome Res ; 18(12): 4124-4132, 2019 12 06.
Article in English | MEDLINE | ID: mdl-31429573

ABSTRACT

When conducting proteomics experiments to detect missing proteins and protein isoforms in the human proteome, it is desirable to use a protease that can yield more unique peptides with properties amenable for mass spectrometry analysis. Though trypsin is currently the most widely used protease, some proteins can yield only a limited number of unique peptides by trypsin digestion. Other proteases and multiple proteases have been applied in reported studies to increase the number of identified proteins and protein sequence coverage. To facilitate the selection of proteases, we developed a web-based resource, called in silico Human Proteome Digestion Map (iHPDM), which contains a comprehensive proteolytic peptide database constructed from human proteins, including isoforms, in neXtProt digested by 15 protease combinations of one or two proteases. iHPDM provides convenient functions and graphical visualizations for users to examine and compare the digestion results of different proteases. Notably, it also supports users to input filtering criteria on digested peptides, e.g., peptide length and uniqueness, to select suitable proteases. iHPDM can facilitate protease selection for shotgun proteomics experiments to identify missing proteins, protein isoforms, and single amino acid variant peptides.
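
Illustrative note on the digestion and filtering step described in this abstract: the sketch below is not iHPDM's code. It assumes a simplified trypsin rule (cleave after K or R unless followed by P, no missed cleavages), and the function names, the default length window, and the toy sequences are hypothetical. It shows how peptide length and uniqueness filters could be applied to compare what a protease yields for each protein.

```python
from collections import defaultdict


def digest_trypsin(sequence, min_len=7, max_len=30):
    """Fully digest a protein with a simplified trypsin rule (assumption):
    cleave after K or R unless the next residue is P; no missed cleavages."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        next_aa = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide
    # Keep only peptides inside the requested length window.
    return [p for p in peptides if min_len <= len(p) <= max_len]


def unique_peptides(proteins, **length_filter):
    """Map each protein name to the peptides that occur in no other protein."""
    owners = defaultdict(set)
    for name, seq in proteins.items():
        for pep in digest_trypsin(seq, **length_filter):
            owners[pep].add(name)
    result = {name: [] for name in proteins}
    for pep, names in owners.items():
        if len(names) == 1:
            result[next(iter(names))].append(pep)
    return result


if __name__ == "__main__":
    toy = {"P1": "AAAAKCCCCRDDDD", "P2": "AAAAKEEEER"}  # hypothetical sequences
    print(unique_peptides(toy, min_len=4))
```

A protein whose list comes back empty under one protease is a candidate for trying a different protease or a combination of two.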


Subject(s)
Peptide Hydrolases/metabolism; Peptide Mapping/methods; Proteome/metabolism; Computer Graphics; Computer Simulation; Data Visualization; Databases, Factual; ErbB Receptors/metabolism; Humans; Internet; MAP Kinase Kinase 1/metabolism; N-Acetylhexosaminyltransferases/metabolism; Protein Isoforms/metabolism; Proteomics/methods; Receptors, Odorant/metabolism; User-Computer Interface; gamma-Glutamyltransferase/metabolism
4.
BMC Genomics ; 20(Suppl 9): 906, 2019 Dec 24.
Article in English | MEDLINE | ID: mdl-31874640

ABSTRACT

BACKGROUND: Tandem mass spectrometry allows biologists to identify and quantify protein samples in the form of digested peptide sequences. When performing peptide identification, spectral library search is more sensitive than traditional database search but is limited to peptides that have been previously identified. An accurate tandem mass spectrum prediction tool is thus crucial in expanding the peptide space and increasing the coverage of spectral library search. RESULTS: We propose MS2CNN, a non-linear regression model based on deep convolutional neural networks, a deep learning algorithm. The features for our model are amino acid composition, predicted secondary structure, and physical-chemical features such as isoelectric point, aromaticity, helicity, hydrophobicity, and basicity. MS2CNN was trained with five-fold cross validation on a three-way data split on the large-scale human HCD MS2 dataset of Orbitrap LC-MS/MS downloaded from the National Institute of Standards and Technology. It was then evaluated on a publicly available independent test dataset of human HeLa cell lysate from LC-MS experiments. On average, our model shows better cosine similarity and Pearson correlation coefficient (0.690 and 0.632) than MS2PIP (0.647 and 0.601) and is comparable with pDeep (0.692 and 0.642). Notably, for the more complex MS2 spectra of 3+ peptides, MS2CNN is significantly better than both MS2PIP and pDeep. CONCLUSIONS: We showed that MS2CNN outperforms MS2PIP for 2+ and 3+ peptides and pDeep for 3+ peptides. This implies that MS2CNN, the proposed convolutional neural network model, generates highly accurate MS2 spectra for LC-MS/MS experiments using Orbitrap machines, which can be of great help in protein and peptide identifications. The results suggest that incorporating more data for the deep learning model may improve performance.
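
Illustrative note on the two evaluation metrics named in this abstract: the sketch below shows how cosine similarity and the Pearson correlation coefficient could be computed between a predicted and an observed fragment-intensity vector. It assumes both vectors are already aligned ion by ion. The function names and example intensities are hypothetical and are not taken from MS2CNN.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length intensity vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def pearson_correlation(x, y):
    """Pearson r, computed as the cosine similarity of mean-centered vectors."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    centered_x = [a - mean_x for a in x]
    centered_y = [b - mean_y for b in y]
    return cosine_similarity(centered_x, centered_y)


if __name__ == "__main__":
    predicted = [0.10, 0.55, 0.00, 0.35]  # hypothetical normalized intensities
    observed = [0.12, 0.50, 0.03, 0.35]
    print(cosine_similarity(predicted, observed))
    print(pearson_correlation(predicted, observed))
```

The two metrics differ only in the centering step, which is why they tend to rank spectrum predictors similarly but not identically.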


Subject(s)
Neural Networks, Computer; Tandem Mass Spectrometry/methods; Deep Learning; HeLa Cells; Humans; Peptides/chemistry; Proteins/chemistry; Sequence Analysis, Protein
5.
Anal Chem ; 91(15): 9403-9406, 2019 08 06.
Article in English | MEDLINE | ID: mdl-31305071

ABSTRACT

Protein and peptide identification and quantitation are essential tasks in proteomics research and involve a series of steps in analyzing mass spectrometry data. Trans-Proteomic Pipeline (TPP) provides a wide range of useful tools through its web interfaces for analyses such as sequence database search, statistical validation, and quantitation. To utilize the powerful functionality of TPP without the need for manual intervention to launch each step, we developed a software tool, called WinProphet, to create and automatically execute a pipeline for proteomic analyses. It seamlessly integrates with TPP and other external command-line programs, supporting various functionalities, including database search for protein and peptide identification, spectral library construction and search, data-independent acquisition (DIA) data analysis, and isobaric labeling and label-free quantitation. WinProphet is a standalone, installation-free tool with graphical interfaces for users to configure, manage, and automatically execute pipelines. The constructed pipelines can be exported as XML files with all of the parameter settings for reusability and portability. The executable files, user manual, and sample data sets of WinProphet are freely available at  http://ms.iis.sinica.edu.tw/COmics/Software_WinProphet.html .


Subject(s)
Data Analysis; Proteomics/methods; Software; User-Computer Interface; Workflow
6.
Nucleic Acids Res ; 44(W1): W575-80, 2016 Jul 08.
Article in English | MEDLINE | ID: mdl-27084943

ABSTRACT

MAGIC-web is the first web server, to the best of our knowledge, that performs both untargeted and targeted analyses of mass spectrometry-based glycoproteomics data for site-specific N-linked glycoprotein identification. The first two modules, MAGIC and MAGIC+, are designed for untargeted and targeted analysis, respectively. MAGIC is implemented with our previously proposed novel Y1-ion pattern matching method, which adequately detects Y1- and Y0-ion without prior information of proteins and glycans, and then generates in silico MS(2) spectra that serve as input to a database search engine (e.g. Mascot) to search against a large-scale protein sequence database. On top of that, the newly implemented MAGIC+ allows users to determine glycopeptide sequences using their own protein sequence file. The third module, Reports Integrator, provides the service of combining protein identification results from Mascot and glycan-related information from MAGIC-web to generate a complete site-specific protein-glycan summary report. The last module, Glycan Search, is designed for the users who are interested in finding possible glycan structures with specific numbers and types of monosaccharides. The results from MAGIC, MAGIC+ and Reports Integrator can be downloaded via provided links whereas the annotated spectra and glycan structures can be visualized in the browser. MAGIC-web is accessible from http://ms.iis.sinica.edu.tw/MAGIC-web/index.html.


Subject(s)
Glycoproteins/analysis; Glycoproteins/chemistry; Internet; Polysaccharides/analysis; Polysaccharides/chemistry; Software; Computer Simulation; Databases, Protein; Glycopeptides/analysis; Glycopeptides/chemistry; Humans; Mass Spectrometry; Proteomics; Search Engine; User-Computer Interface; Web Browser
7.
Anal Chem ; 89(24): 13128-13136, 2017 12 19.
Article in English | MEDLINE | ID: mdl-29165996

ABSTRACT

Top-down proteomics using liquid chromatography coupled with mass spectrometry has been increasingly applied for analyzing intact proteins to study genetic variation, alternative splicing, and post-translational modifications (PTMs) of the proteins (proteoforms). However, only a few tools have been developed for charge state deconvolution, monoisotopic/average molecular weight determination and quantitation of proteoforms from LC-MS1 spectra. Though Decon2LS and MASH Suite Pro have been available to provide intraspectrum charge state deconvolution and quantitation, manual processing is still required to quantify proteoforms across multiple MS1 spectra. An automated tool for interspectrum quantitation is a pressing need. Thus, in this paper, we present a user-friendly tool, called iTop-Q (intelligent Top-down Proteomics Quantitation), that automatically performs large-scale proteoform quantitation based on interspectrum abundance in top-down proteomics. Instead of utilizing a single spectrum for proteoform quantitation, iTop-Q constructs extracted ion chromatograms (XICs) of possible proteoform peaks across adjacent MS1 spectra to calculate abundances for accurate quantitation. Notably, iTop-Q is implemented with a newly proposed algorithm, called DYAMOND, using dynamic programming for charge state deconvolution. In addition, iTop-Q performs proteoform alignment to support quantitation analysis across replicates/samples. The performance evaluations on an in-house standard data set and a public large-scale yeast lysate data set show that iTop-Q achieves highly accurate quantitation that is more consistent than intraspectrum quantitation. Furthermore, the DYAMOND algorithm is suitable for high charge state deconvolution and can distinguish shared peaks in coeluting proteoforms. iTop-Q is publicly available for download at http://ms.iis.sinica.edu.tw/COmics/Software_iTop-Q .
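
Illustrative note on interspectrum quantitation: the sketch below shows one generic way an extracted ion chromatogram (XIC) could be built across adjacent MS1 spectra and integrated into an abundance value. It is not iTop-Q's implementation and does not reproduce DYAMOND. The data layout, the 10 ppm tolerance, summing all peaks inside the tolerance window, and trapezoidal integration are all assumptions made for the example.

```python
def extract_xic(spectra, target_mz, ppm=10.0):
    """Build an XIC as (retention_time, intensity) pairs.

    spectra: list of (retention_time, [(mz, intensity), ...]) sorted by time
    (hypothetical layout). Peaks within +/- ppm of target_mz are summed.
    """
    tolerance = target_mz * ppm / 1e6
    xic = []
    for retention_time, peaks in spectra:
        intensity = sum(i for mz, i in peaks if abs(mz - target_mz) <= tolerance)
        xic.append((retention_time, intensity))
    return xic


def xic_area(xic):
    """Abundance estimate: trapezoidal area under the chromatogram."""
    return sum(
        (t2 - t1) * (i1 + i2) / 2.0
        for (t1, i1), (t2, i2) in zip(xic, xic[1:])
    )


if __name__ == "__main__":
    scans = [  # hypothetical MS1 scans
        (0.0, [(1000.000, 10.0), (1200.0, 99.0)]),
        (1.0, [(1000.005, 30.0)]),
        (2.0, [(1000.500, 50.0)]),
    ]
    xic = extract_xic(scans, target_mz=1000.0)
    print(xic, xic_area(xic))
```

Integrating over several scans is what distinguishes this from reading the abundance off a single spectrum, which is the intraspectrum approach the abstract contrasts against.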


Subject(s)
Algorithms; Proteins/analysis; Proteomics; Chromatography, Liquid; Mass Spectrometry
8.
J Proteome Res ; 14(12): 5396-407, 2015 Dec 04.
Article in English | MEDLINE | ID: mdl-26549055

ABSTRACT

Experimental evidence at the protein level from mass spectrometry and antibody experiments is essential to characterize the human proteome. neXtProt (2014-09 release) reported 20 055 human proteins, including 16 491 proteins identified at the protein level and 3564 unidentified proteins. Excluding 616 proteins at the uncertain level, 2948 proteins were regarded as missing proteins. Missing proteins remained unidentified partly because of MS limitations and intrinsic properties of the proteins, for example, appearing only in specific diseases or tissues. Despite such reasons, it is desirable to explore issues affecting validation of missing proteins from an "ideal" shotgun analysis of the human proteome. We thus performed in silico digestions on the human proteins to generate all in silico fully digested peptides. With these presumed peptides, we investigated the identification of proteins without any unique peptide, the effect of sequence variants on protein identification, difficulties in identifying olfactory receptors, and highly similar proteins. Among all proteins with evidence at transcript level, G protein-coupled receptors and olfactory receptors, based on InterPro classification, were the largest families of proteins and exhibited more frequent variants. To identify missing proteins, the above analyses suggested including sequence variants in protein FASTA for database searching. Furthermore, evidence of unique peptides identified from MS experiments would be crucial for experimentally validating missing proteins.


Subject(s)
Proteomics/methods; Amino Acid Sequence; Annexins/chemistry; Annexins/genetics; Computational Biology/methods; Computer Simulation; Databases, Protein; Genetic Variation; Humans; Hydrophobic and Hydrophilic Interactions; Mass Spectrometry; Molecular Sequence Annotation; Molecular Sequence Data; Peptide Fragments/chemistry; Peptide Fragments/genetics; Peptide Fragments/isolation & purification; Proteolysis; Proteome/chemistry; Proteome/genetics; Proteome/isolation & purification; Proteomics/statistics & numerical data; Receptors, Odorant/chemistry; Receptors, Odorant/genetics; Receptors, Odorant/isolation & purification
9.
PLoS One ; 19(7): e0307176, 2024.
Article in English | MEDLINE | ID: mdl-39024250

ABSTRACT

Cancer immunotherapy enhances the body's natural immune system to combat cancer, offering the advantage of lowered side effects compared to traditional treatments because of its high selectivity and efficacy. Utilizing computational methods to identify tumor T cell antigens (TTCAs) is valuable in unraveling the biological mechanisms and enhancing the effectiveness of immunotherapy. In this study, we present ENCAP, a predictor for TTCA based on ensemble classifiers and diverse sequence features. Sequences were encoded as a feature vector of 4349 entries based on 57 different feature types, followed by feature engineering and hyperparameter optimization for machine learning models, respectively. The selected feature subsets of ENCAP are primarily composed of physicochemical properties, with several features specifically related to hydrophobicity and amphiphilicity. Two publicly available datasets were used for performance evaluation. ENCAP yields an AUC (area under the ROC curve) of 0.768 and an MCC (Matthews correlation coefficient) of 0.522 on the first independent test set. On the second test set, it achieves an AUC of 0.960 and an MCC of 0.789. Performance evaluations show that ENCAP generates 4.8% and 13.5% improvements in MCC over the state-of-the-art methods on two popular TTCA datasets, respectively. For the third test dataset of 71 experimentally validated TTCAs from the literature, ENCAP yields prediction accuracy of 0.873, achieving improvements ranging from 12% to 25.7% compared to three state-of-the-art methods. In general, the prediction accuracy is higher for sequences of fewer hydrophobic residues, and more hydrophilic and charged residues. The source code of ENCAP is freely available at https://github.com/YnnJ456/ENCAP.
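
Illustrative note on the MCC figures quoted in this abstract: the sketch below computes the Matthews correlation coefficient from the four cells of a binary confusion matrix, using the standard definition. The function name and the example counts are hypothetical and are not ENCAP's results.

```python
import math


def matthews_corrcoef(tp, fp, tn, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

    Returns 0.0 when any marginal total is zero (a common convention).
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


if __name__ == "__main__":
    # Hypothetical confusion matrix for a balanced test set of 100 peptides.
    print(matthews_corrcoef(tp=40, fp=10, tn=40, fn=10))
```

Unlike plain accuracy, MCC stays near zero for a classifier that only predicts the majority class, which is one reason it is often reported alongside AUC on imbalanced peptide datasets.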


Subject(s)
Antigens, Neoplasm; Computational Biology; Antigens, Neoplasm/immunology; Humans; Computational Biology/methods; Neoplasms/immunology; T-Lymphocytes/immunology; Machine Learning; Algorithms; ROC Curve
10.
Sci Rep ; 14(1): 14387, 2024 06 22.
Article in English | MEDLINE | ID: mdl-38909149

ABSTRACT

Angiogenesis is a key process for the proliferation and metastatic spread of cancer cells. Anti-angiogenic peptides (AAPs), with the capability of inhibiting angiogenesis, are promising candidates in cancer treatment. We propose AAPL, a sequence-based predictor to identify AAPs with machine learning models of improved prediction accuracy. Each peptide sequence was transformed to a vector of 4335 numeric values according to 58 different feature types, followed by a heuristic algorithm for feature selection. Next, the hyperparameters of six machine learning models were optimized with respect to the feature subset. We considered two datasets, one with entire peptide sequences and the other with 15 amino acids from peptide N-termini. AAPL achieved Matthews correlation coefficients of 0.671 and 0.756 for independent tests based on the two datasets, respectively, outperforming existing predictors by a range of 5.3% to 24.6%. Further analyses show that AAPL yields higher prediction accuracy for peptides with more hydrophobic residues, and fewer hydrophilic and charged residues. The source code of AAPL is available at https://github.com/yunzheng2002/Anti-angiogenic .


Subject(s)
Angiogenesis Inhibitors; Machine Learning; Peptides; Angiogenesis Inhibitors/chemistry; Angiogenesis Inhibitors/pharmacology; Peptides/chemistry; Peptides/pharmacology; Algorithms; Amino Acid Sequence; Humans
11.
Sci Rep ; 13(1): 14119, 2023 08 29.
Article in English | MEDLINE | ID: mdl-37644119

ABSTRACT

Isobaric labeling relative quantitation is one of the dominant proteomic quantitation technologies. Traditional quantitation pipelines for isobaric-labeled mass spectrometry data are based on sequence database searching. In this study, we present a novel quantitation pipeline that integrates sequence database searching, spectral library searching, and a feature-based peptide-spectrum-match (PSM) filter using various spectral features for filtering. The combined database and spectral library searching results in larger quantitation coverage, and the filter removes PSMs with larger quantitation errors, retaining those with higher quantitation accuracy. Quantitation results show that the proposed pipeline can improve the overall quantitation accuracy at the PSM and protein levels. To our knowledge, this is the first study that utilizes spectral library searching to improve isobaric labeling-based quantitation. For users to conveniently perform the proposed pipeline, we have implemented the feature-based filter as an executable for both Windows and Linux platforms; its executable files, user manual, and sample data sets are freely available at https://ms.iis.sinica.edu.tw/comics/Software_FPF.html . Furthermore, with the developed filter, the proposed pipeline is fully compatible with the Trans-Proteomic Pipeline.


Subject(s)
Databases, Nucleic Acid; Proteomics; Gene Library; Mass Spectrometry; Peptides
12.
Sci Rep ; 12(1): 2045, 2022 02 07.
Article in English | MEDLINE | ID: mdl-35132134

ABSTRACT

For identifying peptides and proteins from mass spectrometry (MS) data, spectral library searching has emerged as a complementary approach to conventional database searching. However, for the spectrum-centric analysis of data-independent acquisition (DIA) data, spectral library searching has not been widely exploited because existing spectral library search tools are mainly designed and optimized for the analysis of data-dependent acquisition (DDA) data. We present Calibr, a spectral library search tool for spectrum-centric DIA data analysis. Calibr optimizes spectrum preprocessing for pseudo MS2 spectra, generating an 8.11% increase in spectrum-spectrum match (SSM) number and a 7.49% increase in peptide number over the traditional preprocessing approach. When searching against the DDA-based spectral library, Calibr improves SSM number by 17.6-26.65% and peptide number by 18.45-37.31% over two state-of-the-art tools on three different data sets. Searching against the public spectral library from MassIVE, Calibr improves on state-of-the-art tools in SSM and peptide numbers by more than 31.49% and 25.24%, respectively, for two data sets. Our analyses indicate that the higher sensitivity of Calibr results from the use of various spectral similarity measures and statistical scores, coupled with machine learning-based statistical validation for FDR control. Calibr executable files including a graphical user-interface application are available at https://ms.iis.sinica.edu.tw/COmics/Software_CalibrWizard.html and https://sourceforge.net/projects/comics-calibr .


Subject(s)
Mass Spectrometry/methods; Peptide Library; Peptides/chemistry; Peptides/genetics; Proteins/chemistry; Proteins/genetics; Proteomics/methods; Databases as Topic; Datasets as Topic
13.
Article in English | MEDLINE | ID: mdl-36612710

ABSTRACT

The aim of the current study was to investigate the relationship between nasopharyngeal carcinoma (NPC) and dry eye disease (DED) using the National Health Insurance Research Database (NHIRD) of Taiwan. A retrospective cohort study was conducted, and patients with an NPC diagnosis were included. Next, one NPC patient was matched to four non-NPC participants via demographic data and systemic comorbidities. In total, 4184 and 16,736 participants were enrolled in the NPC and non-NPC groups, respectively. The primary outcome was the development of DED one year after the diagnosis of NPC. Cox proportional hazard regression was applied to estimate the adjusted hazard ratios (aHRs) with 95% confidence intervals (CIs) of DED. In this study, 717 and 2225 DED cases were found in the NPC and non-NPC groups, respectively, and the NPC group showed a significantly higher incidence of DED development compared to the non-NPC group (aHR: 1.45, 95% CI: 1.33−1.58, p < 0.0001) in the multivariable analysis. The other covariates that were positively correlated with DED development included age over 40 years, an education level higher than senior high school, hypertension, DM, allergic pulmonary diseases, allergic otolaryngologic diseases, and allergic dermatological diseases (all p < 0.05). In conclusion, the presence of NPC is an independent risk factor for subsequent DED.
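
Illustrative note on the cohort counts in this abstract: the sketch below checks the 1:4 matching ratio and computes a crude (unadjusted) risk ratio from the reported case counts. This is simple arithmetic on the published numbers and is not the study's Cox model. It ignores follow-up time and covariates, so it is not expected to equal the adjusted hazard ratio of 1.45. The function name is hypothetical.

```python
def crude_risk_ratio(events_exposed, n_exposed, events_control, n_control):
    """Ratio of cumulative incidence in the exposed vs. the control group."""
    risk_exposed = events_exposed / n_exposed
    risk_control = events_control / n_control
    return risk_exposed / risk_control


# Counts taken from the abstract: 717 of 4184 (NPC) and 2225 of 16,736 (non-NPC).
matching_ratio = 16736 / 4184
rr = crude_risk_ratio(717, 4184, 2225, 16736)

if __name__ == "__main__":
    print(f"controls per case: {matching_ratio:.0f}, crude risk ratio: {rr:.2f}")
```

The crude ratio comes out near 1.29, lower than the adjusted estimate, which is consistent with the adjustment for covariates and follow-up time changing the comparison.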


Subject(s)
Dry Eye Syndromes; Hypersensitivity; Nasopharyngeal Neoplasms; Humans; Adult; Nasopharyngeal Carcinoma/epidemiology; Nasopharyngeal Carcinoma/complications; Cohort Studies; Retrospective Studies; Risk Factors; Hypersensitivity/complications; Dry Eye Syndromes/epidemiology; Dry Eye Syndromes/etiology; Taiwan/epidemiology; Nasopharyngeal Neoplasms/epidemiology; Nasopharyngeal Neoplasms/complications
14.
Sci Rep ; 11(1): 2233, 2021 01 26.
Article in English | MEDLINE | ID: mdl-33500498

ABSTRACT

Mass spectrometry-based proteomics using isobaric labeling for multiplex quantitation has become a popular approach for proteomic studies. We present Multi-Q 2, an isobaric-labeling quantitation tool which can yield the largest quantitation coverage and improved quantitation accuracy compared to three state-of-the-art methods. Multi-Q 2 supports identification results from several popular proteomic data analysis platforms for quantitation, offering up to 12% improvement in quantitation coverage for accepting identification results from multiple search engines when compared with MaxQuant and PatternLab. It is equipped with various quantitation algorithms, including a ratio compression correction algorithm, and results in up to 336 algorithmic combinations. Systematic evaluation shows different algorithmic combinations have different strengths and are suitable for different situations. We also demonstrate that the flexibility of Multi-Q 2 in customizing algorithmic combination can lead to improved quantitation accuracy over existing tools. Moreover, the use of complementary algorithmic combinations can be an effective strategy to enhance sensitivity when searching for biomarkers from differentially expressed proteins in proteomic experiments. Multi-Q 2 provides interactive graphical interfaces to process quantitation and to display ratios at protein, peptide, and spectrum levels. It also supports a heatmap module, enabling users to cluster proteins based on their abundance ratios and to visualize the clustering results. Multi-Q 2 executable files, sample data sets, and user manual are freely available at http://ms.iis.sinica.edu.tw/COmics/Software_Multi-Q2.html .

15.
BMC Bioinformatics ; 10 Suppl 15: S8, 2009 Dec 03.
Article in English | MEDLINE | ID: mdl-19958518

ABSTRACT

BACKGROUND: The study of protein subcellular localization (PSL) is important for elucidating protein functions involved in various cellular processes. However, determining the localization sites of a protein through wet-lab experiments can be time-consuming and labor-intensive. Thus, computational approaches become highly desirable. Most of the PSL prediction systems are established for single-localized proteins. However, a significant number of eukaryotic proteins are known to be localized into multiple subcellular organelles. Many studies have shown that proteins may simultaneously locate or move between different cellular compartments and be involved in different biological processes with different roles. RESULTS: In this study, we propose a knowledge-based method, called KnowPredsite, to predict the localization site(s) of both single-localized and multi-localized proteins. Based on the local similarity, we can identify the "related sequences" for prediction. We construct a knowledge base to record the possible sequence variations for protein sequences. When predicting the localization annotation of a query protein, we search against the knowledge base and use a scoring mechanism to determine the predicted sites. We downloaded the dataset from ngLOC, which consisted of ten distinct subcellular organelles from 1923 species, and performed ten-fold cross validation experiments to evaluate KnowPredsite's performance. The experimental results show that KnowPredsite achieves higher prediction accuracy than ngLOC and the Blast-hit method. For single-localized proteins, the overall accuracy of KnowPredsite is 91.7%. For multi-localized proteins, the overall accuracy of KnowPredsite is 72.1%, which is significantly higher than that of ngLOC by 12.4%. Notably, half of the proteins in the dataset that cannot find any Blast hit sequence above a specified threshold can still be correctly predicted by KnowPredsite. CONCLUSION: KnowPredsite demonstrates the power of identifying related sequences in the knowledge base. The experimental results show that even though the sequence similarity is low, the local similarity is effective for prediction. They also show that KnowPredsite is a highly accurate prediction method for both single- and multi-localized proteins. It is worth mentioning that the prediction process of KnowPredsite is transparent and biologically interpretable, and it shows the set of template sequences used to generate the prediction result. The KnowPredsite prediction server is available at http://bio-cluster.iis.sinica.edu.tw/kbloc/.
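
Illustrative note on the ten-fold cross validation mentioned in this abstract: the sketch below shows one generic way to partition a dataset into ten folds so that each sample is held out for testing exactly once. It is not the authors' protocol. The shuffling seed, the interleaved fold assignment, and the function name are assumptions made for the example.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds.

    Indices are shuffled once with a fixed seed (assumption) and dealt
    round-robin into k folds, so fold sizes differ by at most one.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


if __name__ == "__main__":
    for fold_number, (train, test) in enumerate(k_fold_indices(25), start=1):
        print(fold_number, len(train), len(test))
```

An overall accuracy figure is then typically the fraction of correct predictions pooled over all ten held-out folds.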


Subject(s)
Computational Biology/methods; Proteins/analysis; Proteins/chemistry; Databases, Protein; Eukaryota; Knowledge Bases; Sequence Analysis, Protein/methods; Software
16.
Bioinformatics ; 24(23): 2691-7, 2008 Dec 01.
Article in English | MEDLINE | ID: mdl-18974075

ABSTRACT

MOTIVATION: Regulatory proteases modulate proteomic dynamics with a spectrum of specificities against substrate proteins. Predictions of the substrate sites in a proteome for the proteases would facilitate understanding the biological functions of the proteases. High-throughput experiments could generate suitable datasets for machine learning to grasp complex relationships between the substrate sequences and the enzymatic specificities. However, the capability of predicting protease substrate sites by integrating machine learning algorithms with the experimental methodology has yet to be demonstrated. RESULTS: Factor Xa, a key regulatory protease in the blood coagulation system, was used as a model system, for which effective substrate site predictors were developed and benchmarked. The predictors were derived from bootstrap aggregation (machine learning) algorithms trained with data obtained from multilevel substrate phage display experiments. The experimental sampling and computational learning on substrate specificities can be generalized to proteases for which the active forms are available for the in vitro experiments. AVAILABILITY: http://asqa.iis.sinica.edu.tw/fXaWeb/


Subject(s)
Artificial Intelligence , Computational Biology/methods , Peptide Hydrolases/chemistry , Peptide Library , Algorithms , Animals , Binding Sites , Computer Simulation , Protein Databases , Humans , Kinetics , Biological Models , Substrate Specificity
17.
Sci Rep ; 9(1): 15975, 2019 11 04.
Article in English | MEDLINE | ID: mdl-31685900

ABSTRACT

N-linked glycosylation is one of the predominant post-translational modifications and is involved in a number of biological functions. Since experimental characterization of glycosites is challenging, glycosite prediction is crucial. Several predictors have been made available and report high performance. However, most of them evaluate their performance at every asparagine in protein sequences rather than only at asparagines in the N-X-S/T sequon. In this paper, we present N-GlyDE, a two-stage prediction tool trained on rigorously constructed non-redundant datasets to predict N-linked glycosites in the human proteome. The first stage uses a protein similarity voting algorithm trained on both glycoproteins and non-glycoproteins to predict a score for a protein to improve glycosite prediction. The second stage uses a support vector machine to predict N-linked glycosites by utilizing features of gapped dipeptides, pattern-based predicted surface accessibility, and predicted secondary structure. N-GlyDE's final predictions are derived from a weight adjustment of the second-stage prediction results based on the first-stage prediction score. Evaluated on N-X-S/T sequons of an independent dataset comprising 53 glycoproteins and 33 non-glycoproteins, N-GlyDE achieves an accuracy and MCC of 0.740 and 0.499, respectively, outperforming the compared tools. The N-GlyDE web server is available at http://bioapp.iis.sinica.edu.tw/N-GlyDE/.
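
To make the evaluation setting concrete, the sketch below locates candidate asparagines in the N-X-S/T sequon and computes the Matthews correlation coefficient (MCC) from a confusion matrix. It assumes the conventional definition of the sequon, in which X is any residue except proline. The function names are illustrative and are not part of N-GlyDE.

```python
import math
import re

# N-X-S/T sequon: Asn, then any residue except Pro, then Ser or Thr.
# The lookahead consumes only the Asn, so overlapping sequons are all found.
SEQUON = re.compile(r"N(?=[^P][ST])")

def find_sequons(sequence):
    """Return 1-based positions of asparagines located in an N-X-S/T sequon."""
    return [m.start() + 1 for m in SEQUON.finditer(sequence.upper())]

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# In this made-up sequence the Asn at position 5 is followed by Pro,
# so it is not reported as a sequon.
positions = find_sequons("MNGTNPSNAS")
```

Restricting evaluation to these positions matters because asparagines outside a sequon are easy negatives, so counting them tends to inflate accuracy.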

18.
J Bioinform Comput Biol ; 4(6): 1287-307, 2006 Dec.
Article in English | MEDLINE | ID: mdl-17245815

ABSTRACT

Local structure prediction can facilitate ab initio structure prediction, protein threading, and remote homology detection. However, the accuracy of existing methods is limited. In this paper, we propose a knowledge-based prediction method that assigns a measure called the local match rate to each position of an amino acid sequence to estimate the confidence of our method. Empirically, the accuracy of the method correlates positively with the local match rate; therefore, we employ the knowledge-based method to predict the local structures of positions with a high local match rate. For positions with a low local match rate, we propose a neural network prediction method. To better utilize the knowledge-based and neural network methods, we design a hybrid prediction method, HYPLOSP (HYbrid method to Protein LOcal Structure Prediction), that combines both. To evaluate the performance of the proposed methods, we first perform cross-validation experiments by applying our knowledge-based method, a neural network method, and HYPLOSP to a large dataset of 3,925 protein chains. We test our methods extensively on three different structural alphabets and evaluate their performance by two widely used criteria: Maximum Deviation of backbone torsion Angle (MDA) and Q(N), which is similar to Q(3) in secondary structure prediction. We then compare HYPLOSP with three previous studies using a dataset of 56 new protein chains. HYPLOSP shows promising results in terms of MDA and Q(N) accuracy and demonstrates its alphabet-independent capability.
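
The hybrid rule described above can be sketched as a per-position switch between the two predictors. The abstract does not state the cutoff that separates "high" from "low" local match rates, so the `threshold` value below is purely an assumption, as are the function names. `q_n` is written as the fraction of positions assigned the correct structural letter, by analogy with Q(3).

```python
def hybrid_predict(match_rates, kb_labels, nn_labels, threshold=0.5):
    """Pick the knowledge-based label where the local match rate is high,
    otherwise fall back to the neural network label.

    threshold is a hypothetical cutoff; the abstract does not give one.
    """
    if not (len(match_rates) == len(kb_labels) == len(nn_labels)):
        raise ValueError("all inputs must cover the same positions")
    return [kb if rate >= threshold else nn
            for rate, kb, nn in zip(match_rates, kb_labels, nn_labels)]

def q_n(predicted, observed):
    """Fraction of positions whose predicted structural letter is correct."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)

# Made-up example with a three-position sequence and arbitrary letters.
combined = hybrid_predict([0.9, 0.2, 0.7], ["a", "b", "c"], ["x", "y", "z"])
```

Because the rule only compares labels, it does not depend on which structural alphabet is used.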


Subject(s)
Algorithms , Artificial Intelligence , Automated Pattern Recognition/methods , Proteins/chemistry , Sequence Alignment/methods , Protein Sequence Analysis/methods , Software , Amino Acid Sequence , Computer Simulation , Chemical Models , Molecular Models , Molecular Sequence Data , Amino Acid Sequence Homology
19.
PLoS One ; 11(1): e0146112, 2016.
Article in English | MEDLINE | ID: mdl-26784691

ABSTRACT

Efficient and accurate quantitation of metabolites from LC-MS data has become an important topic. Here we present an automated tool, called iMet-Q (intelligent Metabolomic Quantitation), for label-free metabolomics quantitation from high-throughput MS1 data. By performing peak detection and peak alignment, iMet-Q provides a summary of quantitation results and reports ion abundance at both the replicate level and the sample level. Furthermore, it gives the charge states and isotope ratios of detected metabolite peaks to facilitate metabolite identification. An in-house standard mixture and a public Arabidopsis metabolome data set were analyzed by iMet-Q. Three public quantitation tools, XCMS, MetAlign, and MZmine 2, were used for performance comparison. From the mixture data set, seven standard metabolites were detected by the four quantitation tools, for which iMet-Q had a smaller quantitation error of 12% in both profile and centroid data sets. Our tool also correctly determined the charge states of the seven standard metabolites. By searching the mass values for those standard metabolites against the Human Metabolome Database, we obtained a total of 183 metabolite candidates. With the isotope ratios calculated by iMet-Q, 49% (89 of 183) of the metabolite candidates were filtered out. From the public Arabidopsis data set reported with two internal standards and 167 elucidated metabolites, iMet-Q detected all of the peaks corresponding to the internal standards and the 167 metabolites. Meanwhile, our tool had small abundance variation (≤ 0.19) when quantifying the two internal standards and had higher abundance correlation (≥ 0.92) when quantifying the 167 metabolites. iMet-Q provides user-friendly interfaces and is publicly available for download at http://ms.iis.sinica.edu.tw/comics/Software_iMet-Q.html.
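
The abstract does not say how iMet-Q derives charge states or applies isotope ratios. As a rough illustration of why these two quantities help identification, the sketch below uses two textbook relations: adjacent isotopic peaks are spaced about 1.003355/z in m/z, and the M+1/M intensity ratio grows with carbon count at roughly the natural 13C abundance. The carbon-only approximation, the tolerances, and all function names are assumptions and are not iMet-Q's algorithm.

```python
C13_C12_DELTA = 1.003355   # mass difference between 13C and 12C, in Da
C13_ABUNDANCE = 0.0107     # approximate natural abundance of 13C

def infer_charge(mz_peaks, max_charge=4, tol=0.01):
    """Infer charge z from the m/z spacing of the first two isotopic peaks.

    Returns None when the spacing matches no charge in 1..max_charge.
    """
    peaks = sorted(mz_peaks)
    if len(peaks) < 2:
        return None
    spacing = peaks[1] - peaks[0]
    for z in range(1, max_charge + 1):
        if abs(spacing - C13_C12_DELTA / z) <= tol:
            return z
    return None

def expected_m1_ratio(n_carbons):
    """Carbon-only estimate of the M+1/M intensity ratio (ignores N, H, O, S)."""
    return n_carbons * C13_ABUNDANCE / (1 - C13_ABUNDANCE)

def ratio_is_consistent(observed_ratio, n_carbons, rel_tol=0.3):
    """Keep a candidate formula only if its expected ratio is near the observed one."""
    expected = expected_m1_ratio(n_carbons)
    return abs(observed_ratio - expected) <= rel_tol * expected

# Made-up peaks: a spacing of about 0.5017 m/z corresponds to z = 2.
charge = infer_charge([300.1000, 300.6017])
```

A filter like `ratio_is_consistent` shows how an observed isotope ratio can rule out database candidates whose formulas imply a very different carbon count.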


Subject(s)
Metabolome , Metabolomics/methods , Software , Arabidopsis/metabolism , Humans
20.
Genome Biol ; 17(1): 184, 2016 09 07.
Article in English | MEDLINE | ID: mdl-27604469

ABSTRACT

BACKGROUND: A major bottleneck in our understanding of the molecular underpinnings of life is the assignment of function to proteins. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted purview have led to an increasing role for computational function prediction. However, assessing methods for protein function prediction and tracking progress in the field remain challenging. RESULTS: We conducted the second critical assessment of functional annotation (CAFA), a timed challenge to assess computational methods that automatically assign protein function. We evaluated 126 methods from 56 research groups for their ability to predict biological functions using Gene Ontology and gene-disease associations using Human Phenotype Ontology on a set of 3681 proteins from 18 species. CAFA2 featured expanded analysis compared with CAFA1 with regard to data set size, variety, and assessment metrics. To review progress in the field, the analysis compared the best methods from CAFA1 to those of CAFA2. CONCLUSIONS: The top-performing methods in CAFA2 outperformed those from CAFA1. This increased accuracy can be attributed to a combination of the growing number of experimental annotations and improved methods for function prediction. The assessment also revealed that the definition of top-performing algorithms is ontology specific and that different performance metrics can be used to probe the nature of accurate predictions, as well as the relative diversity of predictions in the biological process and human phenotype ontologies. While there was methodological improvement between CAFA1 and CAFA2, the interpretation of results and usefulness of individual methods remain context-dependent.
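
The abstract refers to assessment metrics without naming them. One metric commonly associated with CAFA evaluations is the protein-centric F-max, the maximum F-measure over all score thresholds. The sketch below is a simplified version under stated assumptions: it omits propagation of terms to ontology ancestors, every benchmark protein is assumed to have at least one true term, and the term and protein identifiers are hypothetical.

```python
def f_max(predictions, truth, thresholds=None):
    """Protein-centric F-max.

    predictions: {protein: {term: score in [0, 1]}}
    truth:       {protein: non-empty set of true terms}
    """
    if thresholds is None:
        thresholds = [t / 100 for t in range(1, 101)]
    best = 0.0
    for tau in thresholds:
        precisions, recalls = [], []
        for protein, true_terms in truth.items():
            scored = predictions.get(protein, {})
            predicted = {term for term, s in scored.items() if s >= tau}
            hits = len(predicted & true_terms)
            # Recall is averaged over every benchmark protein.
            recalls.append(hits / len(true_terms))
            # Precision is averaged only over proteins with a prediction at tau.
            if predicted:
                precisions.append(hits / len(predicted))
        if not precisions:
            continue
        pr = sum(precisions) / len(precisions)
        rc = sum(recalls) / len(recalls)
        if pr + rc > 0:
            best = max(best, 2 * pr * rc / (pr + rc))
    return best

# Hypothetical predictions and annotations for two proteins.
toy_predictions = {"P1": {"GO:A": 0.9, "GO:B": 0.4}, "P2": {"GO:A": 0.8}}
toy_truth = {"P1": {"GO:A"}, "P2": {"GO:A", "GO:C"}}
score = f_max(toy_predictions, toy_truth)
```

Because the best threshold and the resulting score depend on the ontology and on which proteins receive predictions, a single ranking of methods rarely holds across all settings, which is consistent with the context dependence noted above.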


Subject(s)
Computational Biology , Proteins/chemistry , Software , Structure-Activity Relationship , Algorithms , Protein Databases , Gene Ontology , Humans , Molecular Sequence Annotation , Proteins/genetics