Results 1 - 20 of 90
1.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38609331

ABSTRACT

Natural language processing (NLP) has become an essential technique in various fields, offering a wide range of possibilities for analyzing data and developing diverse NLP tasks. In the biomedical domain, understanding the complex relationships between compounds and proteins is critical, especially in the context of signal transduction and biochemical pathways. Among these relationships, protein-protein interactions (PPIs) are of particular interest, given their potential to trigger a variety of biological reactions. To improve the ability to predict PPI events, we propose the protein event detection dataset (PEDD), which comprises 6823 abstracts, 39 488 sentences and 182 937 gene pairs. Our PEDD dataset has been utilized in the AI CUP Biomedical Paper Analysis competition, where systems are challenged to predict 12 different relation types. In this paper, we review the state-of-the-art relation extraction research and provide an overview of the PEDD's compilation process. Furthermore, we present the results of the PPI extraction competition and evaluate several language models' performances on the PEDD. This paper's outcomes will provide a valuable roadmap for future studies on protein event detection in NLP. By addressing this critical challenge, we hope to enable breakthroughs in drug discovery and enhance our understanding of the molecular mechanisms underlying various diseases.


Subject(s)
Drug Discovery , Natural Language Processing , Signal Transduction
2.
Brief Bioinform ; 21(6): 2219-2238, 2020 12 01.
Article in English | MEDLINE | ID: mdl-32602538

ABSTRACT

Natural language processing (NLP) is widely applied in biological domains to retrieve information from publications. Systems to address numerous applications exist, such as biomedical named entity recognition (BNER), named entity normalization (NEN) and protein-protein interaction extraction (PPIE). High-quality datasets can assist the development of robust and reliable systems; however, due to the endless applications and evolving techniques, the annotations of benchmark datasets may become outdated and inappropriate. In this study, we first review commonly used BNER datasets and their potential annotation problems such as inconsistency and low portability. Then, we introduce a revised version of the JNLPBA dataset that solves potential problems in the original and use state-of-the-art named entity recognition systems to evaluate its portability to different kinds of biomedical literature, including protein-protein interaction and biology events. Lastly, we introduce an ensembled biomedical entity dataset (EBED) by extending the revised JNLPBA dataset with PubMed Central full-text paragraphs, figure captions and patent abstracts. This EBED is a multi-task dataset that covers annotations including gene, disease and chemical entities. In total, it contains 85000 entity mentions, 25000 entity mentions with database identifiers and 5000 attribute tags. To demonstrate the usage of the EBED, we review the BNER track from the AI CUP Biomedical Paper Analysis challenge. Availability: The revised JNLPBA dataset is available at https://iasl-btm.iis.sinica.edu.tw/BNER/Content/Revised_JNLPBA.zip. The EBED dataset is available at https://iasl-btm.iis.sinica.edu.tw/BNER/Content/AICUP_EBED_dataset.rar. Contact: Email: thtsai@g.ncu.edu.tw, Tel. 886-3-4227151 ext. 35203, Fax: 886-3-422-2681; Email: hsu@iis.sinica.edu.tw, Tel. 886-2-2788-3799 ext. 2211, Fax: 886-2-2782-4814. Supplementary information: Supplementary data are available at Briefings in Bioinformatics online.


Subject(s)
Data Mining , Information Storage and Retrieval , Natural Language Processing , Benchmarking , Computational Biology/methods , Data Mining/methods , Databases, Factual , Neural Networks, Computer , PubMed , Software , Surveys and Questionnaires
3.
Bioinformatics ; 37(3): 404-412, 2021 04 20.
Article in English | MEDLINE | ID: mdl-32810217

ABSTRACT

MOTIVATION: Natural Language Processing techniques are constantly being advanced to accommodate the influx of data as well as to provide exhaustive and structured knowledge dissemination. Within the biomedical domain, relation detection between bio-entities known as the Bio-Entity Relation Extraction (BRE) task has a critical function in knowledge structuring. Although recent advances in deep learning-based biomedical domain embedding have improved BRE predictive analytics, these works are often task selective or use external knowledge-based pre-/post-processing. In addition, deep learning-based models do not account for local syntactic contexts, which have improved data representation in many kernel classifier-based models. In this study, we propose a universal BRE model, i.e. LBERT, which is a Lexically aware Transformer-based Bidirectional Encoder Representation model, and which explores both local and global contexts representations for sentence-level classification tasks. RESULTS: This article presents one of the most exhaustive BRE studies ever conducted over five different bio-entity relation types. Our model outperforms state-of-the-art deep learning models in protein-protein interaction (PPI), drug-drug interaction and protein-bio-entity relation classification tasks by 0.02%, 11.2% and 41.4%, respectively. LBERT representations show a statistically significant improvement over BioBERT in detecting true bio-entity relation for large corpora like PPI. Our ablation studies clearly indicate the contribution of the lexical features and distance-adjusted attention in improving prediction performance by learning additional local semantic context along with bi-directionally learned global context. AVAILABILITY AND IMPLEMENTATION: Github. https://github.com/warikoone/LBERT. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Knowledge Bases , Natural Language Processing , Language , Research Design , Semantics
4.
Nucleic Acids Res ; 48(D1): D148-D154, 2020 01 08.
Article in English | MEDLINE | ID: mdl-31647101

ABSTRACT

MicroRNAs (miRNAs) are small non-coding RNAs (typically consisting of 18-25 nucleotides) that negatively control expression of target genes at the post-transcriptional level. Owing to the biological significance of miRNAs, miRTarBase was developed to provide comprehensive information on experimentally validated miRNA-target interactions (MTIs). To date, the database has accumulated >13,404 validated MTIs from 11,021 articles from manual curations. In this update, a text-mining system was incorporated to enhance the recognition of MTI-related articles by adopting a scoring system. In addition, a variety of biological databases were integrated to provide information on the regulatory network of miRNAs and its expression in blood. Not only targets of miRNAs but also regulators of miRNAs are provided to users for investigating the up- and downstream regulations of miRNAs. Moreover, the number of MTIs with high-throughput experimental evidence increased remarkably (validated by CLIP-seq technology). In conclusion, these improvements promote the miRTarBase as one of the most comprehensively annotated and experimentally validated miRNA-target interaction databases. The updated version of miRTarBase is now available at http://miRTarBase.cuhk.edu.cn/.


Subject(s)
Databases, Nucleic Acid , MicroRNAs/metabolism , Circulating MicroRNA/metabolism , Data Mining , Gene Expression Regulation , RNA, Messenger/metabolism , User-Computer Interface
5.
BMC Bioinformatics ; 22(1): 389, 2021 Jul 30.
Article in English | MEDLINE | ID: mdl-34330209

ABSTRACT

BACKGROUND: Antimicrobial peptides (AMPs) are oligopeptides that act as crucial components of innate immunity, naturally occur in all multicellular organisms, and are involved in the first line of defense. Recent studies showed that AMPs hold great potential that is not limited to antimicrobial activity. They are also crucial regulators of host immune responses that can modulate a wide range of activities, such as immune regulation, wound healing, and apoptosis. However, the ability of microorganisms to adapt and to resist existing antibiotics has prompted the scientific community to develop alternatives to conventional antibiotics. Therefore, to address this issue, we proposed Co-AMPpred, an in silico-aided AMP prediction method based on compositional features of amino acid residues to classify AMPs and non-AMPs. RESULTS: In our study, we developed a prediction method that incorporates composition-based sequence and physicochemical features into various machine-learning algorithms. Then, the Boruta feature-selection algorithm was used to identify discriminative biological features. Furthermore, we used only the discriminative biological features to develop our model. Additionally, we performed stratified tenfold cross-validation to validate the predictive performance of our AMP prediction model and evaluated it on an independent holdout test dataset. A benchmark dataset was collected from previous studies to evaluate the predictive performance of our model. CONCLUSIONS: Experimental results show that combining composition-based and physicochemical features outperformed existing methods on both the benchmark training dataset and a reduced training dataset. Finally, our proposed method achieved an accuracy of 80.8% and an area under the receiver operating characteristic curve of 0.871 on the independent test set. Our code and datasets are available at https://github.com/onkarS23/CoAMPpred .


Subject(s)
Algorithms , Machine Learning , Computer Simulation , Pore Forming Cytotoxic Proteins , ROC Curve
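
As an illustration of the "compositional features of amino acid residues" mentioned in the abstract above, the following is a minimal sketch of one common composition-based descriptor, the fraction of each of the 20 standard residues in a peptide. It is not taken from the Co-AMPpred code; the function name `amino_acid_composition` and the example peptide are assumptions made for illustration only.

```python
from collections import Counter

# The 20 standard amino acids, in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the fraction of each standard residue in the sequence.

    Hypothetical helper for illustration; non-standard characters are ignored.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


# Example peptide chosen only to exercise the function.
features = amino_acid_composition("GIGKFLHSAKKFGKAFVGEIMNS")
```

A 20-element vector of this kind could then be fed, together with physicochemical descriptors, into a feature-selection step and a classifier; those later stages are outside the scope of this sketch.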
6.
Biochemistry ; 59(34): 3078-3088, 2020 09 01.
Article in English | MEDLINE | ID: mdl-31454239

ABSTRACT

Carbohydrates make up one of the four major classes of biomolecules, often conjugated with proteins as glycoproteins or with lipids as glycolipids, and participate in many important biochemical functions in living species. However, glycoproteins or glycolipids often exist as mixtures, and as a consequence, it is difficult to isolate individual glycoproteins or glycolipids in pure form to understand the role carbohydrates play in the glycoconjugate. Currently, the only feasible way to obtain pure glycoconjugates is through synthesis, and of the many methods developed for the synthesis of oligosaccharides, those with automatic and programmable potential are considered to be more effective for addressing the issues of carbohydrate diversity and related functions. In this Perspective, we describe how data science, including algorithms and machine learning, can be used to assist the chemical synthesis of oligosaccharides in a programmable and one-pot manner and how the programmable method can be used to accelerate the construction of diverse oligosaccharides to facilitate our understanding of glycosylation in biology.


Subject(s)
Chemistry Techniques, Synthetic/methods , Oligosaccharides/chemical synthesis , Machine Learning , Oligosaccharides/chemistry
7.
BMC Genomics ; 21(1): 182, 2020 Feb 24.
Article in English | MEDLINE | ID: mdl-32093618

ABSTRACT

BACKGROUND: Personal genomics and comparative genomics are becoming more important in clinical practice and genome research. Both fields require sequence alignment to discover sequence conservation and variation. Though many methods have been developed, some are designed for small genome comparison while others are not efficient for large genome comparison. Moreover, most existing genome comparison tools have not been systematically evaluated for the correctness of their sequence alignments. A wrong sequence alignment would produce false sequence variants. RESULTS: In this study, we present GSAlign, which handles large genome sequence alignment efficiently and identifies sequence variants from the alignment result. GSAlign is an efficient sequence alignment tool for intra-species genomes. We estimate performance by measuring the correctness of predicted sequence variations. The experimental results demonstrate that GSAlign is not only faster than most existing state-of-the-art methods, but also identifies sequence variants with high accuracy. CONCLUSIONS: As more genome sequences become available, the demand for genome comparison is increasing. Therefore an efficient and robust algorithm is highly desirable. We believe GSAlign can be a useful tool. It offers ultra-fast alignment as well as high accuracy and sensitivity for detecting sequence variations.


Subject(s)
Genome , Genomics/methods , Sequence Alignment/methods , Software , Algorithms , Sequence Analysis, DNA
8.
Nucleic Acids Res ; 46(D1): D296-D302, 2018 01 04.
Article in English | MEDLINE | ID: mdl-29126174

ABSTRACT

MicroRNAs (miRNAs) are small non-coding RNAs of ∼22 nucleotides that are involved in negative regulation of mRNA at the post-transcriptional level. Previously, we developed miRTarBase, which provides information about experimentally validated miRNA-target interactions (MTIs). Here, we describe an updated database containing 422 517 curated MTIs from 4076 miRNAs and 23 054 target genes collected from over 8500 articles. The number of MTIs curated by strong evidence has increased ∼1.4-fold since the last update in 2016. In this updated version, target sites validated by reporter assay that are available in the literature can be downloaded. New features can be extracted from the target site sequences for analysis via a machine learning approach, which can help to evaluate the performance of miRNA-target prediction tools. Furthermore, several browsing options make it easier for users to find specific MTIs. With these improvements, miRTarBase serves as a more comprehensively annotated, experimentally validated miRNA-target interaction database in the field of miRNA-related research. miRTarBase is available at http://miRTarBase.mbc.nctu.edu.tw/.


Subject(s)
Databases, Genetic , MicroRNAs/metabolism , RNA, Messenger/metabolism , Data Mining , Humans , RNA, Messenger/chemistry , User-Computer Interface
9.
Anal Chem ; 91(15): 9403-9406, 2019 08 06.
Article in English | MEDLINE | ID: mdl-31305071

ABSTRACT

Protein and peptide identification and quantitation are essential tasks in proteomics research and involve a series of steps in analyzing mass spectrometry data. Trans-Proteomic Pipeline (TPP) provides a wide range of useful tools through its web interfaces for analyses such as sequence database search, statistical validation, and quantitation. To utilize the powerful functionality of TPP without the need for manual intervention to launch each step, we developed a software tool, called WinProphet, to create and automatically execute a pipeline for proteomic analyses. It seamlessly integrates with TPP and other external command-line programs, supporting various functionalities, including database search for protein and peptide identification, spectral library construction and search, data-independent acquisition (DIA) data analysis, and isobaric labeling and label-free quantitation. WinProphet is a standalone, installation-free tool with graphical interfaces for users to configure, manage, and automatically execute pipelines. The constructed pipelines can be exported as XML files with all of the parameter settings for reusability and portability. The executable files, user manual, and sample data sets of WinProphet are freely available at  http://ms.iis.sinica.edu.tw/COmics/Software_WinProphet.html .


Subject(s)
Data Analysis , Proteomics/methods , Software , User-Computer Interface , Workflow
10.
Bioinformatics ; 34(2): 190-197, 2018 Jan 15.
Article in English | MEDLINE | ID: mdl-28968831

ABSTRACT

MOTIVATION: In recent years, massively parallel cDNA sequencing (RNA-Seq) technologies have become a powerful tool that provides high-resolution measurement of expression and high sensitivity in detecting low-abundance transcripts. However, RNA-seq data require a huge amount of computational effort. The most fundamental and critical step is to align each sequence fragment against the reference genome. Various de novo spliced RNA aligners have been developed in recent years. Though these aligners can handle spliced alignment and detect splice junctions, some challenges remain to be solved. With the advances in sequencing technologies and the ongoing collection of sequencing data in the ENCODE project, more efficient alignment algorithms are in high demand. Most read mappers follow the conventional seed-and-extend strategy to deal with inexact matches for sequence alignment. However, the extension step is much more time consuming than the seeding step. RESULTS: We proposed a novel RNA-seq de novo mapping algorithm, called DART, which adopts a partitioning strategy to avoid the extension step. The experimental results on synthetic datasets and real NGS datasets showed that DART is a highly efficient aligner that yields the highest or comparable sensitivity and accuracy compared to most state-of-the-art aligners, and more importantly, it spends the least amount of time among the selected aligners. AVAILABILITY AND IMPLEMENTATION: https://github.com/hsinnan75/DART. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

11.
Bioinformatics ; 33(15): 2281-2287, 2017 Aug 01.
Article in English | MEDLINE | ID: mdl-28379292

ABSTRACT

MOTIVATION: Next-generation sequencing (NGS) provides a great opportunity to investigate genome-wide variation at nucleotide resolution. Due to the huge amount of data, NGS applications require very fast and accurate alignment algorithms. Most existing algorithms for read mapping adopt a seed-and-extend strategy, which is sequential in nature and takes much longer on longer reads. RESULTS: We develop a divide-and-conquer algorithm, called Kart, which can process long reads as fast as short reads by dividing a read into small fragments that can be aligned independently. Our experimental results indicate that the average size of fragments requiring the more time-consuming gapped alignment is around 20 bp regardless of the original read length. Furthermore, it can tolerate much higher error rates. The experiments show that Kart spends much less time on longer reads than other aligners and still produces reliable alignments even when the error rate is as high as 15%. AVAILABILITY AND IMPLEMENTATION: Kart is available at https://github.com/hsinnan75/Kart/ . CONTACT: hsu@iis.sinica.edu.tw. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
High-Throughput Nucleotide Sequencing/methods , Sequence Alignment/methods , Software , Algorithms , Genetic Variation , Genome, Human , Humans , Sequence Analysis, DNA/methods
12.
Nucleic Acids Res ; 44(W1): W575-80, 2016 Jul 08.
Article in English | MEDLINE | ID: mdl-27084943

ABSTRACT

MAGIC-web is the first web server, to the best of our knowledge, that performs both untargeted and targeted analyses of mass spectrometry-based glycoproteomics data for site-specific N-linked glycoprotein identification. The first two modules, MAGIC and MAGIC+, are designed for untargeted and targeted analysis, respectively. MAGIC is implemented with our previously proposed novel Y1-ion pattern matching method, which adequately detects Y1- and Y0-ion without prior information of proteins and glycans, and then generates in silico MS(2) spectra that serve as input to a database search engine (e.g. Mascot) to search against a large-scale protein sequence database. On top of that, the newly implemented MAGIC+ allows users to determine glycopeptide sequences using their own protein sequence file. The third module, Reports Integrator, provides the service of combining protein identification results from Mascot and glycan-related information from MAGIC-web to generate a complete site-specific protein-glycan summary report. The last module, Glycan Search, is designed for the users who are interested in finding possible glycan structures with specific numbers and types of monosaccharides. The results from MAGIC, MAGIC+ and Reports Integrator can be downloaded via provided links whereas the annotated spectra and glycan structures can be visualized in the browser. MAGIC-web is accessible from http://ms.iis.sinica.edu.tw/MAGIC-web/index.html.


Subject(s)
Glycoproteins/analysis , Glycoproteins/chemistry , Internet , Polysaccharides/analysis , Polysaccharides/chemistry , Software , Computer Simulation , Databases, Protein , Glycopeptides/analysis , Glycopeptides/chemistry , Humans , Mass Spectrometry , Proteomics , Search Engine , User-Computer Interface , Web Browser
13.
Nucleic Acids Res ; 44(D1): D239-47, 2016 Jan 04.
Article in English | MEDLINE | ID: mdl-26590260

ABSTRACT

MicroRNAs (miRNAs) are small non-coding RNAs of approximately 22 nucleotides, which negatively regulate the gene expression at the post-transcriptional level. This study describes an update of the miRTarBase (http://miRTarBase.mbc.nctu.edu.tw/) that provides information about experimentally validated miRNA-target interactions (MTIs). The latest update of the miRTarBase expanded it to identify systematically Argonaute-miRNA-RNA interactions from 138 crosslinking and immunoprecipitation sequencing (CLIP-seq) data sets that were generated by 21 independent studies. The database contains 4966 articles, 7439 strongly validated MTIs (using reporter assays or western blots) and 348 007 MTIs from CLIP-seq. The number of MTIs in the miRTarBase has increased around 7-fold since the 2014 miRTarBase update. The miRNA and gene expression profiles from The Cancer Genome Atlas (TCGA) are integrated to provide an effective overview of this exponential growth in the miRNA experimental data. These improvements make the miRTarBase one of the more comprehensively annotated, experimentally validated miRNA-target interactions databases and motivate additional miRNA research efforts.


Subject(s)
Databases, Nucleic Acid , MicroRNAs/metabolism , RNA, Messenger/metabolism , Disease/genetics , Gene Expression Profiling , Humans , RNA, Messenger/chemistry , Sequence Analysis, RNA
14.
Anal Chem ; 89(24): 13128-13136, 2017 12 19.
Article in English | MEDLINE | ID: mdl-29165996

ABSTRACT

Top-down proteomics using liquid chromatography coupled with mass spectrometry has been increasingly applied for analyzing intact proteins to study genetic variation, alternative splicing, and post-translational modifications (PTMs) of the proteins (proteoforms). However, only a few tools have been developed for charge state deconvolution, monoisotopic/average molecular weight determination and quantitation of proteoforms from LC-MS1 spectra. Though Decon2LS and MASH Suite Pro have been available to provide intraspectrum charge state deconvolution and quantitation, manual processing is still required to quantify proteoforms across multiple MS1 spectra. An automated tool for interspectrum quantitation is a pressing need. Thus, in this paper, we present a user-friendly tool, called iTop-Q (intelligent Top-down Proteomics Quantitation), that automatically performs large-scale proteoform quantitation based on interspectrum abundance in top-down proteomics. Instead of using a single spectrum for proteoform quantitation, iTop-Q constructs extracted ion chromatograms (XICs) of possible proteoform peaks across adjacent MS1 spectra to calculate abundances for accurate quantitation. Notably, iTop-Q is implemented with a newly proposed algorithm, called DYAMOND, using dynamic programming for charge state deconvolution. In addition, iTop-Q performs proteoform alignment to support quantitation analysis across replicates/samples. The performance evaluations on an in-house standard data set and a public large-scale yeast lysate data set show that iTop-Q achieves highly accurate quantitation that is more consistent than intraspectrum quantitation. Furthermore, the DYAMOND algorithm is suitable for high charge state deconvolution and can distinguish shared peaks in coeluting proteoforms. iTop-Q is publicly available for download at http://ms.iis.sinica.edu.tw/COmics/Software_iTop-Q .


Subject(s)
Algorithms , Proteins/analysis , Proteomics , Chromatography, Liquid , Mass Spectrometry
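
The idea of an extracted ion chromatogram (XIC) referred to in the abstract above can be sketched in a few lines: intensities near a target m/z are summed in each MS1 spectrum, and the resulting trace is integrated over retention time. This is a generic illustration, not the iTop-Q implementation; the data layout (a list of `(retention_time, [(mz, intensity), ...])` tuples), the function names and the 0.01 m/z tolerance are all assumptions.

```python
def extract_ion_chromatogram(spectra, target_mz, tol=0.01):
    """Sum intensities within +/- tol of target_mz in each spectrum.

    spectra: list of (retention_time, [(mz, intensity), ...]) tuples
    (a hypothetical layout chosen for this sketch).
    """
    return [
        (rt, sum(inten for mz, inten in peaks if abs(mz - target_mz) <= tol))
        for rt, peaks in spectra
    ]


def xic_area(xic):
    """Integrate an XIC over retention time with the trapezoidal rule."""
    area = 0.0
    for (rt0, i0), (rt1, i1) in zip(xic, xic[1:]):
        area += (rt1 - rt0) * (i0 + i1) / 2.0
    return area


# Toy data: three adjacent spectra containing a peak near m/z 800.40.
toy_spectra = [
    (0.0, [(800.400, 10.0), (900.000, 5.0)]),
    (1.0, [(800.405, 30.0)]),
    (2.0, [(800.400, 10.0)]),
]
toy_xic = extract_ion_chromatogram(toy_spectra, target_mz=800.40)
toy_area = xic_area(toy_xic)
```

Integrating across adjacent spectra, rather than reading a single spectrum, is what makes the abundance estimate "interspectrum" in the sense used above.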
16.
J Proteome Res ; 14(12): 5396-407, 2015 Dec 04.
Article in English | MEDLINE | ID: mdl-26549055

ABSTRACT

Experimental evidence at the protein level from mass spectrometry and antibody experiments is essential to characterizing the human proteome. neXtProt (2014-09 release) reported 20 055 human proteins, including 16 491 proteins identified at the protein level and 3564 unidentified proteins. Excluding 616 proteins at the uncertain level, 2948 proteins were regarded as missing proteins. Missing proteins were unidentified partially due to MS limitations and intrinsic properties of proteins, for example, appearing only in specific diseases or tissues. Despite such reasons, it is desirable to explore issues affecting validation of missing proteins from an "ideal" shotgun analysis of the human proteome. We thus performed in silico digestions on the human proteins to generate all in silico fully digested peptides. With these presumed peptides, we investigated the identification of proteins without any unique peptide, the effect of sequence variants on protein identification, difficulties in identifying olfactory receptors, and highly similar proteins. Among all proteins with evidence at the transcript level, G protein-coupled receptors and olfactory receptors, based on InterPro classification, were the largest families of proteins and exhibited more frequent variants. To identify missing proteins, the above analyses suggested including sequence variants in the protein FASTA for database searching. Furthermore, evidence of unique peptides identified from MS experiments would be crucial for experimentally validating missing proteins.


Subject(s)
Proteomics/methods , Amino Acid Sequence , Annexins/chemistry , Annexins/genetics , Computational Biology/methods , Computer Simulation , Databases, Protein , Genetic Variation , Humans , Hydrophobic and Hydrophilic Interactions , Mass Spectrometry , Molecular Sequence Annotation , Molecular Sequence Data , Peptide Fragments/chemistry , Peptide Fragments/genetics , Peptide Fragments/isolation & purification , Proteolysis , Proteome/chemistry , Proteome/genetics , Proteome/isolation & purification , Proteomics/statistics & numerical data , Receptors, Odorant/chemistry , Receptors, Odorant/genetics , Receptors, Odorant/isolation & purification
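
The in silico digestion and unique-peptide analysis described in the abstract above can be sketched as follows. The abstract does not name the protease, so this sketch assumes the widely used trypsin rule (cleave after K or R unless the next residue is P) with no missed cleavages; the function names and the toy protein sequences are hypothetical.

```python
import re


def tryptic_digest(protein, min_length=1):
    """Fully digest a protein: cleave after K/R except before P (assumed rule)."""
    pieces = re.split(r"(?<=[KR])(?!P)", protein.upper())
    return [p for p in pieces if len(p) >= min_length]


def proteins_without_unique_peptides(proteins):
    """Return names of proteins whose peptides all occur in another protein too.

    proteins: dict mapping a protein name to its sequence.
    """
    owners = {}
    for name, seq in proteins.items():
        for peptide in set(tryptic_digest(seq)):
            owners.setdefault(peptide, set()).add(name)
    # A protein "has a unique peptide" if some peptide maps to it alone.
    has_unique = {next(iter(names)) for names in owners.values() if len(names) == 1}
    return sorted(set(proteins) - has_unique)


# Toy sequences: A and B are identical, so neither has a unique peptide.
toy_proteins = {"A": "MKAAR", "B": "MKAAR", "C": "MKGGR"}
undetectable = proteins_without_unique_peptides(toy_proteins)
```

In practice a minimum peptide length and a mass range would also be applied, since very short peptides are rarely observed by MS; `min_length` is included only as a hook for that.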
17.
Anal Chem ; 87(4): 2143-51, 2015 Feb 17.
Article in English | MEDLINE | ID: mdl-25543920

ABSTRACT

Metabolite identification remains a bottleneck in mass spectrometry (MS)-based metabolomics. Currently, this process relies heavily on tandem mass spectrometry (MS/MS) spectra generated separately for peaks of interest identified from previous MS runs. Such a delayed and labor-intensive procedure creates a barrier to automation. Further, information embedded in MS data has not been used to its full extent for metabolite identification. Multimers, adducts, multiply charged ions, and fragments of given metabolites occupy a substantial proportion (40-80%) of the peaks of a quantitation result. However, extensive information on these derivatives, especially fragments, may facilitate metabolite identification. We propose a procedure with automation capability to group and annotate peaks associated with the same metabolite in the quantitation results of opposite modes and to integrate this information for metabolite identification. In addition to the conventional mass and isotope ratio matches, we would match annotated fragments with low-energy MS/MS spectra in public databases. For identification of metabolites without accessible MS/MS spectra, we have developed characteristic fragment and common substructure matches. The accuracy and effectiveness of the procedure were evaluated using one public and two in-house liquid chromatography-mass spectrometry (LC-MS) data sets. The procedure accurately identified 89% of 28 standard metabolites with derivative ions in the data sets. With respect to effectiveness, the procedure confidently identified the correct chemical formula of at least 42% of metabolites with derivative ions via MS/MS spectrum, characteristic fragment, and common substructure matches. The confidence level was determined according to the fulfilled identification criteria of various matches and relative retention time.


Subject(s)
Metabolomics/methods , Tandem Mass Spectrometry/methods , Animals , Chromatography, Liquid/methods , Diabetes Mellitus, Experimental/metabolism , Diet , Ions/analysis , Ions/metabolism , Metabolome , Mice , Rats
18.
Anal Chem ; 87(4): 2466-73, 2015 Feb 17.
Article in English | MEDLINE | ID: mdl-25629585

ABSTRACT

Glycosylation is a highly complex modification influencing the functions and activities of proteins. Interpretation of intact glycopeptide spectra is crucial but challenging. In this paper, we present a mass spectrometry-based automated glycopeptide identification platform (MAGIC) to identify peptide sequences and glycan compositions directly from intact N-linked glycopeptide collision-induced-dissociation spectra. The identification of the Y1 (peptideY0 + GlcNAc) ion is critical for the correct analysis of unknown glycoproteins, especially without prior knowledge of the proteins and glycans present in the sample. To ensure accurate Y1-ion assignment, we propose a novel algorithm called Trident that detects a triplet pattern corresponding to [Y0, Y1, Y2] or [Y0-NH3, Y0, Y1] from the fragmentation of the common trimannosyl core of N-linked glycopeptides. To facilitate the subsequent peptide sequence identification by common database search engines, MAGIC generates in silico spectra by overwriting the original precursor with the naked peptide m/z and removing all of the glycan-related ions. Finally, MAGIC computes the glycan compositions and ranks them. For the model glycoprotein horseradish peroxidase (HRP) and a 5-glycoprotein mixture, a 2- to 31-fold increase in the relative intensities of the peptide fragments was achieved, which led to the identification of 7 tryptic glycopeptides from HRP and 16 glycopeptides from the mixture via Mascot. In the HeLa cell proteome data set, MAGIC processed over a thousand MS(2) spectra in 3 min on a PC and reported 36 glycopeptides from 26 glycoproteins. Finally, a remarkable false discovery rate of 0 was achieved on the N-glycosylation-free Escherichia coli data set. MAGIC is available at http://ms.iis.sinica.edu.tw/COmics/Software_MAGIC.html .


Subject(s)
Algorithms , Computational Biology , Glycopeptides/analysis , Software , Automation , Databases, Factual , Escherichia coli/chemistry , Glycopeptides/chemistry , HeLa Cells , Humans
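
To make the triplet pattern from the abstract above concrete, here is a rough sketch that scans a peak list for [Y0, Y1, Y2] or [Y0-NH3, Y0, Y1] spacings. It is not the Trident algorithm itself, whose scoring and filtering are not described here. It assumes that Y2 is Y1 plus one further GlcNAc, that all three ions share a known charge, and that a fixed m/z tolerance is adequate; the function name and default tolerance are illustrative.

```python
GLCNAC = 203.07937  # monoisotopic residue mass of GlcNAc (HexNAc), in Da
NH3 = 17.02655      # monoisotopic mass of ammonia, in Da


def find_triplets(mz_values, charge=1, tol=0.02):
    """Return (pattern, y0_mz, y1_mz) for every triplet found in the peak list."""
    peaks = sorted(mz_values)

    def has_peak(target):
        return any(abs(p - target) <= tol for p in peaks)

    step = GLCNAC / charge  # m/z spacing between Y0, Y1 and Y2
    loss = NH3 / charge     # m/z spacing between Y0-NH3 and Y0
    hits = []
    for y0 in peaks:
        y1 = y0 + step
        if not has_peak(y1):
            continue
        if has_peak(y1 + step):
            hits.append(("Y0,Y1,Y2", y0, y1))
        if has_peak(y0 - loss):
            hits.append(("Y0-NH3,Y0,Y1", y0, y1))
    return hits


# Toy peak list: one unrelated peak plus a singly charged Y0/Y1/Y2 ladder.
toy_peaks = [500.0, 1000.0, 1000.0 + GLCNAC, 1000.0 + 2 * GLCNAC]
toy_hits = find_triplets(toy_peaks)
```

Once Y1 is assigned, the naked peptide m/z (Y0) follows by subtracting one GlcNAc, which is the value the abstract says is written back as the precursor for database searching.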
19.
J Biomed Inform ; 58 Suppl: S150-S157, 2015 Dec.
Article in English | MEDLINE | ID: mdl-26432355

ABSTRACT

Electronic medical records (EMRs) for diabetic patients contain information about heart disease risk factors such as high blood pressure, cholesterol levels, and smoking status. Discovering the described risk factors and tracking their progression over time may support medical personnel in making clinical decisions, as well as facilitate data modeling and biomedical research. Such highly patient-specific knowledge is essential to driving the advancement of evidence-based practice, and can also help improve personalized medicine and care. One general approach for tracking the progression of diseases and their risk factors described in EMRs is to first recognize all temporal expressions, and then assign each of them to the nearest target medical concept. However, this method may not always provide the correct associations. In light of this, this work introduces a context-aware approach to assign the time attributes of the recognized risk factors by reconstructing contexts that contain more reliable temporal expressions. The evaluation results on the i2b2 test set demonstrate the efficacy of the proposed approach, which achieved an F-score of 0.897. To boost the approach's ability to process unstructured clinical text and to allow for the reproduction of the demonstrated results, a set of developed .NET libraries used to develop the system is available at https://sites.google.com/site/hongjiedai/projects/nttmuclinicalnet.


Subject(s)
Cardiovascular Diseases/epidemiology , Data Mining/methods , Diabetes Complications/epidemiology , Electronic Health Records/organization & administration , Narration , Natural Language Processing , Aged , Cardiovascular Diseases/diagnosis , Cohort Studies , Comorbidity , Computer Security , Confidentiality , Diabetes Complications/diagnosis , Disease Progression , Female , Humans , Incidence , Longitudinal Studies , Male , Middle Aged , Pattern Recognition, Automated/methods , Risk Assessment/methods , Taiwan/epidemiology , Vocabulary, Controlled
20.
BMC Bioinformatics ; 14: 304, 2013 Oct 11.
Article in English | MEDLINE | ID: mdl-24112406

ABSTRACT

BACKGROUND: Since membrane protein structures are challenging to crystallize, computational approaches are essential for elucidating the sequence-to-structure relationships. Structural modeling of membrane proteins requires a multidimensional approach, and one critical geometric parameter is the rotational angle of transmembrane helices. Rotational angles of transmembrane helices are characterized by their folded structures and could be inferred by the hydrophobic moment; however, the folding mechanism of membrane proteins is not yet fully understood. The rotational angle of a transmembrane helix is related to the exposed surface of a transmembrane helix, since lipid exposure gives the degree of accessibility of each residue in the lipid environment. To the best of our knowledge, there have been few advances in investigating whether an environment descriptor of lipid exposure could infer a geometric parameter of rotational angle. RESULTS: Here, we present an analysis of the relationship between rotational angles and lipid exposure and a support-vector-machine method, called TMexpo, for predicting both structural features from sequences. First, we observed from the development set of 89 protein chains that the lipid exposure, i.e., the relative accessible surface area (rASA) of residues in the lipid environment, generated from high-resolution protein structures could infer the rotational angles with a mean absolute angular error (MAAE) of 46.32°. More importantly, the predicted rASA from TMexpo achieved an MAAE of 51.05°, which is better than the 71.47° obtained by the best of the compared hydrophobicity scales. Lastly, TMexpo outperformed the compared methods in rASA prediction on the independent test set of 21 protein chains and achieved an overall Matthews correlation coefficient, accuracy, sensitivity, specificity, and precision of 0.51, 75.26%, 81.30%, 69.15%, and 72.73%, respectively. TMexpo is publicly available at http://bio-cluster.iis.sinica.edu.tw/TMexpo.
CONCLUSIONS: TMexpo can better predict rASA and rotational angles than the compared methods. When rotational angles can be accurately predicted, free modeling of transmembrane protein structures in turn may benefit from a reduced complexity in ensembles with significantly fewer packing arrangements. Furthermore, sequence-based prediction of both rotational angle and lipid exposure can provide essential information when high-resolution structures are unavailable and contribute to experimental design to elucidate transmembrane protein functions.
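The abstract reports a mean absolute angular error (MAAE) but does not define it. The sketch below shows one common way such a metric is computed for angles in degrees, where the difference wraps around at 360° so that 350° and 10° are 20° apart. Whether TMexpo uses exactly this definition is an assumption, and the function names are hypothetical.

```python
def angular_error(pred_deg, true_deg):
    """Smallest absolute difference between two angles, in degrees (0 to 180)."""
    diff = abs(pred_deg - true_deg) % 360.0
    return min(diff, 360.0 - diff)


def maae(pred, true):
    """Mean absolute angular error over paired angle lists, in degrees."""
    if len(pred) != len(true) or not pred:
        raise ValueError("pred and true must be non-empty and equally long")
    return sum(angular_error(p, t) for p, t in zip(pred, true)) / len(pred)


# Made-up angles for illustration: errors are 20 and 10 degrees.
print(maae([350.0, 90.0], [10.0, 100.0]))  # → 15.0
```

Under this definition a random predictor would average 90°, which gives some context for the reported values of 46.32°, 51.05°, and 71.47°.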


Subject(s)
Computational Biology/methods , Membrane Lipids/chemistry , Membrane Proteins/chemistry , Amino Acid Sequence , Hydrophobic and Hydrophilic Interactions , Membrane Lipids/metabolism , Membrane Proteins/metabolism , Molecular Sequence Data , Protein Structure, Secondary , Support Vector Machine