Results 1 - 20 of 65
1.
BMC Bioinformatics ; 24(1): 78, 2023 Mar 04.
Article in English | MEDLINE | ID: mdl-36870946

ABSTRACT

BACKGROUND: The Kyoto Encyclopedia of Genes and Genomes (KEGG) provides organized genomic, biomolecular, and metabolic information and knowledge that is reasonably current and highly useful for a wide range of analyses and modeling. KEGG follows the principles of data stewardship to be findable, accessible, interoperable, and reusable (FAIR) by providing RESTful access to their database entries via their web-accessible KEGG API. However, the overall FAIRness of KEGG is often limited by the library and software package support available in a given programming language. While R library support for KEGG is fairly strong, Python library support has been lacking. Moreover, there is no software that provides extensive command line level support for KEGG access and utilization. RESULTS: We present kegg_pull, a package implemented in the Python programming language that provides better KEGG access and utilization functionality than previous libraries and software packages. Not only does kegg_pull include an application programming interface (API) for Python programming, it also provides a command line interface (CLI) that enables utilization of KEGG for a wide range of shell scripting and data analysis pipeline use-cases. As kegg_pull's name implies, both the API and CLI provide versatile options for pulling (downloading and saving) an arbitrary (user defined) number of database entries from the KEGG API. Moreover, this functionality is implemented to efficiently utilize multiple central processing unit cores as demonstrated in several performance tests. Many options are provided to optimize fault-tolerant performance across a single or multiple processes, with recommendations provided based on extensive testing and practical network considerations. CONCLUSIONS: The new kegg_pull package enables new flexible KEGG retrieval use cases not available in previous software packages. The most notable new feature that kegg_pull provides is its ability to robustly pull an arbitrary number of KEGG entries with a single API method or CLI command, including pulling an entire KEGG database. We provide recommendations to users for the most effective use of kegg_pull according to their network and computational circumstances.
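
As a rough illustration of the request batching that pulling many entries from a REST service involves, the sketch below splits a list of entry IDs into chunks and builds one KEGG `get` URL per chunk. It is not kegg_pull's API or implementation: the helper name `build_get_urls` is hypothetical, and the base URL, the `+` separator, and the limit of 10 entries per `get` request are assumptions taken from the public KEGG REST documentation. No network request is made.

```python
from typing import Iterable, List

# Assumptions from the public KEGG REST documentation, not from kegg_pull.
KEGG_REST_BASE = "https://rest.kegg.jp"
MAX_ENTRIES_PER_GET = 10


def build_get_urls(entry_ids: Iterable[str],
                   chunk_size: int = MAX_ENTRIES_PER_GET) -> List[str]:
    """Split entry IDs into chunks and build one `get` URL per chunk.

    Hypothetical helper for illustration only.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    ids = list(entry_ids)
    urls = []
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        # Multiple entry IDs are joined with '+' in the URL path.
        urls.append(f"{KEGG_REST_BASE}/get/{'+'.join(chunk)}")
    return urls


if __name__ == "__main__":
    example_ids = [f"C{n:05d}" for n in range(1, 24)]  # 23 compound-style IDs
    for url in build_get_urls(example_ids):
        print(url)
```

Each URL could then be fetched by a separate worker process, which is one plausible way to spread a large pull across several CPU cores.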


Subject(s)
Data Analysis , Genomics , Gene Library , Databases, Factual , Knowledge
2.
BMC Bioinformatics ; 24(1): 299, 2023 Jul 24.
Article in English | MEDLINE | ID: mdl-37482620

ABSTRACT

BACKGROUND: An updated version of the mwtab Python package for programmatic access to the Metabolomics Workbench (MetabolomicsWB) data repository was released at the beginning of 2021. Along with updating the package to match the changes to MetabolomicsWB's 'mwTab' file format specification and enhancing the package's functionality, the included validation facilities were used to detect and catalog file inconsistencies and errors across all publicly available datasets in MetabolomicsWB. RESULTS: The MetabolomicsWB File Status website was developed to provide continuous validation of MetabolomicsWB data files and a useful interface to all found inconsistencies and errors. The list of detectable issues and errors includes format parsing errors, format compliance issues, access problems via MetabolomicsWB's REST interface, and other small inconsistencies that can hinder reusability. The website uses the mwtab Python package to pull down and validate each available analysis file and then generates an HTML report. The website is updated on a weekly basis. Moreover, the Python website design utilizes GitHub and GitHub.io, providing an easy-to-replicate template for implementing other metadata, virtual, and meta- repositories. CONCLUSIONS: The MetabolomicsWB File Status website provides a metadata repository of validation metadata to promote the FAIR use of existing metabolomics datasets from the MetabolomicsWB data repository.


Subject(s)
Metadata , Software , Metabolomics , Information Storage and Retrieval
3.
Hepatology ; 76(5): 1376-1388, 2022 Nov.
Article in English | MEDLINE | ID: mdl-35313030

ABSTRACT

BACKGROUND AND AIMS: Resolution of pathways that converge to induce deleterious effects in hepatic diseases, such as in the later stages, has potential antifibrotic effects that may improve outcomes. We aimed to explore whether humans and rodents display similar fibrotic signaling networks. APPROACH AND RESULTS: We assiduously mapped kinase pathways using 340 substrate targets, upstream bioinformatic analysis of kinase pathways, and over 2000 random sampling iterations using the PamGene PamStation kinome microarray chip technology. Using this technology, we characterized a large number of kinases with altered activity in liver fibrosis of both species. Gene expression and immunostaining analyses validated many of these kinases as bona fide signaling events. Surprisingly, the insulin receptor emerged as a considerable protein tyrosine kinase that is hyperactive in fibrotic liver disease in humans and rodents. Discoidin domain receptor tyrosine kinase, activated by collagen that increases during fibrosis, was another hyperactive protein tyrosine kinase in humans and rodents with fibrosis. The serine/threonine kinases found to be the most active in fibrosis were dystrophy type 1 protein kinase and members of the protein kinase family of kinases. We compared the fibrotic events over four models: humans with cirrhosis and three murine models with differing levels of fibrosis, including two models of fatty liver disease with emerging fibrosis. The data demonstrate a high concordance between human and rodent hepatic kinome signaling that focalizes, as shown by our network analysis of detrimental pathways. CONCLUSIONS: Our findings establish a comprehensive kinase atlas for liver fibrosis, which identifies analogous signaling events conserved among humans and rodents.


Subject(s)
Liver Diseases , Receptor, Insulin , Humans , Mice , Animals , Receptor, Insulin/metabolism , Rodentia , Liver Cirrhosis/pathology , Liver/pathology , Liver Diseases/pathology , Fibrosis , Protein Kinases/metabolism , Collagen/metabolism , Serine/metabolism , Discoidin Domain Receptors/metabolism , Threonine/metabolism
4.
Bioinformatics ; 37(9): 1189-1197, 2021 Jun 09.
Article in English | MEDLINE | ID: mdl-33165532

ABSTRACT

MOTIVATION: Cancer somatic driver mutations associated with genes within a pathway often show a mutually exclusive pattern across a cohort of patients. This mutually exclusive mutational signal has been frequently used to distinguish driver from passenger mutations and to investigate relationships among driver mutations. Current methods for de novo discovery of mutually exclusive mutational patterns are limited because the heterogeneity in background mutation rate can confound mutational patterns, and the presence of highly mutated genes can lead to spurious patterns. In addition, most methods only focus on a limited number of pre-selected genes and are unable to perform genome-wide analysis due to computational inefficiency. RESULTS: We introduce a statistical framework, MEScan, for accurate and efficient mutual exclusivity analysis at the genomic scale. Our framework contains a fast and powerful statistical test for mutual exclusivity with adjustment of the background mutation rate and impact of highly mutated genes, and a multi-step procedure for genome-wide screening with the control of false discovery rate. We demonstrate that MEScan more accurately identifies mutually exclusive gene sets than existing methods and is at least two orders of magnitude faster than most methods. By applying MEScan to data from four different cancer types and pan-cancer, we have identified several biologically meaningful mutually exclusive gene sets. AVAILABILITY AND IMPLEMENTATION: MEScan is available as an R package at https://github.com/MarkeyBBSRF/MEScan. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Computational Biology , Neoplasms , Algorithms , Genomics , Humans , Mutation , Neoplasms/genetics
5.
J Proteome Res ; 20(5): 2904-2913, 2021 May 07.
Article in English | MEDLINE | ID: mdl-33830777

ABSTRACT

The gut microbiome generates numerous metabolites that exert local effects and enter the circulation to affect the functions of many organs. Despite extensive sequencing-based characterization of the gut microbiome, there remains a lack of understanding of microbial metabolism. Here, we developed an untargeted stable isotope-resolved metabolomics (SIRM) approach for the holistic study of gut microbial metabolites. Viable microbial cells were extracted from fresh mouse feces and incubated anaerobically with 13C-labeled dietary fibers including inulin or cellulose. High-resolution mass spectrometry was used to monitor 13C enrichment in metabolites associated with glycolysis, the Krebs cycle, the pentose phosphate pathway, nucleotide synthesis, and pyruvate catabolism in both microbial cells and the culture medium. We observed the differential use of inulin and cellulose as substrates for biosynthesis of essential and non-essential amino acids, neurotransmitters, vitamin B5, and other coenzymes. Specifically, the use of inulin for these biosynthetic pathways was markedly more efficient than the use of cellulose, reflecting distinct metabolic pathways of dietary fibers in the gut microbiome, which could be related to host effects. This technology facilitates deeper and holistic insights into the metabolic function of the gut microbiome (Metabolomics Workbench Study ID: ST001651).


Subject(s)
Gastrointestinal Microbiome , Metabolome , Animals , Dietary Fiber , Feces , Isotopes , Metabolomics , Mice
6.
Bioinformatics ; 36(10): 3207-3214, 2020 May 01.
Article in English | MEDLINE | ID: mdl-32065617

ABSTRACT

MOTIVATION: The Gene Ontology (GO) is the unifying biological vocabulary for codifying, managing and sharing biological knowledge. Quality issues in GO, if not addressed, can cause misleading results or missed biological discoveries. Manual identification of potential quality issues in GO is a challenging and arduous task, given its growing size. We introduce an automated auditing approach for suggesting potentially missing is-a relations, which may further reveal erroneous is-a relations. RESULTS: We developed a Subsumption-based Sub-term Inference Framework (SSIF) by leveraging a novel term-algebra on top of a sequence-based representation of GO concepts along with three conditional rules (monotonicity, intersection and sub-concept rules). Applying SSIF to the October 3, 2018 release of GO suggested 1938 unique potentially missing is-a relations. Domain experts evaluated a random sample of 210 potentially missing is-a relations. The results showed SSIF achieved a precision of 60.61, 60.49 and 46.03% for the monotonicity, intersection and sub-concept rules, respectively. AVAILABILITY AND IMPLEMENTATION: SSIF is implemented in Java. The source code is available at https://github.com/rashmie/SSIF. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Software , Gene Ontology
7.
BMC Genomics ; 21(1): 277, 2020 Apr 03.
Article in English | MEDLINE | ID: mdl-32245406

ABSTRACT

BACKGROUND: Spermatogenesis is the process by which germ cells develop into spermatozoa in the testis. Sperm protamines are small, arginine-rich nuclear proteins which replace somatic histones during spermatogenesis, allowing a hypercondensed DNA state that leads to a smaller nucleus and facilitating sperm head formation. In eutherian mammals, the protamine-DNA complex is achieved through a combination of intra- and intermolecular cysteine cross-linking and possibly histidine-cysteine zinc ion binding. Most metatherian sperm protamines lack cysteine but perform the same function. This lack of dicysteine cross-linking has made the mechanism behind metatherian protamines folding unclear. RESULTS: Protamine sequences from UniProt's databases were pulled down and sorted into homologous groups. Multiple sequence alignments were then generated and a gap weighted relative entropy score calculated for each position. For the eutherian alignments, the cysteine containing positions were the most highly conserved. For the metatherian alignment, the tyrosine containing positions were the most highly conserved and corresponded to the cysteine positions in the eutherian alignment. CONCLUSIONS: High conservation indicates likely functionally/structurally important residues at these positions in the metatherian protamines and the correspondence with cysteine positions within the eutherian alignment implies a similarity in function. One possible explanation is that the metatherian protamine structure relies upon dityrosine cross-linking between these highly conserved tyrosines. Also, the human protamine P1 sequence has a tyrosine substitution in a position expecting eutherian dicysteine cross-linking. Similarly, some members of the metatherian Planigales genus contain cysteine substitutions in positions expecting plausible metatherian dityrosine cross-linking. Rare cysteine-tyrosine cross-linking could explain both observations.
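
The per-position conservation scoring described in the RESULTS can be sketched as follows. The abstract does not give the exact formula, so this is only one plausible reading: relative entropy of the observed residue distribution against a uniform background over the 20 standard amino acids, multiplied by the fraction of non-gap residues in the column. The function names and the uniform background are assumptions, not the authors' implementation.

```python
import math
from collections import Counter
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def gap_weighted_relative_entropy(column: str, gap_char: str = "-") -> float:
    """Relative entropy (bits) of one alignment column, scaled by the
    fraction of sequences that are not gapped at this position.

    Assumption: uniform background distribution over 20 amino acids.
    """
    if not column:
        return 0.0
    residues = [c for c in column.upper() if c != gap_char]
    if not residues:
        return 0.0
    background = 1.0 / len(AMINO_ACIDS)
    total = len(residues)
    entropy = 0.0
    for count in Counter(residues).values():
        p = count / total
        entropy += p * math.log2(p / background)
    non_gap_fraction = total / len(column)
    return entropy * non_gap_fraction


def score_alignment(sequences: List[str]) -> List[float]:
    """Score every column of an alignment given as equal-length strings."""
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all aligned sequences must have the same length")
    return [
        gap_weighted_relative_entropy("".join(seq[i] for seq in sequences))
        for i in range(length)
    ]


if __name__ == "__main__":
    toy_alignment = ["CRY", "CK-", "CRY", "C-Y"]  # toy data, not real protamines
    print(score_alignment(toy_alignment))
```

Under this scheme a fully conserved, ungapped column receives the maximum score, and a conserved column with gaps is penalized in proportion to its gap content.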


Subject(s)
Computational Biology/methods , DNA/metabolism , Protamines/chemistry , Protamines/metabolism , Spermatozoa/metabolism , Amino Acid Sequence , Animals , Binding Sites , Conserved Sequence , Cysteine/metabolism , Entropy , Eutheria , Male , Protamines/genetics , Protein Binding , Protein Folding , Sequence Alignment , Tyrosine/analogs & derivatives , Tyrosine/metabolism
8.
BMC Bioinformatics ; 20(1): 524, 2019 Oct 28.
Article in English | MEDLINE | ID: mdl-31660850

ABSTRACT

BACKGROUND: Stable isotope tracing can follow individual atoms through metabolic transformations by detecting the incorporation of stable isotopes into metabolites. The resulting data can be interpreted in terms related to metabolic flux. However, detection of a stable isotope in metabolites by mass spectrometry produces a profile of isotopologue peaks that requires deconvolution to ascertain the localization of isotope incorporation. RESULTS: To aid the interpretation of the mass spectrometry isotopologue profile, we have developed a moiety modeling framework for deconvoluting metabolite isotopologue profiles involving single and multiple isotope tracers. This moiety modeling framework provides facilities for moiety model representation, moiety model optimization, and moiety model selection. The moiety_modeling package was developed from the idea of metabolite decomposition into moiety units based on metabolic transformations, i.e. a moiety model. The SAGA-optimize package, solving a boundary-value inverse problem through a combined simulated annealing and genetic algorithm, was developed for model optimization. Additional optimization methods from the Python scipy library are utilized as well. Several forms of the Akaike information criterion and Bayesian information criterion are provided for selecting between moiety models. Moiety models and associated isotopologue data are defined in a JSONized format. By testing the moiety modeling framework on the timecourses of 13C isotopologue data for uridine diphosphate N-acetyl-D-glucosamine (UDP-GlcNAc) in human prostate cancer LnCaP-LN3 cells, we were able to confirm its robust performance in isotopologue deconvolution and moiety model selection. CONCLUSIONS: SAGA-optimize is a useful Python package for solving boundary-value inverse problems, and the moiety_modeling package is an easy-to-use tool for mass spectrometry isotopologue profile deconvolution involving single and multiple isotope tracers. Both packages are freely available on GitHub and via the Python Package Index.
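
For readers unfamiliar with the model selection criteria named above, the sketch below computes the textbook forms of the Akaike information criterion (AIC), its small-sample correction (AICc), and the Bayesian information criterion (BIC) from a model's maximized log-likelihood. The abstract does not say which forms the moiety_modeling package implements, so the function names here are hypothetical and nothing below is that package's API.

```python
import math
from typing import Dict


def aic(log_likelihood: float, n_params: int) -> float:
    """AIC = 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def aicc(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """AIC with the standard small-sample correction term."""
    denom = n_obs - n_params - 1
    if denom <= 0:
        raise ValueError("AICc requires n_obs > n_params + 1")
    correction = (2 * n_params * (n_params + 1)) / denom
    return aic(log_likelihood, n_params) + correction


def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """BIC = k ln(n) - 2 ln(L)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


def best_model(scores: Dict[str, float]) -> str:
    """Return the model name with the lowest criterion value."""
    return min(scores, key=scores.get)


if __name__ == "__main__":
    # Hypothetical fits: model name -> (log-likelihood, parameter count).
    fits = {"simple": (-14.0, 2), "complex": (-10.0, 6)}
    n_obs = 30
    scores = {name: bic(ll, k, n_obs) for name, (ll, k) in fits.items()}
    print(scores, best_model(scores))
```

Lower values are better for all three criteria; BIC penalizes additional parameters more heavily than AIC once the number of observations exceeds about 8.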


Subject(s)
Metabolomics , Bayes Theorem , Carbon Isotopes/analysis , Cell Line, Tumor , Humans , Isotope Labeling , Male , Mass Spectrometry/methods , Metabolomics/methods , Prostatic Neoplasms
9.
Anal Chem ; 91(14): 8933-8940, 2019 Jul 16.
Article in English | MEDLINE | ID: mdl-31260262

ABSTRACT

Improvements in Fourier transform mass spectrometry (FT-MS) enable increasingly more complex experiments in the field of metabolomics. What is directly detected in FT-MS spectra are spectral features (peaks) that correspond to sets of adducted and charged forms of specific molecules in the sample. The robust assignment of these features is an essential step for MS-based metabolomics experiments, but the sheer complexity of what is detected and a variety of analytically introduced variance, errors, and artifacts has hindered the systematic analysis of complex patterns of observed peaks with respect to isotope content. We have developed a method called SMIRFE that detects small biomolecules and determines their elemental molecular formula (EMF) using detected sets of isotopologue peaks sharing the same EMF. SMIRFE does not use a database of known metabolite formulas; instead a nearly comprehensive search space of all isotopologues within a mass range is constructed and used for assignment. This search space can be tailored for different isotope labeling patterns expected in different stable isotope tracing experiments. Using consumer-level computing equipment, a large search space of 2000 Da was constructed, and assignment performance was evaluated and validated using verified assignments on a pair of peak lists derived from spectra containing unlabeled and 15N-labeled versions of amino acids derivatized using ethylchloroformate. SMIRFE identified 18 of 18 predicted derivatized EMFs, and each assignment was evaluated statistically and assigned an e-value representing the probability to occur by chance.


Subject(s)
Amino Acids/analysis , Mass Spectrometry/methods , Algorithms , Carbon Isotopes/analysis , Fourier Analysis , Isotope Labeling/methods , Metabolomics/methods , Nitrogen Isotopes/analysis
10.
Molecules ; 24(17), 2019 Sep 01.
Article in English | MEDLINE | ID: mdl-31480623

ABSTRACT

As the number of macromolecular structures in the worldwide Protein Data Bank (wwPDB) continues to grow rapidly, more attention is being paid to the quality of its data, especially for use in aggregated structural and dynamics analyses. In this study, we systematically analyzed 3.5 Å regions around all metal ions across all PDB entries with supporting electron density maps available from the PDB in Europe. All resulting metal ion-centric regions were evaluated with respect to four quality-control criteria involving electron density resolution, atom occupancy, symmetry atom exclusion, and regional electron density discrepancy. The resulting list of metal binding sites passing all four criteria possess high regional structural quality and should be beneficial to a wide variety of downstream analyses. This study demonstrates an approach for the pan-PDB evaluation of metal binding site structural quality with respect to underlying X-ray crystallographic experimental data represented in the available electron density maps of proteins. For non-crystallographers in particular, we hope to change the focus and discussion of structural quality from a global evaluation to a regional evaluation, since all structural entries in the wwPDB appear to have both regions of high and low structural quality.
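
The first step described above, collecting the atoms within 3.5 Å of each metal ion, can be sketched with plain coordinates. This is a minimal illustration only: the `Atom` record, the helper names, and the small set of metal elements are assumptions, and none of the four quality-control criteria (which need electron density data) are attempted here.

```python
import math
from typing import Dict, List, NamedTuple


class Atom(NamedTuple):
    name: str
    element: str
    x: float
    y: float
    z: float


# Illustrative subset only; the study covers all metal ions.
METAL_ELEMENTS = {"ZN", "MG", "CA", "FE", "NA"}


def atoms_near(center: Atom, atoms: List[Atom], cutoff: float = 3.5) -> List[Atom]:
    """Return atoms (other than the center itself) within `cutoff` angstroms."""
    origin = (center.x, center.y, center.z)
    return [
        atom for atom in atoms
        if atom is not center
        and math.dist(origin, (atom.x, atom.y, atom.z)) <= cutoff
    ]


def metal_regions(atoms: List[Atom], cutoff: float = 3.5) -> Dict[int, List[Atom]]:
    """Map the list index of each metal ion to its surrounding atoms."""
    return {
        index: atoms_near(atom, atoms, cutoff)
        for index, atom in enumerate(atoms)
        if atom.element.upper() in METAL_ELEMENTS
    }


if __name__ == "__main__":
    # Toy coordinates, not taken from any PDB entry.
    toy = [
        Atom("ZN", "ZN", 0.0, 0.0, 0.0),
        Atom("NE2", "N", 2.1, 0.0, 0.0),
        Atom("OD1", "O", 0.0, 3.0, 0.0),
        Atom("CB", "C", 4.0, 0.0, 0.0),
    ]
    for index, neighbors in metal_regions(toy).items():
        print(toy[index].name, [atom.name for atom in neighbors])
```

A real analysis would also have to generate symmetry-related atoms before the distance search, since the symmetry atom exclusion criterion depends on them.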


Subject(s)
Databases, Protein , Metals/chemistry , Crystallography, X-Ray , Ions , Static Electricity
11.
J Biomol NMR ; 72(1-2): 11-28, 2018 Oct.
Article in English | MEDLINE | ID: mdl-30097912

ABSTRACT

Poor chemical shift referencing, especially for 13C in protein Nuclear Magnetic Resonance (NMR) experiments, fundamentally limits and even prevents effective study of biomacromolecules via NMR, including protein structure determination and analysis of protein dynamics. To solve this problem, we constructed a Bayesian probabilistic framework that circumvents the limitations of previous reference correction methods that required protein resonance assignment and/or three-dimensional protein structure. Our algorithm named Bayesian Model Optimized Reference Correction (BaMORC) can detect and correct 13C chemical shift referencing errors before the protein resonance assignment step of analysis and without three-dimensional structure. By combining the BaMORC methodology with a new intra-peaklist grouping algorithm, we created a combined method called Unassigned BaMORC that utilizes only unassigned experimental peak lists and the amino acid sequence. Unassigned BaMORC kept all experimental three-dimensional HN(CO)CACB-type peak lists tested within ± 0.4 ppm of the correct 13C reference value. On a much larger unassigned chemical shift test set, the base method kept 13C chemical shift referencing errors to within ± 0.45 ppm at a 90% confidence interval. With chemical shift assignments, Assigned BaMORC can detect and correct 13C chemical shift referencing errors to within ± 0.22 ppm at a 90% confidence interval. Therefore, Unassigned BaMORC can correct 13C chemical shift referencing errors when it will have the most impact, right before protein resonance assignment and other downstream analyses are started. After assignment, chemical shift reference correction can be further refined with Assigned BaMORC. These new methods will allow non-NMR experts to detect and correct 13C referencing error at critical early data analysis steps, lowering the bar of NMR expertise required for effective protein NMR analysis.


Subject(s)
Algorithms , Nuclear Magnetic Resonance, Biomolecular/methods , Proteins/chemistry , Bayes Theorem , Carbon Isotopes
12.
Metabolomics ; 14(10): 125, 2018 Sep 17.
Article in English | MEDLINE | ID: mdl-30830442

ABSTRACT

INTRODUCTION: Direct injection Fourier-transform mass spectrometry (FT-MS) allows for the high-throughput and high-resolution detection of thousands of metabolite-associated isotopologues. However, spectral artifacts can generate large numbers of spectral features (peaks) that do not correspond to known compounds. Misassignment of these artifactual features creates interpretive errors and limits our ability to discern the role of representative features within living systems. OBJECTIVES: Our goal is to develop rigorous methods that identify and handle spectral artifacts within the context of high-throughput FT-MS-based metabolomics studies. RESULTS: We observed three types of artifacts unique to FT-MS that we named high peak density (HPD) sites: fuzzy sites, ringing and partial ringing. While ringing artifacts are well-known, fuzzy sites and partial ringing have not been previously well-characterized in the literature. We developed new computational methods based on comparisons of peak density within a spectrum to identify regions of spectra with fuzzy sites. We used these methods to identify and eliminate fuzzy site artifacts in an example dataset of paired cancer and non-cancer lung tissue samples and evaluated the impact of these artifacts on classification accuracy and robustness. CONCLUSION: Our methods robustly identified consistent fuzzy site artifacts in our FT-MS metabolomics spectral data. Without artifact identification and removal, 91.4% classification accuracy was achieved on an example lung cancer dataset; however, these classifiers rely heavily on artifactual features present in fuzzy sites. Proper removal of fuzzy site artifacts produces a more robust classifier based on non-artifactual features, with slightly improved accuracy of 92.4% in our example analysis.
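
The idea of comparing peak density within a spectrum can be sketched as a sliding-window count over a peak list. This is a simplified stand-in and not the authors' method: the window width, the median-based threshold, and the function names are all assumptions chosen for illustration.

```python
import bisect
import statistics
from typing import List, Sequence, Tuple


def window_counts(mz_values: Sequence[float],
                  width: float) -> Tuple[List[float], List[int]]:
    """For each peak, count the peaks lying in [mz, mz + width]."""
    mz = sorted(mz_values)
    counts = []
    for start in mz:
        first = bisect.bisect_left(mz, start)
        last = bisect.bisect_right(mz, start + width)
        counts.append(last - first)
    return mz, counts


def flag_dense_windows(mz_values: Sequence[float],
                       width: float = 1.0,
                       factor: float = 3.0) -> List[Tuple[float, int]]:
    """Flag windows whose peak count exceeds `factor` times the median count.

    Assumption: the median window count represents typical peak density.
    """
    mz, counts = window_counts(mz_values, width)
    if not counts:
        return []
    typical = statistics.median(counts)
    return [(start, count) for start, count in zip(mz, counts)
            if count > factor * typical]


if __name__ == "__main__":
    # Toy peak list: sparse peaks plus a tight cluster just above m/z 150.
    sparse = [100 + 10 * i for i in range(11)]
    cluster = [150.1 + 0.01 * i for i in range(10)]
    for start, count in flag_dense_windows(sparse + cluster):
        print(f"dense window starting at m/z {start}: {count} peaks")
```

Flagged windows mark candidate regions for closer inspection; deciding whether a dense region is an artifact or real signal would need the additional evidence the authors describe.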


Subject(s)
Carcinoma, Non-Small-Cell Lung/metabolism , Fourier Analysis , High-Throughput Screening Assays , Lung Neoplasms/metabolism , Lung/metabolism , Mass Spectrometry , Metabolomics , Carcinoma, Non-Small-Cell Lung/diagnosis , Humans , Lung Neoplasms/diagnosis
13.
BMC Bioinformatics ; 18(1): 175, 2017 Mar 17.
Article in English | MEDLINE | ID: mdl-28302053

ABSTRACT

BACKGROUND: The Biological Magnetic Resonance Data Bank (BMRB) is a public repository of Nuclear Magnetic Resonance (NMR) spectroscopic data of biological macromolecules. It is an important resource for many researchers using NMR to study structural, biophysical, and biochemical properties of biological macromolecules. It is primarily maintained and accessed in a flat file ASCII format known as NMR-STAR. While the format is human readable, the size of most BMRB entries makes computer readability and explicit representation a practical requirement for almost any rigorous systematic analysis. RESULTS: To aid in the use of this public resource, we have developed a package called nmrstarlib in the popular open-source programming language Python. The nmrstarlib implementation is very efficient, both in design and execution. The library has facilities for reading and writing both NMR-STAR version 2.1 and 3.1 formatted files, parsing them into usable Python dictionary- and list-based data structures, making access and manipulation of the experimental data very natural within Python programs (i.e. "saveframe" and "loop" records represented as individual Python dictionary data structures). Another major advantage of this design is that data stored in original NMR-STAR can be easily converted into its equivalent JavaScript Object Notation (JSON) format, a lightweight data interchange format, facilitating data access and manipulation using Python and any other programming language that implements a JSON parser/generator (i.e., all popular programming languages). We have also developed tools to visualize assigned chemical shift values and to convert between NMR-STAR and JSONized NMR-STAR formatted files. Full API Reference Documentation, User Guide and Tutorial with code examples are also available. We have tested this new library on all current BMRB entries: 100% of all entries are parsed without any errors for both NMR-STAR version 2.1 and version 3.1 formatted files. We also compared our software to three currently available Python libraries for parsing NMR-STAR formatted files: PyStarLib, NMRPyStar, and PyNMRSTAR. CONCLUSIONS: The nmrstarlib package is a simple, fast, and efficient library for accessing data from the BMRB. The library provides an intuitive dictionary-based interface with which Python programs can read, edit, and write NMR-STAR formatted files and their equivalent JSONized NMR-STAR files. The nmrstarlib package can be used as a library for accessing and manipulating data stored in NMR-STAR files and as a command-line tool to convert from NMR-STAR file format into its equivalent JSON file format and vice versa, and to visualize chemical shift values. Furthermore, the nmrstarlib implementation provides a guide for effectively JSONizing other older scientific formats, improving the FAIRness of data in these formats.
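
The benefit of a dictionary- and list-based representation is that it serializes to JSON without custom code. The sketch below shows the idea with Python's standard `json` module only. The nested layout, the key names, and the shift values are placeholders invented for illustration; they are not nmrstarlib's actual data structure or data from a BMRB entry, and no nmrstarlib call is used.

```python
import json

# Hypothetical, simplified stand-in for a parsed entry: one saveframe
# holding a tag/value pair and one loop of assigned chemical shifts.
entry = {
    "data": "example_entry",
    "save_assigned_chemical_shifts": {
        "Assigned_chem_shift_list.ID": "1",
        "loop_0": [
            ["Atom_chem_shift.Atom_ID", "Atom_chem_shift.Val"],
            [
                {"Atom_chem_shift.Atom_ID": "CA", "Atom_chem_shift.Val": "56.2"},
                {"Atom_chem_shift.Atom_ID": "CB", "Atom_chem_shift.Val": "30.1"},
            ],
        ],
    },
}


def to_json(parsed_entry: dict) -> str:
    """Serialize the nested dict/list structure to a JSON string."""
    return json.dumps(parsed_entry, indent=2)


def from_json(text: str) -> dict:
    """Load the JSON string back into dicts and lists."""
    return json.loads(text)


if __name__ == "__main__":
    text = to_json(entry)
    print(text)
    # Strings, dicts, and lists survive the round trip unchanged.
    print(from_json(text) == entry)
```

Keeping every value as a string, as in this sketch, avoids any loss of precision or formatting when converting back to the flat-file format.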


Subject(s)
Databases, Factual , Software , Magnetic Resonance Spectroscopy
14.
Proteins ; 85(5): 885-907, 2017 May.
Article in English | MEDLINE | ID: mdl-28142195

ABSTRACT

Metalloproteins bind and utilize metal ions for a variety of biological purposes. Due to the ubiquity of metalloprotein involvement throughout these processes across all domains of life, how proteins coordinate metal ions for different biochemical functions is of great relevance to understanding the implementation of these biological processes. Toward these ends, we have improved our methodology for structurally and functionally characterizing metal binding sites in metalloproteins. Our new ligand detection method is statistically much more robust, producing estimated false positive and false negative rates of ∼0.11% and ∼1.2%, respectively. Additional improvements expand both the range of metal ions and their coordination number that can be effectively analyzed. Also, the inclusion of additional quality control filters has significantly improved structure-function Spearman correlations as demonstrated by rho values greater than 0.90 for several metal coordination analyses and even one rho value above 0.95. Also, improvements in bond-length distributions have revealed bond-length modes specific to chemical functional groups involved in multidentation. Using these improved methods, we analyzed all single metal ion binding sites with Zn, Mg, Ca, Fe, and Na ions in the wwPDB, producing statistically rigorous results supporting the existence of both a significant number of unexpected compressed angles and subsequent aberrant metal ion coordination geometries (CGs) within structurally known metalloproteins. By recognizing these aberrant CGs in our clustering analyses, high correlations are achieved between structural and functional descriptions of metal ion coordination. Moreover, distinct biochemical functions are associated with aberrant CGs versus nonaberrant CGs. Proteins 2017; 85:885-907. © 2016 Wiley Periodicals, Inc.


Subject(s)
Calcium/chemistry , Coordination Complexes/chemistry , Iron/chemistry , Magnesium/chemistry , Metalloproteins/chemistry , Sodium/chemistry , Zinc/chemistry , Binding Sites , Cations, Divalent , Cations, Monovalent , Cluster Analysis , Databases, Protein , Protein Binding , Protein Conformation , Proteomics/methods
15.
Proteins ; 85(5): 938-944, 2017 May.
Article in English | MEDLINE | ID: mdl-28168746

ABSTRACT

Recent papers highlight the presence of large numbers of compressed angles in metal ion coordination geometries for metalloprotein entries in the worldwide Protein Data Bank, due mainly to multidentate coordination. The prevalence of these compressed angles has raised the controversial idea that significantly populated aberrant or even novel coordination geometries may exist. Some of these papers have undergone severe criticism, apparently due to views held that only canonical coordination geometries exist in significant numbers. While criticism of controversial ideas is warranted and to be expected, we believe that a line was crossed where unfair criticism was put forth to discredit an inconvenient result that compressed angles exist in large numbers, which does not support the dogmatic canonical coordination geometry view. We present a review of the major controversial results and their criticisms, pointing out both good suggestions that have been incorporated in new analyses, but also unfair criticism that was put forth to support a particular view. We also suggest that better science is enabled through: (i) a more collegial and collaborative approach in future critical reviews and (ii) the requirement for a description of methods and data including source code and visualizations that enables full reproducibility of results. Proteins 2017; 85:938-944. © 2016 Wiley Periodicals, Inc.


Subject(s)
Computational Biology/ethics , Coordination Complexes/chemistry , Information Dissemination/ethics , Metalloproteins/chemistry , Metals/chemistry , Computational Biology/trends , Databases, Protein , Humans , Protein Conformation , Reproducibility of Results
16.
J Biomol NMR ; 68(4): 281-296, 2017 Aug.
Article in English | MEDLINE | ID: mdl-28815397

ABSTRACT

Peak lists derived from nuclear magnetic resonance (NMR) spectra are commonly used as input data for a variety of computer assisted and automated analyses. These include automated protein resonance assignment and protein structure calculation software tools. Prior to these analyses, peak lists must be aligned to each other and sets of related peaks must be grouped based on common chemical shift dimensions. Even when programs can perform peak grouping, they require the user to provide uniform match tolerances or use default values. However, peak grouping is further complicated by multiple sources of variance in peak position limiting the effectiveness of grouping methods that utilize uniform match tolerances. In addition, no method currently exists for deriving peak positional variances from single peak lists for grouping peaks into spin systems, i.e. spin system grouping within a single peak list. Therefore, we developed a complementary pair of peak list registration analysis and spin system grouping algorithms designed to overcome these limitations. We have implemented these algorithms into an approach that can identify multiple dimension-specific positional variances that exist in a single peak list and group peaks from a single peak list into spin systems. The resulting software tools generate a variety of useful statistics on both a single peak list and pairwise peak list alignment, especially for quality assessment of peak list datasets. We used a range of low and high quality experimental solution NMR and solid-state NMR peak lists to assess performance of our registration analysis and grouping algorithms. Analyses show that an algorithm using a single iteration and uniform match tolerances approach is only able to recover from 50 to 80% of the spin systems due to the presence of multiple sources of variance. Our algorithm recovers additional spin systems by reevaluating match tolerances in multiple iterations. 
To facilitate evaluation of the algorithms, we developed a peak list simulator within our nmrstarlib package that generates user-defined assigned peak lists from a given BMRB entry or database of entries. In addition, over 100,000 simulated peak lists with one or two sources of variance were generated to evaluate the performance and robustness of these new registration analysis and peak grouping algorithms.
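
The single-iteration, uniform-tolerance baseline that the abstract contrasts against can be illustrated with a minimal sketch. This is not the authors' registration or grouping algorithm. The function name `group_peaks`, the greedy anchor-matching strategy, and the toy chemical shifts are all assumptions made for illustration.

```python
def group_peaks(peaks, tolerances):
    """Greedy single-pass grouping (hypothetical baseline, not the published method).

    peaks      -- list of tuples holding the shared dimensions, e.g. (1H, 15N) shifts in ppm
    tolerances -- one match tolerance per dimension, applied uniformly to every peak
    A peak joins the first group whose anchor (first peak) it matches within
    tolerance in every dimension; otherwise it starts a new group.
    """
    groups = []
    for peak in peaks:
        for group in groups:
            anchor = group[0]
            if all(abs(a - b) <= tol for a, b, tol in zip(anchor, peak, tolerances)):
                group.append(peak)
                break
        else:
            # No existing group matched within tolerance.
            groups.append([peak])
    return groups


# Toy peaks: the last peak deviates more strongly in the second dimension.
toy_peaks = [(8.20, 120.1), (8.21, 120.2), (7.50, 115.0), (7.51, 115.3)]
tight = group_peaks(toy_peaks, tolerances=(0.02, 0.2))
loose = group_peaks(toy_peaks, tolerances=(0.02, 0.4))
print(len(tight), len(loose))
```

With the tighter tolerance, the last toy peak is split from its partner. This is the kind of loss a single uniform tolerance can cause when positional variance differs across peaks or dimensions.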


Subject(s)
Nuclear Magnetic Resonance, Biomolecular/methods , Proteins/chemistry , Algorithms , Models, Molecular , Protein Conformation , Proteins/analysis , Software , Solutions
17.
Proteins ; 83(8): 1470-87, 2015 Aug.
Article in English | MEDLINE | ID: mdl-26009987

ABSTRACT

Zinc metalloproteins are involved in many biological processes and play crucial biochemical roles across all domains of life. Local structure around the zinc ion, especially the coordination geometry (CG), is dictated by the protein sequence and is often directly related to the function of the protein. Current methodologies for characterizing the CG of zinc metalloproteins consider only previously reported CG models based mainly on nonbiological chemical context. Exceptions to these canonical CG models are either misclassified or discarded as "outliers." Thus, we developed a less-biased method that directly handles potential exceptions without pre-assuming any CG model. Our study shows that numerous exceptions could actually be further classified and that new CG models are needed to characterize them. Also, these new CG models are cross-validated by strong correlation between independent structural and functional annotation distance metrics, which is partially lost if these new CG models are ignored. Furthermore, these new CG models exhibit functional propensities distinct from the canonical CG models.


Subject(s)
Computational Biology/methods , Metalloproteins/chemistry , Zinc/chemistry , Algorithms , Metalloproteins/metabolism , Structure-Activity Relationship , Zinc/metabolism
18.
Magn Reson Chem ; 53(5): 337-43, 2015 May.
Article in English | MEDLINE | ID: mdl-25616249

ABSTRACT

NMR spectra of mixtures of metabolites extracted from cells or tissues are extremely complex, reflecting the large number of compounds that are present over a wide range of concentrations. Although multidimensional NMR can greatly improve resolution as well as improve reliability of compound assignments, lower abundance metabolites often remain hidden. We have developed a carbonyl-selective aminooxy probe that specifically reacts with free keto and aldehyde functions, but not carboxylates. By incorporating (15)N in the aminooxy functional group, (15)N-edited NMR was used to select exclusively those metabolites that contain a free carbonyl function while all other metabolites are rejected. Here, we demonstrate that the chemical shifts of the aminooxy adducts of ketones and aldehydes are very different, which can be used to discriminate between aldoses and ketoses, for example. Utilizing the 2-bond or 3-bond (15)N-(1)H couplings, the (15)N-edited NMR analysis was optimized first with authentic standards and then applied to an extract of the lung adenocarcinoma cell line A549. We observed more than 30 carbonyl-containing compounds at NMR-detectable levels, six of which we have assigned by reference to our database. As the aminooxy probe contains a permanently charged quaternary ammonium group, the adducts are also optimized for detection by mass spectrometry. Thus, this sample preparation technique provides a better link between the two structural determination tools, thereby paving the way to faster and more reliable identification of both known and unknown metabolites directly in crude biological extracts.


Subject(s)
Aldehydes/metabolism , Ketones/metabolism , Lung Neoplasms/metabolism , Proton Magnetic Resonance Spectroscopy/methods , Aldehydes/chemistry , Cell Line, Tumor , Humans , Ketones/chemistry , Lung Neoplasms/chemistry , Molecular Probe Techniques , Nitrogen Isotopes/analysis , Nitrogen Isotopes/chemistry , Reproducibility of Results , Sensitivity and Specificity
19.
PLoS One ; 19(5): e0299583, 2024.
Article in English | MEDLINE | ID: mdl-38696410

ABSTRACT

The mapping of metabolite-specific data to pathways within cellular metabolism is a major data analysis step needed for biochemical interpretation. A variety of machine learning approaches, particularly deep learning approaches, have been used to predict these metabolite-to-pathway mappings, utilizing a training dataset of known metabolite-to-pathway mappings. A few such training datasets have been derived from the Kyoto Encyclopedia of Genes and Genomes (KEGG). However, several prior published machine learning approaches utilized an erroneous KEGG-derived training dataset that used SMILES molecular representation strings (KEGG-SMILES dataset) and contained a sizable proportion (~26%) of duplicate entries. The presence of so many duplicates taints the training and testing sets generated from k-fold cross-validation of the KEGG-SMILES dataset. Therefore, the k-fold cross-validation performance of the resulting machine learning models was grossly inflated by the erroneous presence of these duplicate entries. Here we describe and evaluate the KEGG-SMILES dataset so that others may avoid using it. We also identify the prior publications that utilized this erroneous KEGG-SMILES dataset so their machine learning results can be properly and critically evaluated. In addition, we demonstrate the reduction of model k-fold cross-validation (CV) performance after de-duplicating the KEGG-SMILES dataset. This is a cautionary tale about properly vetting prior published benchmark datasets before using them in machine learning approaches. We hope others will avoid similar mistakes.
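
How duplicate entries contaminate k-fold cross-validation can be shown with a small, self-contained sketch. It does not reproduce the paper's analysis or use the KEGG-SMILES dataset. The toy entries, the helper names (`kfold_indices`, `leaked_fraction`, `deduplicate`), and the choice of 74 unique entries plus 26 repeats are assumptions made for illustration.

```python
import random


def kfold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def leaked_fraction(entries, k=5, seed=0):
    """Mean fraction of test entries that also appear verbatim in the training split."""
    folds = kfold_indices(len(entries), k, seed)
    fractions = []
    for test_idx in folds:
        test_set = set(test_idx)
        train = {entries[i] for i in range(len(entries)) if i not in test_set}
        leaked = sum(entries[i] in train for i in test_idx)
        fractions.append(leaked / len(test_idx))
    return sum(fractions) / k


def deduplicate(entries):
    """Drop exact duplicates, keeping the first occurrence and the original order."""
    return list(dict.fromkeys(entries))


# Hypothetical toy data: (SMILES-like string, label) pairs, 26 of 100 rows repeated.
unique = [(f"C{i}", i % 2) for i in range(74)]
toy = unique + unique[:26]

print("leak before de-duplication:", leaked_fraction(toy))
print("leak after de-duplication: ", leaked_fraction(deduplicate(toy)))
```

Any test entry with an identical copy in the training split can be memorized rather than predicted, which is why performance measured on the duplicated data is inflated.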


Subject(s)
Metabolic Networks and Pathways , Supervised Machine Learning , Humans , Datasets as Topic
20.
Metabolites ; 14(5)2024 May 07.
Article in English | MEDLINE | ID: mdl-38786743

ABSTRACT

A major limitation of most metabolomics datasets is the sparsity of pathway annotations for detected metabolites. It is common for less than half of the identified metabolites in these datasets to have a known metabolic pathway involvement. Trying to address this limitation, machine learning models have been developed to predict the association of a metabolite with a "pathway category", as defined by a metabolic knowledge base like KEGG. Past models were implemented as a single binary classifier specific to a single pathway category, requiring a set of binary classifiers for generating the predictions for multiple pathway categories. This past approach multiplied the computational resources necessary for training while diluting the positive entries in the gold standard datasets needed for training. To address these limitations, we propose a generalization of the metabolic pathway prediction problem using a single binary classifier that accepts the features both representing a metabolite and representing a pathway category and then predicts whether the given metabolite is involved in the corresponding pathway category. We demonstrate that this metabolite-pathway features pair approach not only outperforms the combined performance of training separate binary classifiers but demonstrates an order of magnitude improvement in robustness: a Matthews correlation coefficient of 0.784 ± 0.013 versus 0.768 ± 0.154.
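
The metabolite-pathway pairing described above, and the Matthews correlation coefficient used to score it, can be sketched in a few lines. This is an illustrative outline only, not the authors' implementation. The function names `build_pairs` and `mcc`, the feature vectors, and the identifiers in the toy data are hypothetical.

```python
import math
from itertools import product


def build_pairs(metabolite_features, pathway_features, known_links):
    """Build one training row per (metabolite, pathway category) pair.

    Each row concatenates the metabolite features with the pathway features;
    the label is 1 when the pair is a known association, else 0.
    """
    X, y = [], []
    for (m, mf), (p, pf) in product(metabolite_features.items(),
                                    pathway_features.items()):
        X.append(list(mf) + list(pf))
        y.append(1 if (m, p) in known_links else 0)
    return X, y


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0/1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Conventionally report 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical toy inputs: 2 metabolites, 3 pathway categories.
metabolites = {"m1": [0.1, 0.9], "m2": [0.7, 0.2]}
pathways = {"p1": [1, 0, 0], "p2": [0, 1, 0], "p3": [0, 0, 1]}
links = {("m1", "p1"), ("m2", "p3")}

X, y = build_pairs(metabolites, pathways, links)
print(len(X), len(X[0]), y)
```

Because every pathway category contributes rows to the same dataset, a single binary classifier can be trained on `X` and `y`, and the positive examples are pooled rather than split across per-category models.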
