ABSTRACT
Metformin is the first-line therapy for treating type 2 diabetes and a promising anti-aging drug. We set out to address the fundamental question of how gut microbes and nutrition, key regulators of host physiology, influence the effects of metformin. Combining two tractable genetic models, the bacterium E. coli and the nematode C. elegans, we developed a high-throughput four-way screen to define the underlying host-microbe-drug-nutrient interactions. We show that microbes integrate cues from metformin and the diet through the phosphotransferase signaling pathway that converges on the transcriptional regulator Crp. A detailed experimental characterization of metformin effects downstream of Crp in combination with metabolic modeling of the microbiota in metformin-treated type 2 diabetic patients predicts the production of microbial agmatine, a regulator of metformin effects on host lipid metabolism and lifespan. Our high-throughput screening platform paves the way for identifying exploitable drug-nutrient-microbiome interactions to improve host health and longevity through targeted microbiome therapies.
Subject(s)
Diabetes Mellitus, Type 2/drug therapy , Gastrointestinal Microbiome/drug effects , Host Microbial Interactions/drug effects , Hypoglycemic Agents/therapeutic use , Metformin/therapeutic use , Agmatine/metabolism , Animals , Caenorhabditis elegans/microbiology , Cyclic AMP Receptor Protein , Escherichia coli/drug effects , Escherichia coli/genetics , Humans , Hypoglycemic Agents/pharmacology , Lipid Metabolism/drug effects , Longevity/drug effects , Metformin/pharmacology , Nutrients/metabolism
ABSTRACT
Fluoropyrimidines are the first-line treatment for colorectal cancer, but their efficacy is highly variable between patients. We queried whether gut microbes, a known source of inter-individual variability, impacted drug efficacy. Combining two tractable genetic models, the bacterium E. coli and the nematode C. elegans, we performed three-way high-throughput screens that unraveled the complexity underlying host-microbe-drug interactions. We report that microbes can bolster or suppress the effects of fluoropyrimidines through metabolic drug interconversion involving bacterial vitamin B6, B9, and ribonucleotide metabolism. Also, disturbances in bacterial deoxynucleotide pools amplify 5-FU-induced autophagy and cell death in host cells, an effect regulated by the nucleoside diphosphate kinase ndk-1. Our data suggest a two-way bacterial mediation of fluoropyrimidine effects on host metabolism, which contributes to drug efficacy. These findings highlight the potential therapeutic power of manipulating intestinal microbiota to ensure host metabolic health and treat disease.
Subject(s)
Antineoplastic Agents/metabolism , Escherichia coli/metabolism , Fluorouracil/metabolism , Gastrointestinal Microbiome , Animals , Autophagy , Caenorhabditis elegans , Cell Death , Colorectal Neoplasms/drug therapy , Diet , Escherichia coli/enzymology , Escherichia coli/genetics , Humans , Models, Animal , Pentosyltransferases/genetics
ABSTRACT
MOTIVATION: Liquid Chromatography Tandem Mass Spectrometry experiments aim to produce high-quality fragmentation spectra, which can be used to annotate metabolites. However, current Data-Dependent Acquisition approaches may fail to collect spectra of sufficient quality and quantity for experimental outcomes, and extend poorly across multiple samples by failing to share information across samples or by requiring manual expert input. RESULTS: We present TopNEXt, a real-time scan prioritization framework that improves data acquisition in multi-sample Liquid Chromatography Tandem Mass Spectrometry metabolomics experiments. TopNEXt extends traditional Data-Dependent Acquisition exclusion methods across multiple samples by using a Region of Interest and intensity-based scoring system. Through both simulated and lab experiments, we show that methods incorporating these novel concepts acquire fragmentation spectra for an additional 10% of our set of target peaks and with an additional 20% of acquisition intensity. By increasing the quality and quantity of fragmentation spectra, TopNEXt can help improve metabolite identification with a potential impact across a variety of experimental contexts. AVAILABILITY AND IMPLEMENTATION: TopNEXt is implemented as part of the ViMMS framework and the latest version can be found at https://github.com/glasgowcompbio/vimms. A stable version used to produce our results can be found at 10.5281/zenodo.7468914.
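As background for the "traditional Data-Dependent Acquisition exclusion methods" mentioned above, the following is a minimal sketch of Top-N precursor selection with a dynamic exclusion window. It is not the TopNEXt scoring system or the ViMMS API. The class `ExclusionList`, the function `select_precursors`, and the tolerance and window values are all hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ExclusionList:
    """Remembers recently fragmented m/z values for a fixed time window."""
    mz_tol: float = 0.01      # hypothetical absolute m/z tolerance (Da)
    rt_window: float = 15.0   # hypothetical exclusion duration (seconds)
    items: list = field(default_factory=list)  # (mz, expiry_rt) pairs

    def is_excluded(self, mz, rt):
        # A peak is excluded if it is close in m/z to an entry that has not expired.
        return any(abs(mz - m) <= self.mz_tol and rt <= expiry
                   for m, expiry in self.items)

    def add(self, mz, rt):
        self.items.append((mz, rt + self.rt_window))


def select_precursors(peaks, rt, exclusion, top_n=2, min_intensity=0.0):
    """Pick up to top_n most intense, non-excluded peaks from one MS1 scan.

    peaks is a list of (mz, intensity) tuples observed at retention time rt.
    """
    chosen = []
    for mz, intensity in sorted(peaks, key=lambda p: p[1], reverse=True):
        if len(chosen) == top_n:
            break
        if intensity < min_intensity or exclusion.is_excluded(mz, rt):
            continue
        chosen.append(mz)
        exclusion.add(mz, rt)  # do not fragment this ion again until the window expires
    return chosen


# Toy scans: the same three ions are seen at rt = 0 s and rt = 5 s.
exclusion = ExclusionList()
scan = [(100.0, 5e5), (200.0, 9e5), (300.0, 1e5)]
first = select_precursors(scan, rt=0.0, exclusion=exclusion)
second = select_precursors(scan, rt=5.0, exclusion=exclusion)
```

In this toy run the two most intense ions are fragmented in the first scan, so the second scan falls through to the weaker ion at m/z 300. A single-sample exclusion list like this one is discarded between injections. The abstract describes extending exclusion across samples, which this sketch does not attempt.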
Subject(s)
Metabolomics , Mass Spectrometry/methods , Chromatography, Liquid/methods , Metabolomics/methods
ABSTRACT
MOTIVATION: High-throughput gene expression can be used to address a wide range of fundamental biological problems, but datasets of an appropriate size are often unavailable. Moreover, existing transcriptomics simulators have been criticized because they fail to emulate key properties of gene expression data. In this article, we develop a method based on a conditional generative adversarial network to generate realistic transcriptomics data for Escherichia coli and humans. We assess the performance of our approach across several tissues and cancer-types. RESULTS: We show that our model preserves several gene expression properties significantly better than widely used simulators, such as SynTReN or GeneNetWeaver. The synthetic data preserve tissue- and cancer-specific properties of transcriptomics data. Moreover, it exhibits real gene clusters and ontologies both at local and global scales, suggesting that the model learns to approximate the gene expression manifold in a biologically meaningful way. AVAILABILITY AND IMPLEMENTATION: Code is available at: https://github.com/rvinas/adversarial-gene-expression. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Escherichia coli , Gene Expression Profiling , Humans , Gene Expression Profiling/methods , Gene Expression
ABSTRACT
IκB kinase ε (IKKε) is a key molecule at the crossroads of inflammation and cancer. Known to regulate cytokine secretion via NFκB and IRF3, the kinase is also a breast cancer oncogene, overexpressed in a variety of tumours. However, to what extent IKKε remodels cellular metabolism is currently unknown. Here, we used metabolic tracer analysis to show that IKKε orchestrates a complex metabolic reprogramming that affects mitochondrial metabolism and consequently serine biosynthesis independently of its canonical signalling role. We found that IKKε upregulates the serine biosynthesis pathway (SBP) indirectly, by limiting glucose-derived pyruvate utilisation in the TCA cycle, inhibiting oxidative phosphorylation. Inhibition of mitochondrial function induces activating transcription factor 4 (ATF4), which in turn drives upregulation of the expression of SBP genes. Importantly, pharmacological reversal of the IKKε-induced metabolic phenotype reduces proliferation of breast cancer cells. Finally, we show that in a highly proliferative set of ER negative, basal breast tumours, IKKε and PSAT1 are both overexpressed, corroborating the link between IKKε and the SBP in the clinical context.
Subject(s)
Breast Neoplasms , I-kappa B Kinase , Mitochondria , Serine/biosynthesis , Breast Neoplasms/genetics , Female , Humans , I-kappa B Kinase/genetics , Mitochondria/genetics , Mitochondria/metabolism , Oncogenes/genetics
ABSTRACT
The potential to understand fundamental biological processes from gene expression data has grown in parallel with the recent explosion of the size of data collections. However, to exploit this potential, novel analytical methods are required, capable of discovering large co-regulated gene networks. We found current methods limited in the size of correlated gene sets they could discover within biologically heterogeneous data collections, hampering the identification of multi-gene controlled fundamental cellular processes such as energy metabolism, organelle biogenesis and stress responses. Here we describe a novel biclustering algorithm called Massively Correlated Biclustering (MCbiclust) that selects samples and genes from large datasets with maximal correlated gene expression, allowing regulation of complex networks to be examined. The method has been evaluated using synthetic data and applied to large bacterial and cancer cell datasets. We show that the large biclusters discovered, so far elusive to identification by existing techniques, are biologically relevant and thus MCbiclust has great potential in the analysis of transcriptomics data to identify large-scale unknown effects hidden within the data. The identified massive biclusters can be used to develop improved transcriptomics based diagnosis tools for diseases caused by altered gene expression, or used for further network analysis to understand genotype-phenotype correlations.
Subject(s)
Algorithms , Datasets as Topic , Gene Expression Profiling , Gene Regulatory Networks/physiology , High-Throughput Nucleotide Sequencing , Neoplasms/genetics , Cluster Analysis , Databases, Genetic , Gene Expression Profiling/methods , Gene Expression Profiling/statistics & numerical data , Gene Expression Regulation , Genes, Regulator , Genetic Association Studies/methods , Genetic Association Studies/statistics & numerical data , High-Throughput Nucleotide Sequencing/methods , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans , Oligonucleotide Array Sequence Analysis/methods , Oligonucleotide Array Sequence Analysis/statistics & numerical data , Phenotype
ABSTRACT
Response to antidepressant (AD) treatment may be a more polygenic trait than previously hypothesized, with many genetic variants interacting in yet unclear ways. In this study we used methods that can automatically learn to detect patterns of statistical regularity from a sparsely distributed signal across hippocampal transcriptome measurements in a large-scale animal pharmacogenomic study to uncover genomic variations associated with AD. The study used four inbred mouse strains of both sexes, two drug treatments, and a control group (escitalopram, nortriptyline, and saline). Multi-class and binary classification using Machine Learning (ML) and regularization algorithms using iterative and univariate feature selection methods, including InfoGain, mRMR, ANOVA, and Chi Square, were used to uncover genomic markers associated with AD response. Relevant genes were selected based on Jaccard distance and carried forward for gene-network analysis. Linear association methods uncovered only one gene associated with drug treatment response. The implementation of ML algorithms, together with feature reduction methods, revealed a set of 204 genes associated with SSRI and 241 genes associated with NRI response. Although only 10% of genes overlapped across the two drugs, network analysis shows that both drugs modulated the CREB pathway, through different molecular mechanisms. Through careful implementation and optimisations, the algorithms detected a weak signal used to predict whether an animal was treated with nortriptyline (77%) or escitalopram (67%) on an independent testing set. The results from this study indicate that the molecular signature of AD treatment may include a much broader range of genomic markers than previously hypothesized, suggesting that response to medication may be as complex as the pathology. The search for biomarkers of antidepressant treatment response could therefore consider a higher number of genetic markers and their interactions. 
Through predominately different molecular targets and mechanisms of action, the two drugs modulate the same Creb1 pathway which plays a key role in neurotrophic responses and in inflammatory processes. © 2016 The Authors. American Journal of Medical Genetics Part B: Neuropsychiatric Genetics Published by Wiley Periodicals, Inc.
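The abstract states that relevant genes were selected based on Jaccard distance. As a minimal sketch of that measure only (not the authors' selection pipeline), the distance between two gene sets is one minus the ratio of shared genes to all genes in either set. The function name `jaccard_distance` and the placeholder gene identifiers are hypothetical.

```python
def jaccard_distance(a, b):
    """Return 1 - |A ∩ B| / |A ∪ B| for two collections of gene identifiers."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here: two empty sets are identical
    return 1.0 - len(a & b) / len(union)


# Placeholder gene lists, e.g. the outputs of two feature selection methods.
genes_method_a = {"g1", "g2", "g3", "g4"}
genes_method_b = {"g3", "g4", "g5"}
distance = jaccard_distance(genes_method_a, genes_method_b)  # 2 shared of 5 total
```

A distance near 0 means two selection methods largely agree on which genes matter, while a distance near 1 means they picked almost disjoint sets.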
Subject(s)
Antidepressive Agents/therapeutic use , Serotonin and Noradrenaline Reuptake Inhibitors/pharmacology , Animals , Citalopram/therapeutic use , Cyclic AMP Response Element-Binding Protein , Depression/drug therapy , Depressive Disorder/drug therapy , Depressive Disorder/genetics , Disease Models, Animal , Female , Hippocampus , Male , Mice , Multifactorial Inheritance/genetics , Nortriptyline/therapeutic use , Pharmacogenetics , Selective Serotonin Reuptake Inhibitors/therapeutic use , Serotonin and Noradrenaline Reuptake Inhibitors/therapeutic use , Transcriptome/genetics , Treatment Outcome
ABSTRACT
Automated annotation of protein function is challenging. As the number of sequenced genomes rapidly grows, the overwhelming majority of protein products can only be annotated computationally. If computational predictions are to be relied upon, it is crucial that the accuracy of these methods be high. Here we report the results from the first large-scale community-based critical assessment of protein function annotation (CAFA) experiment. Fifty-four methods representing the state of the art for protein function prediction were evaluated on a target set of 866 proteins from 11 organisms. Two findings stand out: (i) today's best protein function prediction algorithms substantially outperform widely used first-generation methods, with large gains on all types of targets; and (ii) although the top methods perform well enough to guide experiments, there is considerable need for improvement of currently available tools.
Subject(s)
Computational Biology/methods , Molecular Biology/methods , Molecular Sequence Annotation , Proteins/physiology , Algorithms , Animals , Databases, Protein , Exoribonucleases/classification , Exoribonucleases/genetics , Exoribonucleases/physiology , Forecasting , Humans , Proteins/chemistry , Proteins/classification , Proteins/genetics , Species Specificity
ABSTRACT
Here, we present the new UCL Bioinformatics Group's PSIPRED Protein Analysis Workbench. The Workbench unites all of our previously available analysis methods into a single web-based framework. The new web portal provides a greatly streamlined user interface with a number of new features to allow users to better explore their results. We offer a number of additional services to enable computationally scalable execution of our prediction methods; these include SOAP and XML-RPC web server access and new HADOOP packages. All software and services are available via the UCL Bioinformatics Group website at http://bioinf.cs.ucl.ac.uk/.
Subject(s)
Protein Conformation , Software , Animals , Internet , Membrane Proteins/chemistry , Mice , Proteins/chemistry , Sequence Analysis, Protein , Structural Homology, Protein
ABSTRACT
MOTIVATION: Linkage analysis remains an important tool in elucidating the genetic component of disease and has become even more important with the advent of whole exome sequencing, enabling the user to focus on only those genomic regions co-segregating with Mendelian traits. Unfortunately, methods to perform multipoint linkage analysis scale poorly with either the number of markers or with the size of the pedigree. Large pedigrees with many markers can only be evaluated with Markov chain Monte Carlo (MCMC) methods that are slow to converge and, as no attempts have been made to exploit parallelism, massively underuse available processing power. Here, we describe SWIFTLINK, a novel application that performs MCMC linkage analysis by spreading the computational burden between multiple processor cores and a graphics processing unit (GPU) simultaneously. SWIFTLINK was designed around the concept of explicitly matching the characteristics of an algorithm with the underlying computer architecture to maximize performance. RESULTS: We implement our approach using existing Gibbs samplers redesigned for parallel hardware. We applied SWIFTLINK to a real-world dataset, performing parametric multipoint linkage analysis on a highly consanguineous pedigree with EAST syndrome, containing 28 members, where a subset of individuals were genotyped with single nucleotide polymorphisms (SNPs). In our experiments with a four-core CPU and GPU, SWIFTLINK achieves an 8.5× speed-up over the single-threaded version and a 109× speed-up over the popular linkage analysis program SIMWALK. AVAILABILITY: SWIFTLINK is available at https://github.com/ajm/swiftlink. All source code is licensed under GPLv3.
Subject(s)
Genetic Linkage , Software , Algorithms , Genomics , Hearing Loss, Sensorineural/genetics , Humans , Intellectual Disability/genetics , Markov Chains , Monte Carlo Method , Pedigree , Polymorphism, Single Nucleotide , Seizures/genetics
ABSTRACT
Adaptive metabolic switches are proposed to underlie conversions between cellular states during normal development as well as in cancer evolution. Metabolic adaptations represent important therapeutic targets in tumors, highlighting the need to characterize the full spectrum, characteristics, and regulation of the metabolic switches. To investigate the hypothesis that metabolic switches associated with specific metabolic states can be recognized by locating large alternating gene expression patterns, we developed a method to identify interspersed gene sets by massive correlated biclustering and to predict their metabolic wiring. Testing the method on breast cancer transcriptome datasets revealed a series of gene sets with switch-like behavior that could be used to predict mitochondrial content, metabolic activity, and central carbon flux in tumors. The predictions were experimentally validated by bioenergetic profiling and metabolic flux analysis of 13C-labeled substrates. The metabolic switch positions also distinguished between cellular states, correlating with tumor pathology, prognosis, and chemosensitivity. The method is applicable to any large and heterogeneous transcriptome dataset to discover metabolic and associated pathophysiological states. Significance: A method for identifying the transcriptomic signatures of metabolic switches underlying divergent routes of cellular transformation stratifies breast cancer into metabolic subtypes, predicting their biology, architecture, and clinical outcome.
Subject(s)
Breast Neoplasms , Mitochondria , Multigene Family , Humans , Breast Neoplasms/genetics , Breast Neoplasms/metabolism , Breast Neoplasms/pathology , Female , Mitochondria/metabolism , Mitochondria/genetics , Transcriptome , Gene Expression Profiling/methods , Gene Expression Regulation, Neoplastic , Prognosis , Energy Metabolism/genetics
ABSTRACT
BACKGROUND: Accurate protein function annotation is a severe bottleneck when utilizing the deluge of high-throughput, next generation sequencing data. Keeping database annotations up-to-date has become a major scientific challenge that requires the development of reliable automatic predictors of protein function. The CAFA experiment provided a unique opportunity to undertake comprehensive 'blind testing' of many diverse approaches for automated function prediction. We report on the methodology we used for this challenge and on the lessons we learnt. METHODS: Our method integrates into a single framework a wide variety of biological information sources, encompassing sequence, gene expression and protein-protein interaction data, as well as annotations in UniProt entries. The methodology transfers functional categories based on the results from complementary homology-based and feature-based analyses. We generated the final molecular function and biological process assignments by combining the initial predictions in a probabilistic manner, which takes into account the Gene Ontology hierarchical structure. RESULTS: We propose a novel scoring function called COmbined Graph-Information Content similarity (COGIC) score for the comparison of predicted functional categories and benchmark data. We demonstrate that our integrative approach provides increased scope and accuracy over both the component methods and the naïve predictors. In line with previous studies, we find that molecular function predictions are more accurate than biological process assignments. CONCLUSIONS: Overall, the results indicate that there is considerable room for improvement in the field. It still remains for the community to invest a great deal of effort to make automated function prediction a useful and routine component in the toolbox of life scientists. 
As already witnessed in other areas, community-wide blind testing experiments will be pivotal in establishing standards for the evaluation of prediction accuracy, in fostering advancements and new ideas, and ultimately in recording progress.
Subject(s)
Proteins/physiology , Computational Biology/methods , Databases, Protein , Evolution, Molecular , Gene Expression , Molecular Sequence Annotation , Protein Interaction Mapping , Proteins/chemistry , Proteins/genetics , Sequence Analysis
ABSTRACT
Data-Dependent and Data-Independent Acquisition modes (DDA and DIA, respectively) are both widely used to acquire MS2 spectra in untargeted liquid chromatography tandem mass spectrometry (LC-MS/MS) metabolomics analyses. Despite their wide use, little work has been attempted to systematically compare their MS/MS spectral annotation performance in untargeted settings due to the lack of ground truth and the costs involved in running a large number of acquisitions. Here, we present a systematic in silico comparison of these two acquisition methods in untargeted metabolomics by extending our Virtual Metabolomics Mass Spectrometer (ViMMS) framework with a DIA module. Our results show that the performance of these methods varies with the average number of co-eluting ions as the most important factor. At low numbers, DIA outperforms DDA, but at higher numbers, DDA has an advantage as DIA can no longer deal with the large amount of overlapping ion chromatograms. Results from simulation were further validated on an actual mass spectrometer, demonstrating that using ViMMS we can draw conclusions from simulation that translate well into the real world. The versatility of the Virtual Metabolomics Mass Spectrometer (ViMMS) framework in simulating different parameters of both Data-Dependent and Data-Independent Acquisition (DDA and DIA) modes is a key advantage of this work. Researchers can easily explore and compare the performance of different acquisition methods within the ViMMS framework, without the need for expensive and time-consuming experiments with real experimental data. By identifying the strengths and limitations of each acquisition method, researchers can optimize their choice and obtain more accurate and robust results. Furthermore, the ability to simulate and validate results using the ViMMS framework can save significant time and resources, as it eliminates the need for numerous experiments. 
This work not only provides valuable insights into the performance of DDA and DIA, but it also opens the door for further advancements in LC-MS/MS data acquisition methods.
ABSTRACT
The introduction of pneumococcal conjugate vaccines necessitates continued monitoring of circulating strains to assess vaccine efficacy and replacement serotypes. Conventional serological methods are costly, labor-intensive, and prone to misidentification, while current DNA-based methods have limited serotype coverage requiring multiple PCR primers. In this study, a computer algorithm was developed to interrogate the capsulation locus (cps) of vaccine serotypes to locate primer pairs in conserved regions that border variable regions and could differentiate between serotypes. In silico analysis of cps from 92 serotypes indicated that a primer pair spanning the regulatory gene cpsB could putatively amplify 84 serotypes and differentiate 46. This primer set was specific to Streptococcus pneumoniae, with no amplification observed for other species, including S. mitis, S. oralis, and S. pseudopneumoniae. One hundred thirty-eight pneumococcal strains covering 48 serotypes were tested. Of 23 vaccine serotypes included in the study, most (19/22, 86%) were identified correctly at least to the serogroup level, including all of the 13-valent conjugate vaccine and other replacement serotypes. Reproducibility was demonstrated by the correct sequetyping of different strains of a serotype. This novel sequence-based method employing a single PCR primer pair is cost-effective and simple. Furthermore, it has the potential to identify new serotypes that may evolve in the future.
Subject(s)
Molecular Typing/methods , Polymerase Chain Reaction/methods , Streptococcus pneumoniae/classification , Streptococcus pneumoniae/genetics , Computational Biology , Conserved Sequence , DNA Primers/genetics , DNA, Bacterial/chemistry , DNA, Bacterial/genetics , Humans , Molecular Sequence Data , Pneumococcal Infections/microbiology , Reproducibility of Results , Sensitivity and Specificity , Sequence Analysis, DNA , Serotyping/methods , Streptococcus pneumoniae/isolation & purification
ABSTRACT
Importance: Machine learning could be used to predict the likelihood of diagnosis and severity of illness. Lack of COVID-19 patient data has hindered the data science community in developing models to aid in the response to the pandemic. Objectives: To describe the rapid development and evaluation of clinical algorithms to predict COVID-19 diagnosis and hospitalization using patient data by citizen scientists, provide an unbiased assessment of model performance, and benchmark model performance on subgroups. Design, Setting, and Participants: This diagnostic and prognostic study operated a continuous, crowdsourced challenge using a model-to-data approach to securely enable the use of regularly updated COVID-19 patient data from the University of Washington by participants from May 6 to December 23, 2020. A postchallenge analysis was conducted from December 24, 2020, to April 7, 2021, to assess the generalizability of models on the cumulative data set as well as subgroups stratified by age, sex, race, and time of COVID-19 test. By December 23, 2020, this challenge engaged 482 participants from 90 teams and 7 countries. Main Outcomes and Measures: Machine learning algorithms used patient data and output a score that represented the probability of patients receiving a positive COVID-19 test result or being hospitalized within 21 days after receiving a positive COVID-19 test result. Algorithms were evaluated using area under the receiver operating characteristic curve (AUROC) and area under the precision recall curve (AUPRC) scores. Ensemble models aggregating models from the top challenge teams were developed and evaluated. Results: In the analysis using the cumulative data set, the best performance for COVID-19 diagnosis prediction was an AUROC of 0.776 (95% CI, 0.775-0.777) and an AUPRC of 0.297, and for hospitalization prediction, an AUROC of 0.796 (95% CI, 0.794-0.798) and an AUPRC of 0.188. 
Analysis on top models submitting to the challenge showed consistently better model performance on the female group than the male group. Among all age groups, the best performance was obtained for the 25- to 49-year age group, and the worst performance was obtained for the group aged 17 years or younger. Conclusions and Relevance: In this diagnostic and prognostic study, models submitted by citizen scientists achieved high performance for the prediction of COVID-19 testing and hospitalization outcomes. Evaluation of challenge models on demographic subgroups and prospective data revealed performance discrepancies, providing insights into the potential bias and limitations in the models.
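The two evaluation metrics named in this abstract, AUROC and AUPRC, can be illustrated with a small dependency-free sketch. It is not the challenge's evaluation code. The functions `auroc` and `average_precision` are hypothetical names, the labels and scores are invented toy values, and AUPRC is approximated here by average precision (a step-wise sum that ignores tied scores).

```python
def auroc(labels, scores):
    """Probability that a random positive is scored above a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive example")
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` predictions
    return total / n_pos


# Toy example: 1 = positive outcome, scores are model-predicted probabilities.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
roc = auroc(labels, scores)
ap = average_precision(labels, scores)
```

For a model that ranks at random, AUROC is about 0.5 regardless of class balance, whereas average precision is roughly the fraction of positives in the data. For that reason AUPRC values are usually read against the outcome prevalence rather than against 0.5.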
Subject(s)
Algorithms , Benchmarking , COVID-19/diagnosis , Clinical Decision Rules , Crowdsourcing , Hospitalization/statistics & numerical data , Machine Learning , Adolescent , Adult , Aged , Aged, 80 and over , Area Under Curve , COVID-19/epidemiology , COVID-19/therapy , COVID-19 Testing , Child , Child, Preschool , Female , Humans , Infant , Infant, Newborn , Male , Middle Aged , Models, Statistical , Prognosis , ROC Curve , Severity of Illness Index , Washington/epidemiology , Young Adult
ABSTRACT
Gene duplications represent an important class of evolutionary events that is likely to have contributed to the unique human phenotype in the short evolutionary time since the human-chimpanzee divergence. With the availability of both human and chimpanzee genome drafts in high coverage re-sequencing assemblies and the high annotation quality of most human genes, it should now be possible to identify all human lineage-specific gene duplication events (human inparalogues), and a few pioneering studies have attempted to do that. However, the different levels of coverage in the human and chimpanzee genome assemblies, and the differing levels of gene annotation, have led to problematic assumptions and oversimplifications in the algorithms and the datasets used to detect human lineage-specific gene duplications. In this study, we have developed a set of bioinformatic tools to overcome a number of the conceptual problems that are prevalent in previous studies and have collected a reliable and representative set of human inparalogues.
Subject(s)
Computational Biology , Evolution, Molecular , Gene Duplication , Genome, Human , Algorithms , Animals , Humans , Models, Biological , Molecular Sequence Annotation , Pan troglodytes/genetics , Proteome/genetics
ABSTRACT
Transcription of a large set of nuclear-encoded genes underlies biogenesis of mitochondria, regulated by a complex network of transcription factors and co-regulators. A remarkable heterogeneity can be detected in the expression of these genes in different cell types and tissues, and the recent availability of large gene expression compendiums allows the quantification of specific mitochondrial biogenesis patterns. We have developed a method to effectively perform this task. Massively correlated biclustering (MCbiclust) is a novel bioinformatics method that has been successfully applied to identify co-regulation patterns in large genesets, underlying essential cellular functions and determining cell types. The method has been recently evaluated and made available as a package in Bioconductor for R. One of the potential applications of the method is to compare expression of nuclear-encoded mitochondrial genes or larger sets of metabolism-related genes between different cell types or cellular metabolic states. Here we describe the essential steps to use MCbiclust as a tool to investigate co-regulation of mitochondrial genes and metabolic pathways.
Subject(s)
Cluster Analysis , Computational Biology , Gene Expression Profiling , Gene Expression Regulation , Genes, Mitochondrial , Mitochondria/metabolism , Algorithms , Computational Biology/methods , Databases, Genetic , Gene Expression Profiling/methods , Gene Regulatory Networks , Metabolic Networks and Pathways
ABSTRACT
Domain prediction from sequence is a particularly challenging task, and currently, a large variety of different methodologies are employed to tackle the task. Here we try to classify these diverse approaches into a number of broad categories. Completely automatic domain prediction from sequence alone is currently fraught with problems, but this should not be so surprising since human experts currently have significant disagreement on domain assignment even when given the structures. It can be argued that we should only test the domain prediction methods on benchmark data that human experts agree upon and this is the approach we take in this paper. Even for the data sets on which human experts agree, automatic structure-based domain assignment still cannot always agree, and so again it is still unlikely that domain prediction methods will reliably obtain correct results completely automatically. We make the argument that computer-assisted domain prediction is a more achievable goal. With this aim in mind, we present the DomPred server. This server provides the user with the results from two completely different categories of method (DPS and DomSSEA). In this paper, each method is individually benchmarked against one of the latest domain prediction benchmarks to provide information about their respective reliabilities. A variety of different benchmark scores are employed since the accuracy of a domain prediction method depends critically on what types of results one wishes to obtain (single/multi-domain classification, domain number, residue linker positions, etc.). Also, both of these methods, implemented within the DomPred server, can suggest alternative domain predictions, allowing the user to make the final decision based on these results and applying their own background knowledge to the problem. The DomPred server is available from the URL: http://bioinf.cs.ucl.ac.uk/software.html.
Subject(s)
Computers, Protein Databases, Proteins/chemistry, Protein Conformation
ABSTRACT
A number of state-of-the-art protein structure prediction servers have been developed by researchers working in the Bioinformatics Unit at University College London. The popular PSIPRED server allows users to perform secondary structure prediction, transmembrane topology prediction and protein fold recognition. More recent servers include DISOPRED for the prediction of protein dynamic disorder and DomPred for domain boundary prediction. These servers are available from our software home page at http://bioinf.cs.ucl.ac.uk/software.html.
Subject(s)
Protein Secondary Structure, Protein Tertiary Structure, Software, Computational Biology, Humans, Internet, London, Membrane Proteins/chemistry, Molecular Models, Polypyrimidine Tract-Binding Protein/chemistry, Protein Folding
ABSTRACT
BACKGROUND: To maintain the most comprehensive structural annotation databases, we must carry out regular updates for each proteome using the latest profile-profile fold recognition methods. The ability to carry out these updates on demand is necessary to keep pace with the regular updates of sequence and structure databases. Providing the highest quality structural models requires the most intensive profile-profile fold recognition methods, run with the very latest available sequence databases and fold libraries. However, running these methods on such a regular basis for every sequenced proteome requires large amounts of processing power. In this paper we describe and benchmark the JYDE (Job Yield Distribution Environment) system, a meta-scheduler designed to work above cluster schedulers such as Sun Grid Engine (SGE) or Condor. We demonstrate the ability of JYDE to distribute the load of genomic-scale fold recognition across multiple independent Grid domains. We use the most recent profile-profile version of our mGenTHREADER software to annotate the latest version of the human proteome against the latest sequence and structure databases in as short a time as possible. RESULTS: We show that our JYDE system is able to scale to large numbers of intensive fold recognition jobs running across several independent computer clusters. Using JYDE, we were able to annotate 99.9% of the protein sequences within the human proteome in less than 24 hours by harnessing over 500 CPUs from 3 independent Grid domains. CONCLUSION: This study clearly demonstrates the feasibility of carrying out on-demand, high-quality structural annotations for the proteomes of major eukaryotic organisms. Specifically, we have shown that it is now possible to provide complete, regular updates of profile-profile based fold recognition models for entire eukaryotic proteomes through the use of Grid middleware such as JYDE.
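The idea of a meta-scheduler spreading jobs across independent clusters can be sketched as follows. The abstract does not describe JYDE's actual distribution policy, so this is only a generic greedy load-balancing heuristic under stated assumptions: the function name `distribute_jobs`, the per-job cost estimates, and the crude model of queued time as CPU-hours divided by CPU count are all hypothetical.

```python
import heapq


def distribute_jobs(job_costs, cluster_cpus):
    """Assign jobs to clusters, always choosing the least-loaded cluster.

    job_costs: dict of job id -> estimated CPU-hours (hypothetical estimates)
    cluster_cpus: dict of cluster name -> number of CPUs (must be positive)
    Returns a dict of cluster name -> list of job ids.
    """
    assignment = {name: [] for name in cluster_cpus}
    # Heap of (estimated hours of work already queued, cluster name).
    heap = [(0.0, name) for name in sorted(cluster_cpus)]
    heapq.heapify(heap)
    # Place the longest jobs first, a common greedy heuristic.
    for job, cost in sorted(job_costs.items(), key=lambda kv: -kv[1]):
        load, name = heapq.heappop(heap)
        assignment[name].append(job)
        # Assumed model: a cluster with more CPUs clears the same work faster.
        heapq.heappush(heap, (load + cost / cluster_cpus[name], name))
    return assignment
```

In a real deployment each per-cluster list would then be handed to that cluster's own scheduler (for example SGE or Condor), which is the layer a meta-scheduler sits above.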