ABSTRACT
Clustering is widely used in bioinformatics and many other fields, with applications from exploratory analysis to prediction. Many types of data have associated uncertainty or measurement error, but this is rarely used to inform the clustering. We present Dirichlet Process Mixtures with Uncertainty (DPMUnc), an extension of a Bayesian nonparametric clustering algorithm which makes use of the uncertainty associated with data points. We show that DPMUnc out-performs existing methods on simulated data. We cluster immune-mediated diseases (IMD) using GWAS summary statistics, which have uncertainty linked with the sample size of the study. DPMUnc separates autoimmune from autoinflammatory diseases and isolates other subgroups such as adult-onset arthritis. We additionally consider how DPMUnc can be used to cluster gene expression datasets that have been summarised using gene signatures. We first introduce a novel procedure for generating a summary of a gene signature on a dataset different to the one where it was discovered, which incorporates a measure of the variability in expression across signature genes within each individual. We summarise three public gene expression datasets containing patients with a range of IMD, using three relevant gene signatures. We find association between disease and the clusters returned by DPMUnc, with clustering structure replicated across the datasets. The significance of this work is two-fold. Firstly, we demonstrate that when data has associated uncertainty, this uncertainty should be used to inform clustering and we present a method which does this, DPMUnc. Secondly, we present a procedure for using gene signatures in datasets other than where they were originally defined. We show the value of this procedure by summarising gene expression data from patients with immune-mediated diseases using relevant gene signatures, and clustering these patients using DPMUnc.
Subjects
Algorithms; Bayes Theorem; Computational Biology; Humans; Cluster Analysis; Uncertainty; Computational Biology/methods; Genome-Wide Association Study/methods; Genome-Wide Association Study/statistics & numerical data; Gene Expression Profiling/statistics & numerical data; Gene Expression Profiling/methods; Databases, Genetic/statistics & numerical data; Computer Simulation
ABSTRACT
Although genome-wide association studies (GWASs) of Alzheimer's disease (AD) in European (EUR) ancestry cohorts have identified approximately 83 potentially independent AD risk loci, progress in non-European populations has lagged. In this study, data from the Million Veteran Program (MVP), a biobank that includes genetic data from more than 650,000 US Veteran participants, were used to examine dementia genetics in an African descent (AFR) cohort. A GWAS of Alzheimer's disease and related dementias (ADRD), an expanded AD phenotype that includes dementias such as vascular and non-specific dementia, was performed in AFR MVP participants aged 60+ (4012 cases and 18,435 controls). A proxy dementia GWAS based on survey-reported parental AD or dementia (n = 4385 maternal cases, 2256 paternal cases, and 45,970 controls) was also performed. These two GWASs were meta-analyzed, and the results were then compared and meta-analyzed with those of a previous AFR AD GWAS from the Alzheimer's Disease Genetics Consortium (ADGC). A meta-analysis of common variants across the MVP ADRD and proxy GWASs yielded GWAS-significant associations in the region of APOE (p = 2.48 × 10⁻¹⁰¹), in ROBO1 (rs11919682, p = 1.63 × 10⁻⁸), and RNA RP11-340A13.2 (rs148433063, p = 8.56 × 10⁻⁹). The MVP/ADGC meta-analysis yielded additional significant SNPs near known AD risk genes TREM2 (rs73427293, p = 2.95 × 10⁻⁹), CD2AP (rs7738720, p = 1.14 × 10⁻⁹), and ABCA7 (rs73505251, p = 3.26 × 10⁻¹⁰), although the peak variants observed in these genes differed from those previously reported in EUR and AFR cohorts. Of the genes in or near suggestive or genome-wide significant associated variants, nine (CDA, SH2D5, DCBLD1, EML6, GOPC, ABCA7, ROS1, TMCO4, and TREM2) were differentially expressed in the brains of AD cases and controls. This represents the largest AFR GWAS of AD and dementia, finding non-APOE GWAS-significant common SNPs associated with dementia. Increasing representation of AFR participants is an important priority in genetic studies and may lead to increased insight into AD pathophysiology and reduce health disparities.
Subjects
Alzheimer Disease; Black or African American; Military Personnel; Aged; Humans; Middle Aged; Alzheimer Disease/epidemiology; Alzheimer Disease/ethnology; Alzheimer Disease/genetics; Black or African American/genetics; Black or African American/statistics & numerical data; Databases, Genetic/statistics & numerical data; Dementia/epidemiology; Dementia/ethnology; Dementia/genetics; Gene Expression Profiling; Genome-Wide Association Study; Genotype; Military Personnel/statistics & numerical data; Polymorphism, Genetic; United States/epidemiology; Genetic Predisposition to Disease/epidemiology; Genetic Predisposition to Disease/ethnology; Genetic Predisposition to Disease/genetics
ABSTRACT
Type I toxin-antitoxin (T1TA) systems constitute a large class of genetic modules with antisense RNA (asRNA)-mediated regulation of gene expression. They are widespread in bacteria and consist of an mRNA coding for a toxic protein and a noncoding asRNA that acts as an antitoxin preventing the synthesis of the toxin by directly base-pairing to its cognate mRNA. The co- and post-transcriptional regulation of T1TA systems is intimately linked to RNA sequence and structure; it is therefore essential to have an accurate annotation of the mRNA and asRNA molecules to understand this regulation. However, most T1TA systems have been identified by means of bioinformatic analyses solely based on the toxin protein sequences, and there is no central repository of information on their specific RNA features. Here we present the first database dedicated to type I TA systems, named T1TAdb. It is an open-access web database (https://d-lab.arna.cnrs.fr/t1tadb) with a collection of ∼1900 loci in ∼500 bacterial strains in which a toxin-coding sequence has been previously identified. RNA molecules were annotated with a bioinformatic procedure based on key determinants of the mRNA structure and the genetic organization of the T1TA loci. Besides RNA and protein secondary structure predictions, T1TAdb also identifies promoter, ribosome-binding, and mRNA-asRNA interaction sites. It also includes tools for comparative analysis, such as sequence similarity search and computation of structural multiple alignments, which are annotated with covariation information. To our knowledge, T1TAdb represents the largest collection of features, sequences, and structural annotations on this class of genetic modules.
Subjects
Antitoxins/genetics; Bacterial Proteins/genetics; Computational Biology/methods; Databases, Genetic/statistics & numerical data; RNA, Antisense/genetics; Toxin-Antitoxin Systems/genetics; Gene Expression Regulation, Bacterial
ABSTRACT
Although single-cell RNA sequencing (scRNA-seq) technologies are well developed, acquiring large-scale single-cell expression data can still be costly. Single-cell expression profiles are inherently sparse, which makes them compressible and thus offers an opportunity to reduce this cost. Here, using computational simulation as well as an experiment on 54 single cells, we propose that expression profiles can be compressed along the sample dimension by assigning each cell to multiple overlapping pools. We demonstrate that expression profiles can be inferred from these pooled expression data by combining the overlapped pooling design with a compressed sensing strategy. We also show that, when combined with plate-based scRNA-seq measurement, this approach retains the advantages of plate-based methods in gene detection sensitivity and individual identity and recovers the expression profile with high precision, while saving about half of the library cost. This method may inspire new approaches to the measurement, storage, or computation of other compressible signals in many areas of biology.
Subjects
Algorithms; Computer Simulation; Gene Expression Profiling/methods; Models, Theoretical; Sequence Analysis, RNA/methods; Single-Cell Analysis/methods; Animals; Databases, Genetic/statistics & numerical data; Gene Library; Humans; Reproducibility of Results
ABSTRACT
RNA molecules fold into complex structures that are important across many biological processes. Recent technological developments have enabled transcriptome-wide probing of RNA secondary structure using nucleases and chemical modifiers. These approaches have been widely applied to capture RNA secondary structure in many studies, but gathering and presenting such data from very different technologies in a comprehensive and accessible way has been challenging. Existing RNA structure probing databases usually focus on low-throughput or very specific datasets. Here, we present a comprehensive RNA structure probing database called RASP (RNA Atlas of Structure Probing), built by collecting 161 deduplicated transcriptome-wide RNA secondary structure probing datasets from 38 papers. RASP covers 18 species across animals, plants, bacteria, fungi, and viruses, and categorizes 18 experimental methods, including DMS-seq, SHAPE-Seq, SHAPE-MaP, and icSHAPE. In particular, RASP curates up-to-date datasets from several RNA secondary structure probing studies of the RNA genome of SARS-CoV-2, the RNA virus that caused the ongoing COVID-19 pandemic. RASP also provides a user-friendly interface to query, browse, and visualize RNA structure profiles, offering a shortcut to accessing RNA secondary structures grounded in experimental data. The database is freely available at http://rasp.zhanglab.net.
Subjects
Computational Biology/statistics & numerical data; Databases, Genetic/statistics & numerical data; High-Throughput Nucleotide Sequencing/statistics & numerical data; Nucleic Acid Conformation; RNA/chemistry; Transcriptome; Animals; COVID-19/epidemiology; COVID-19/prevention & control; COVID-19/virology; Computational Biology/methods; Genome, Viral/genetics; High-Throughput Nucleotide Sequencing/methods; Humans; Pandemics; RNA/genetics; RNA Probes/genetics; RNA, Bacterial/chemistry; RNA, Bacterial/genetics; RNA, Fungal/chemistry; RNA, Fungal/genetics; RNA, Plant/chemistry; RNA, Plant/genetics; RNA, Viral/chemistry; RNA, Viral/genetics; SARS-CoV-2/genetics; SARS-CoV-2/physiology
ABSTRACT
A major challenge emerging in genomic medicine is how best to assess disease risk from rare or novel variants found in disease-related genes. The expanding volume of data generated by very large phenotyping efforts coupled to DNA sequence data presents an opportunity to reinterpret genetic liability of disease risk. Here we propose a framework to estimate the probability of disease given the presence of a genetic variant conditioned on features of that variant. We refer to this as the penetrance, the fraction of all variant heterozygotes that will present with disease. We demonstrate this methodology using a well-established disease-gene pair, the cardiac sodium channel gene SCN5A and the heart arrhythmia Brugada syndrome. From a review of 756 publications, we developed a pattern mixture algorithm, based on a Bayesian Beta-Binomial model, to generate SCN5A penetrance probabilities for Brugada syndrome conditioned on variant-specific attributes. These probabilities are determined from variant-specific features (e.g. function, structural context, and sequence conservation) and from observations of affected and unaffected heterozygotes. Variant functional perturbation and structural context prove most predictive of Brugada syndrome penetrance.
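The Beta-Binomial idea behind such a penetrance estimate can be illustrated with a minimal Python sketch. It assumes a conjugate Beta(alpha, beta) prior, which in a framework like the one described would be informed by variant-specific features. The function name, the prior values, and the carrier counts are hypothetical, and the sketch is not the authors' pattern mixture algorithm.

```python
def posterior_penetrance(affected, unaffected, alpha, beta):
    """Update a Beta(alpha, beta) prior with counts of affected and
    unaffected heterozygous carriers (Beta-Binomial conjugacy).

    Returns the posterior mean penetrance and the posterior Beta parameters.
    """
    if affected < 0 or unaffected < 0:
        raise ValueError("carrier counts must be non-negative")
    if alpha <= 0 or beta <= 0:
        raise ValueError("prior parameters must be positive")
    post_alpha = alpha + affected      # prior pseudo-cases plus observed cases
    post_beta = beta + unaffected      # prior pseudo-controls plus observed controls
    mean = post_alpha / (post_alpha + post_beta)
    return mean, post_alpha, post_beta


# Hypothetical variant: prior mean penetrance 1 / (1 + 4) = 0.2,
# then 3 affected and 7 unaffected heterozygotes are observed.
mean, post_alpha, post_beta = posterior_penetrance(
    affected=3, unaffected=7, alpha=1.0, beta=4.0
)
print(f"posterior mean penetrance: {mean:.3f}")
```

With few observed carriers the estimate stays close to the feature-based prior; as carrier counts grow, the observed fraction of affected heterozygotes dominates.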
Subjects
Brugada Syndrome/genetics; Models, Genetic; NAV1.5 Voltage-Gated Sodium Channel/genetics; Penetrance; Polymorphism, Single Nucleotide; Algorithms; Bayes Theorem; Binomial Distribution; Brugada Syndrome/therapy; Databases, Genetic/statistics & numerical data; Datasets as Topic; Humans; Precision Medicine/methods
ABSTRACT
Traditional univariate genome-wide association studies generate false positives and negatives due to difficulties distinguishing associated variants from variants with spurious nonzero effects that do not directly influence the trait. Recent efforts have been directed at identifying genes or signaling pathways enriched for mutations in quantitative traits or case-control studies, but these can be computationally costly and hampered by strict model assumptions. Here, we present gene-ε, a new approach for identifying statistical associations between sets of variants and quantitative traits. Our key insight is that enrichment studies on the gene-level are improved when we reformulate the genome-wide SNP-level null hypothesis to identify spurious small-to-intermediate SNP effects and classify them as non-causal. gene-ε efficiently identifies enriched genes under a variety of simulated genetic architectures, achieving greater than a 90% true positive rate at 1% false positive rate for polygenic traits. Lastly, we apply gene-ε to summary statistics derived from six quantitative traits using European-ancestry individuals in the UK Biobank, and identify enriched genes that are in biologically relevant pathways.
Subjects
Genome-Wide Association Study/statistics & numerical data; Models, Genetic; Multifactorial Inheritance/genetics; Polymorphism, Single Nucleotide; Quantitative Trait Loci/genetics; Data Interpretation, Statistical; Databases, Genetic/statistics & numerical data; Humans; United Kingdom; White People/genetics
ABSTRACT
Cancer genomes with mutations in the exonuclease domain of Polymerase Epsilon (POLE) present with an extraordinarily high somatic mutation burden. In vitro studies have shown that distinct POLE mutants exhibit different polymerase activity. Yet the genome-wide mutation patterns and driver mutation formation arising from different POLE mutants remain unclear. Here, we curated somatic mutation calls from 7,345 colorectal cancer samples from published studies and publicly available databases. These include 44 POLE mutant samples, 9 of which have whole genome sequencing data available. The POLE mutant samples were categorized based on the specific POLE mutation present. Mutation spectrum, associations of somatic mutations with epigenomic features, and co-occurrence with specific driver mutations were examined across different POLE mutants. We found that different POLE mutants exhibit distinct mutation spectra, with a significantly higher relative frequency of C>T mutations in POLE V411L mutants. Our analysis showed that this increased frequency of C>T mutations is not dependent on DNA methylation and is not associated with other genomic features, and is thus attributable to DNA sequence context alone. Notably, we found a strong association of the TP53 R213* mutation specifically with POLE P286R mutants. This truncation mutation occurs within the TT[C>T]GA context. For C>T mutations, this sequence context is significantly more likely to be mutated in POLE P286R mutants than in other POLE exonuclease domain mutants. This study refines our understanding of DNA polymerase fidelity and highlights the genome-wide mutation spectra and specific cancer driver mutation formation observed in POLE mutant cancers.
Subjects
Carcinogenesis/genetics; Colorectal Neoplasms/genetics; DNA Polymerase II/metabolism; Poly-ADP-Ribose Binding Proteins/metabolism; Protein Domains/genetics; Tumor Suppressor Protein p53/genetics; CpG Islands/genetics; Cytosine/metabolism; DNA Methylation/genetics; DNA Mutational Analysis/statistics & numerical data; DNA Polymerase II/genetics; Databases, Genetic/statistics & numerical data; Datasets as Topic; Epigenesis, Genetic; Humans; Mutation; Poly-ADP-Ribose Binding Proteins/genetics; Whole Genome Sequencing/statistics & numerical data
ABSTRACT
Recombination is a major force that shapes genetic diversity. Determination of recombination rate is important and can theoretically be improved by increasing the sample size. However, it is nearly impossible to estimate recombination rates using traditional population genetics methods when the sample size is large because these methods are highly computationally demanding. In this study, we used a refined machine learning approach to estimate the recombination rate of the human genome using the UK10K human genomic dataset with 7,562 genomic sequences and three of its subsets with 200, 400 and 2,000 genomic sequences. The estimation was performed under the human Out-of-Africa demographic model. We not only obtained an accurate human genetic map, but also found that the fluctuation of the estimated recombination rate along the human genome is reduced as the sample size increases. The recombination rate heterogeneity estimated from the full UK10K dataset is lower than that estimated from its subsets. Our results demonstrate how the sample size affects the estimated recombination rate, and show that analysing a larger number of genomes yields a more precise estimate of recombination rate. The accurate genetic map based on the UK10K dataset is also expected to benefit other human biology research.
Subjects
Chromosome Mapping/methods; Genome, Human; Chromosome Mapping/statistics & numerical data; Databases, Genetic/statistics & numerical data; Genetics, Population; Humans; Machine Learning; Models, Genetic; Recombination, Genetic; Sample Size; Software; United Kingdom
ABSTRACT
BACKGROUND: Colorectal cancer (CRC) is a major cause of cancer-related death. The aim of this study was to identify differentially expressed and differentially methylated genes in order to explore the molecular mechanism of CRC. METHODS: Firstly, gene transcriptome and genome-wide DNA methylation data were downloaded from the Gene Expression Omnibus database. Secondly, functional analysis of differentially expressed and differentially methylated genes was performed, followed by protein-protein interaction (PPI) analysis. Thirdly, the Cancer Genome Atlas (TCGA) dataset and in vitro experiments were used to validate the expression of selected differentially expressed and differentially methylated genes. Finally, diagnostic and prognostic analyses of selected differentially expressed and differentially methylated genes were performed. RESULTS: Up to 1958 differentially expressed (1025 up-regulated and 993 down-regulated) genes and 858 differentially methylated (800 hypermethylated and 58 hypomethylated) genes were identified. Interestingly, some genes, such as GFRA2 and MDFI, were both differentially expressed and differentially methylated. Purine metabolism (involving IMPDH1), cell adhesion molecules and the PI3K-Akt signaling pathway were significantly enriched signaling pathways. GFRA2, FOXQ1, CDH3, CLDN1, SCGN, BEST4, CXCL12, CA7, SHMT2, TRIP13, MDFI and IMPDH1 had diagnostic value for CRC. In addition, BEST4, SHMT2 and TRIP13 were significantly associated with patients' survival. CONCLUSIONS: The identified altered genes may be involved in tumorigenesis of CRC. In addition, BEST4, SHMT2 and TRIP13 may be considered as diagnostic and prognostic biomarkers for CRC patients.
Subjects
Biomarkers, Tumor/genetics; Carcinogenesis/genetics; Colorectal Neoplasms/genetics; DNA Methylation; Gene Expression Regulation, Neoplastic; Colorectal Neoplasms/diagnosis; Colorectal Neoplasms/pathology; Databases, Genetic/statistics & numerical data; Datasets as Topic; Female; Gene Expression Profiling; Humans; Male; Middle Aged; Prognosis; Signal Transduction; Transcriptome
ABSTRACT
Computational integrative analysis has become a significant approach in the data-driven exploration of biological problems. Many integration methods for cancer subtyping have been proposed, but evaluating these methods has become a complicated problem due to the lack of gold standards. Moreover, questions of practical importance remain to be addressed regarding the impact of selecting appropriate data types and combinations on the performance of integrative studies. Here, we constructed three classes of benchmarking datasets of nine cancers in TCGA by considering all the eleven combinations of four multi-omics data types. Using these datasets, we conducted a comprehensive evaluation of ten representative integration methods for cancer subtyping in terms of accuracy measured by combining both clustering accuracy and clinical significance, robustness, and computational efficiency. We subsequently investigated the influence of different omics data on cancer subtyping and the effectiveness of their combinations. Refuting the widely held intuition that incorporating more types of omics data always produces better results, our analyses showed that there are situations where integrating more omics data negatively impacts the performance of integration methods. Our analyses also suggested several effective combinations for most cancers under our studies, which may be of particular interest to researchers in omics data analysis.
Subjects
Computational Biology/methods; Neoplasms/classification; Neoplasms/genetics; Algorithms; Biomarkers, Tumor/genetics; Data Interpretation, Statistical; Databases, Genetic/statistics & numerical data; Deep Learning; Female; Genomics/statistics & numerical data; Humans; Male; Unsupervised Machine Learning
ABSTRACT
Graphs such as de Bruijn graphs and OLC (overlap-layout-consensus) graphs have been widely adopted for the de novo assembly of genomic short reads. This work studies another important problem in the field: how graphs can be used for high-performance compression of large-scale sequencing data. We present a novel graph definition, the Hamming-Shifting graph, to address this problem. The definition originates from the technological characteristics of next-generation sequencing machines and aims to link all pairs of distinct reads that have a small Hamming distance, a small shifting offset, or both. We compute multiple lexicographically minimal k-mers to index the reads for an efficient search of the lightest-weight edges, and we prove that these edges are detected with very high probability. The resulting graph creates a full mutual reference of the reads, so that every child read can be encoded with minimal code relative to its parent for optimal compression. We conducted compression experiments on the minimum spanning forest of this extremely sparse graph and achieved a 10 to 30% greater file size reduction than the best compression results of existing algorithms. As future work, the separation and connectivity degrees of these giant graphs can be used as economical measurements or protocols for quick quality assessment of wet-lab machines, for sufficiency control of genomic library preparation, and for accurate de novo genome assembly.
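The indexing step can be sketched in Python as follows. This is a simplified illustration that assumes a single lexicographically minimal k-mer per read (the abstract describes multiple) and that treats reads sharing a key as candidate neighbours. All function names and the toy reads are hypothetical, and this is not the authors' implementation.

```python
from collections import defaultdict


def minimal_kmer(read, k):
    """Return the lexicographically smallest k-mer of a read."""
    if len(read) < k:
        raise ValueError("read is shorter than k")
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def hamming(a, b):
    """Number of mismatching positions between two equal-length reads."""
    if len(a) != len(b):
        raise ValueError("reads must have equal length")
    return sum(x != y for x, y in zip(a, b))


def bucket_reads(reads, k):
    """Group reads by their minimal k-mer; reads that land in the same
    bucket are candidates for a low-weight edge in the graph."""
    buckets = defaultdict(list)
    for read in reads:
        buckets[minimal_kmer(read, k)].append(read)
    return dict(buckets)


reads = ["ACGTAC", "ACGTTC", "CGTACG", "TTTTTT"]
buckets = bucket_reads(reads, k=3)
for key, members in sorted(buckets.items()):
    print(key, members)
```

In this toy input, "ACGTAC" and "ACGTTC" differ by one substitution, while "CGTACG" is "ACGTAC" shifted by one base. All three share the minimal 3-mer "ACG", so only reads within that bucket need to be compared pairwise.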
Subjects
Algorithms; Data Compression/methods; Genomics/methods; High-Throughput Nucleotide Sequencing/methods; Animals; Computational Biology; Computer Graphics; Data Compression/statistics & numerical data; Databases, Genetic/statistics & numerical data; Genomics/statistics & numerical data; High-Throughput Nucleotide Sequencing/statistics & numerical data; Humans
ABSTRACT
The genetic alterations that underlie cancer development are highly tissue-specific, with the majority of driving alterations occurring in only a few cancer types and with alterations common to multiple cancer types often showing a tissue-specific functional impact. This tissue-specificity means that the biology of normal tissues carries important information regarding the pathophysiology of the associated cancers, information that can be leveraged to improve the power and accuracy of cancer genomic analyses. Research exploring the use of normal tissue data for the analysis of cancer genomics has primarily focused on the paired analysis of tumor and adjacent normal samples. Efforts to leverage the general characteristics of normal tissue for cancer analysis have received less attention, with most investigations focusing on understanding the tissue-specific factors that lead to individual genomic alterations or dysregulated pathways within a single cancer type. To address this gap and support scenarios where adjacent normal tissue samples are not available, we explored the genome-wide association between the transcriptomes of 21 solid human cancers and their associated normal tissues as profiled in healthy individuals. While the average gene expression profiles of normal and cancerous tissue may appear distinct, with normal tissues more similar to other normal tissues than to the associated cancer types, when transformed into relative expression values, i.e., the ratio of expression in one tissue or cancer relative to the mean in other tissues or cancers, the close association between gene activity in normal tissues and related cancers is revealed. As we demonstrate through an analysis of tumor data from The Cancer Genome Atlas and normal tissue data from the Human Protein Atlas, this association between tissue-specific and cancer-specific expression values can be leveraged to improve the prognostic modeling of cancer, the comparative analysis of different cancer types, and the analysis of cancer and normal tissue pairs.
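The relative expression transformation described in the abstract, the ratio of expression in one tissue to the mean in the other tissues, can be sketched in Python as below. The function name, the tissue labels, and the expression values are hypothetical, and the sketch assumes the mean over the other tissues is positive.

```python
def relative_expression(expr):
    """Map each tissue to its expression divided by the mean expression
    of the same gene across all *other* tissues.

    expr: dict of tissue name -> expression value for a single gene.
    Assumes the mean over the other tissues is strictly positive.
    """
    if len(expr) < 2:
        raise ValueError("need at least two tissues")
    total = sum(expr.values())
    n_others = len(expr) - 1
    relative = {}
    for tissue, value in expr.items():
        mean_others = (total - value) / n_others  # leave-one-out mean
        relative[tissue] = value / mean_others
    return relative


# Hypothetical expression values for one gene in three tissues.
gene = {"liver": 90.0, "lung": 10.0, "kidney": 20.0}
rel = relative_expression(gene)
for tissue, ratio in rel.items():
    print(f"{tissue}: {ratio:.3f}")
```

A ratio well above 1 marks the gene as tissue-specific for that tissue. Applying the same transformation across cancer types gives cancer-specific values that can then be compared with the tissue-specific ones.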
Subjects
Neoplasms/genetics; Computational Biology; Databases, Genetic/statistics & numerical data; Female; Gene Expression; Gene Expression Profiling/statistics & numerical data; Humans; Male; Organ Specificity/genetics; Principal Component Analysis; RNA-Seq; Reference Values; Survival Analysis
ABSTRACT
Effective and powerful survival mediation models are currently lacking. To partly fill this knowledge gap, we focus on mediation analysis that includes multiple DNA methylation sites acting as exposures, one gene expression as the mediator, and one survival time as the outcome. We propose IUSMMT (intersection-union survival mixture-adjusted mediation test) to effectively test for the existence of a mediation effect by fitting an empirical three-component mixture null distribution. With extensive simulation studies, we demonstrate the advantage of IUSMMT over existing methods. We applied IUSMMT to ten TCGA cancers and identified multiple genes that exhibited mediating effects. We further found that most of the identified regions, in which genes behaved as active mediators, were cancer type-specific and exhibited full mediation from DNA methylation CpG sites to the survival risk of various types of cancers. Overall, IUSMMT represents an effective and powerful alternative for survival mediation analysis; our results also provide new insights into the functional role of DNA methylation and gene expression in cancer progression and prognosis, and suggest potential therapeutic targets for future clinical practice.
Subjects
DNA Methylation; Gene Expression; Mediation Analysis; Neoplasms/genetics; Computational Biology; Computer Simulation; CpG Islands; Databases, Genetic/statistics & numerical data; Female; Gene Expression Regulation, Neoplastic; Gene Ontology; Genetic Techniques; Humans; Linear Models; Male; Models, Genetic; Prognosis; Proportional Hazards Models; Survival Analysis
ABSTRACT
Relaxed clock models enable estimation of molecular substitution rates across lineages and are widely used in phylogenetics for dating evolutionary divergence times. Under the (uncorrelated) relaxed clock model, tree branches are associated with molecular substitution rates which are independently and identically distributed. In this article we delved into the internal complexities of the relaxed clock model in order to develop efficient MCMC operators for Bayesian phylogenetic inference. We compared three substitution rate parameterisations, introduced an adaptive operator which learns the weights of other operators during MCMC, and we explored how relaxed clock model estimation can benefit from two cutting-edge proposal kernels: the AVMVN and Bactrian kernels. This work has produced an operator scheme that is up to 65 times more efficient at exploring continuous relaxed clock parameters compared with previous setups, depending on the dataset. Finally, we explored variants of the standard narrow exchange operator which are specifically designed for the relaxed clock model. In the most extreme case, this new operator traversed tree space 40% more efficiently than narrow exchange. The methodologies introduced are adaptive and highly effective on short as well as long alignments. The results are available via the open source optimised relaxed clock (ORC) package for BEAST 2 under a GNU licence (https://github.com/jordandouglas/ORC).
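As a rough illustration of a Bactrian proposal kernel, the Python sketch below draws a step from an equal mixture of two normal distributions centred at -m and +m, each with variance 1 - m², so that the step has mean 0 and variance 1. It then applies the step as a random walk on the log scale of a branch rate. The function names, the scale, and the value m = 0.95 are assumptions made for illustration; the operators in the ORC package may be parameterised differently.

```python
import math
import random


def bactrian_step(rng, m=0.95):
    """Draw one step from a standard Bactrian distribution: a 50/50 mixture
    of N(-m, 1 - m^2) and N(+m, 1 - m^2), which has mean 0 and variance 1."""
    if not 0.0 <= m < 1.0:
        raise ValueError("m must lie in [0, 1)")
    sd = math.sqrt(1.0 - m * m)
    centre = m if rng.random() < 0.5 else -m
    return rng.gauss(centre, sd)


def propose_rate(rate, rng, scale=0.1, m=0.95):
    """Propose a new positive rate by a Bactrian random walk on log(rate).

    Returns the proposed rate and the log Hastings ratio, which for this
    multiplicative move is log(new_rate / rate).
    """
    new_rate = rate * math.exp(scale * bactrian_step(rng, m))
    return new_rate, math.log(new_rate / rate)


rng = random.Random(0)
steps = [bactrian_step(rng) for _ in range(20000)]
mean = sum(steps) / len(steps)
var = sum((s - mean) ** 2 for s in steps) / len(steps)
new_rate, log_hr = propose_rate(1.0, rng)
print(f"mean={mean:.3f} var={var:.3f} proposed rate={new_rate:.3f}")
```

Because the two modes sit away from zero, the kernel rarely proposes a value almost identical to the current one, which is the intuition behind its use in place of a plain Gaussian random walk.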
Subjects
Evolution, Molecular; Models, Genetic; Phylogeny; Algorithms; Animals; Bayes Theorem; Computational Biology; Computer Simulation; Databases, Genetic/statistics & numerical data; Likelihood Functions; Markov Chains; Monte Carlo Method; Mutation Rate; Software; Time Factors
ABSTRACT
Gene expression analysis is becoming increasingly utilized in neuro-immunology research, and there is a growing need for non-programming scientists to be able to analyze their own genomic data. MGEnrichment is a web application developed both to disseminate to the community our curated database of microglia-relevant gene lists, and to allow non-programming scientists to easily conduct statistical enrichment analysis on their gene expression data. Users can upload their own gene IDs to assess the relevance of their expression data against gene lists from other studies. We include example datasets of differentially expressed genes (DEGs) from human postmortem brain samples from Autism Spectrum Disorder (ASD) and matched controls. We demonstrate how MGEnrichment can be used to expand the interpretations of these DEG lists in terms of regulation of microglial gene expression and provide novel insights into how ASD DEGs may be implicated specifically in microglial development, microbiome responses and relationships to other neuropsychiatric disorders. This tool will be particularly useful for those working in microglia, autism spectrum disorders, and neuro-immune activation research. MGEnrichment is available at https://ciernialab.shinyapps.io/MGEnrichmentApp/ and further online documentation and datasets can be found at https://github.com/ciernialab/MGEnrichmentApp. The app is released under the GNU GPLv3 open source license.
Subjects
Gene Expression Profiling/statistics & numerical data; Microglia/metabolism; Software; Animals; Autism Spectrum Disorder/genetics; Autism Spectrum Disorder/immunology; Brain/immunology; Brain/metabolism; Computational Biology; Databases, Genetic/statistics & numerical data; Internet; Mice; Microglia/immunology; Models, Genetic; Neuroimmunomodulation
ABSTRACT
Metabolic network models are increasingly being used in health care and industry. As a consequence, many tools have been released to automate their de novo reconstruction. To enable gene deletion simulations and the integration of gene expression data, these networks must include gene-protein-reaction (GPR) rules, which use Boolean logic to describe the relationships between the gene products (e.g., enzyme isoforms or subunits) associated with the catalysis of a given reaction. Nevertheless, the reconstruction of GPRs remains a largely manual and time-consuming process. Aiming to fully automate the reconstruction of GPRs for any organism, we propose the open-source Python-based framework GPRuler. By mining text and data from 9 different biological databases, GPRuler can reconstruct GPRs starting either from just the name of the target organism or from an existing metabolic model. The performance of the tool is evaluated at small scale on a manually curated metabolic model, and at genome scale on three metabolic models of Homo sapiens and Saccharomyces cerevisiae. Using these models as benchmarks, the proposed tool showed its ability to reproduce the original GPR rules with a high level of accuracy. In all the tested scenarios, after a manual investigation of the mismatches between the rules proposed by GPRuler and the original ones, the proposed approach proved in many cases to be more accurate than the original models. By complementing existing tools for metabolic network reconstruction with the ability to reconstruct GPRs quickly and with few resources, GPRuler paves the way to the study of context-specific metabolic networks, representing the active portion of the complete network in given conditions, for organisms of industrial or biomedical interest that have not yet been characterized metabolically.
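To make the role of GPR rules in gene deletion simulations concrete, the following Python sketch evaluates a Boolean rule against a set of available genes. The nested-tuple rule representation, the function name, and the gene identifiers G1 to G3 are hypothetical choices for illustration and do not reflect GPRuler's own data structures or output format.

```python
def reaction_active(rule, present_genes):
    """Evaluate a GPR rule against the set of genes that are present.

    rule is either a gene identifier (str) or a tuple
    ("and" | "or", sub_rule, sub_rule, ...).
    Subunits of one enzyme complex are joined with "and";
    alternative isoforms are joined with "or".
    """
    if isinstance(rule, str):
        return rule in present_genes
    operator, *operands = rule
    if operator == "and":
        return all(reaction_active(r, present_genes) for r in operands)
    if operator == "or":
        return any(reaction_active(r, present_genes) for r in operands)
    raise ValueError(f"unknown operator: {operator!r}")


# Hypothetical reaction catalysed either by the complex G1+G2 or by isoform G3.
rule = ("or", ("and", "G1", "G2"), "G3")
all_genes = {"G1", "G2", "G3"}

wild_type = reaction_active(rule, all_genes)
without_g3 = reaction_active(rule, all_genes - {"G3"})
without_g1_g3 = reaction_active(rule, all_genes - {"G1", "G3"})
print(wild_type, without_g3, without_g1_g3)
```

Deleting the isoform gene alone leaves the reaction available through the complex, whereas additionally removing one subunit of the complex disables it. A metabolic model needs exactly this kind of information before a knockout can be simulated.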
Subjects
Metabolic Networks and Pathways/genetics, Biological Models, Software, Computational Biology, Computer Simulation, Genetic Databases/statistics & numerical data, Protein Databases/statistics & numerical data, Humans, Genetic Models, Molecular Sequence Annotation, Protein Interaction Maps/genetics, Quaternary Protein Structure, Saccharomyces cerevisiae/genetics, Saccharomyces cerevisiae/metabolism, Saccharomyces cerevisiae Proteins/genetics, Saccharomyces cerevisiae Proteins/metabolism
Abstract
Nucleotide sequence and taxonomy reference databases are critical resources for widespread applications including marker-gene and metagenome sequencing for microbiome analysis, diet metabarcoding, and environmental DNA (eDNA) surveys. Reproducibly generating, managing, using, and evaluating nucleotide sequence and taxonomy reference databases creates a significant bottleneck for researchers aiming to generate custom sequence databases. Furthermore, database composition drastically influences results, and lack of standardization limits cross-study comparisons. To address these challenges, we developed RESCRIPt, a Python 3 software package and QIIME 2 plugin for reproducible generation and management of reference sequence taxonomy databases, including dedicated functions that streamline creating databases from popular sources, and functions for evaluating, comparing, and interactively exploring qualitative and quantitative characteristics across reference databases. To highlight the breadth and capabilities of RESCRIPt, we provide several examples for working with popular databases for microbiome profiling (SILVA, Greengenes, NCBI-RefSeq, GTDB), eDNA and diet metabarcoding surveys (BOLD, GenBank), as well as for genome comparison. We show that bigger is not always better, and that reference databases with standardized taxonomies and those that focus on type strains have quantitative advantages, though they may not be appropriate for all use cases. Most databases appear to benefit from some curation (quality filtering), though sequence clustering appears detrimental to database quality. Finally, we demonstrate the breadth and extensibility of RESCRIPt for reproducible workflows with a comparison of global hepatitis genomes. RESCRIPt provides tools to democratize the process of reference database acquisition and management, enabling researchers to reproducibly and transparently create reference materials for diverse research applications.
RESCRIPt is released under a permissive BSD-3 license at https://github.com/bokulich-lab/RESCRIPt.
Subjects
Database Management Systems, Genetic Databases/statistics & numerical data, Software, Animals, Classification, Computational Biology, Taxonomic DNA Barcoding, Nucleic Acid Databases, Genomics, Humans, Metagenome, Metagenomics, Microbiota/genetics, Phylogeny, 16S Ribosomal RNA/genetics, Sequence Analysis
Abstract
The cost of sequencing a genome is dropping at a much faster rate than the cost of assembling and finishing it. The use of lightly sampled genomes (genome-skims) could be transformative for genomic ecology, and results using k-mers have shown the advantage of this approach in identification and phylogenetic placement of eukaryotic species. Here, we revisit the basic question of estimating genomic parameters such as genome length, coverage, and repeat structure, focusing specifically on estimating the k-mer repeat spectrum. We show using a mix of theoretical and empirical analysis that there are fundamental limitations to estimating the k-mer spectra due to ill-conditioned systems, and that this has implications for other genomic parameters. We get around this problem using a novel constrained optimization approach (Spline Linear Programming), where the constraints are learned empirically. On reads simulated at 1X coverage from 66 genomes, our method, REPeat SPECTra Estimation (RESPECT), had 2.2% error in length estimation compared to the 27% error previously achieved. In shotgun-sequenced read samples with contaminants, RESPECT length estimates had a median error of 4%, in contrast to other methods, which had a median error of 80%. Together, the results suggest that low-pass genomic sequencing can yield reliable estimates of the length and repeat content of the genome. The RESPECT software will be publicly available at https://github.com/shahab-sarmashghi/RESPECT.git.
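To make the term "k-mer repeat spectrum" concrete, the short Python sketch below counts, for a fully known sequence, how many distinct k-mers occur exactly once, twice, and so on. It is only a definition-level illustration under simplifying assumptions (a single strand, no reverse complements, no sequencing errors); it is not the RESPECT method, which estimates this spectrum from low-coverage reads. The function name `kmer_repeat_spectrum` and the toy sequence are hypothetical.

```python
from collections import Counter


def kmer_repeat_spectrum(sequence, k):
    """Map each multiplicity j to the number of distinct k-mers seen j times.

    Simplifying assumption: k-mers are counted on the given strand only.
    """
    if k <= 0 or k > len(sequence):
        raise ValueError("k must be between 1 and the sequence length")
    # Count every overlapping window of length k.
    kmer_counts = Counter(
        sequence[i:i + k] for i in range(len(sequence) - k + 1)
    )
    # Histogram of the counts: multiplicity -> number of distinct k-mers.
    return dict(Counter(kmer_counts.values()))


# Toy sequence (hypothetical): 'ACG' occurs twice, four other 3-mers once.
spectrum = kmer_repeat_spectrum("ACGTACGA", 3)
# The total number of k-mer positions is recovered as sum(j * r_j).
total_positions = sum(j * r for j, r in spectrum.items())
```

For a complete genome, the sum of multiplicity times count over the spectrum equals the number of k-mer positions, which is why the repeat spectrum and the genome length are closely linked quantities.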
Subjects
Algorithms, Genome, Genomics/statistics & numerical data, Repetitive Nucleic Acid Sequences, Software, Animals, Computational Biology, Computer Simulation, Genetic Databases/statistics & numerical data, Humans, Invertebrates/classification, Invertebrates/genetics, Least-Squares Analysis, Linear Models, Mammals/classification, Mammals/genetics, Genetic Models, Phylogeny, Plants/classification, Plants/genetics, Vertebrates/classification, Vertebrates/genetics
Abstract
Cancer is the result of mutagenic processes that can be inferred from tumor genomes by analyzing rate spectra of point mutations, or "mutational signatures". Here we present SparseSignatures, a novel framework to extract signatures from somatic point mutation data. Our approach incorporates a user-specified background signature, employs regularization to reduce noise in non-background signatures, uses cross-validation to identify the number of signatures, and is scalable to large datasets. We show that SparseSignatures outperforms current state-of-the-art methods on simulated data using a variety of standard metrics. We then apply SparseSignatures to whole genome sequences of pancreatic and breast tumors, discovering well-differentiated signatures that are linked to known mutagenic mechanisms and are strongly associated with patient clinical features.