Results 1 - 20 of 659
1.
PLoS Comput Biol ; 20(9): e1012301, 2024 Sep.
Article in English | MEDLINE | ID: mdl-39226325

ABSTRACT

Clustering is widely used in bioinformatics and many other fields, with applications from exploratory analysis to prediction. Many types of data have associated uncertainty or measurement error, but this is rarely used to inform the clustering. We present Dirichlet Process Mixtures with Uncertainty (DPMUnc), an extension of a Bayesian nonparametric clustering algorithm which makes use of the uncertainty associated with data points. We show that DPMUnc out-performs existing methods on simulated data. We cluster immune-mediated diseases (IMD) using GWAS summary statistics, which have uncertainty linked with the sample size of the study. DPMUnc separates autoimmune from autoinflammatory diseases and isolates other subgroups such as adult-onset arthritis. We additionally consider how DPMUnc can be used to cluster gene expression datasets that have been summarised using gene signatures. We first introduce a novel procedure for generating a summary of a gene signature on a dataset different to the one where it was discovered, which incorporates a measure of the variability in expression across signature genes within each individual. We summarise three public gene expression datasets containing patients with a range of IMD, using three relevant gene signatures. We find association between disease and the clusters returned by DPMUnc, with clustering structure replicated across the datasets. The significance of this work is two-fold. Firstly, we demonstrate that when data has associated uncertainty, this uncertainty should be used to inform clustering and we present a method which does this, DPMUnc. Secondly, we present a procedure for using gene signatures in datasets other than where they were originally defined. We show the value of this procedure by summarising gene expression data from patients with immune-mediated diseases using relevant gene signatures, and clustering these patients using DPMUnc.
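
A minimal sketch of why per-point uncertainty can change how well a point fits a cluster. It assumes, purely for illustration and not as a statement of the DPMUnc model, a one-dimensional Gaussian cluster in which each observation's measurement variance is added to the cluster variance. The function name and all numbers are hypothetical.

```python
import math

def log_density(x, mu, cluster_var, obs_var):
    """Log density of x under N(mu, cluster_var + obs_var).

    Assumption for this sketch: measurement noise is Gaussian and
    independent of the cluster, so the two variances simply add.
    """
    total_var = cluster_var + obs_var
    return -0.5 * (math.log(2 * math.pi * total_var) + (x - mu) ** 2 / total_var)

# Two observations at the same distance from a cluster centred at 0.
precise = log_density(3.0, mu=0.0, cluster_var=1.0, obs_var=0.0)
noisy = log_density(3.0, mu=0.0, cluster_var=1.0, obs_var=4.0)

# The noisy point is more compatible with the cluster than the precise one.
print(noisy > precise)
```

Under this assumption, a point with a large measurement variance is penalised less for lying far from a cluster centre, which is one way uncertainty can inform the clustering.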


Subject(s)
Algorithms , Bayes Theorem , Computational Biology , Humans , Cluster Analysis , Uncertainty , Computational Biology/methods , Genome-Wide Association Study/methods , Genome-Wide Association Study/statistics & numerical data , Gene Expression Profiling/statistics & numerical data , Gene Expression Profiling/methods , Databases, Genetic/statistics & numerical data , Computer Simulation
2.
Mol Psychiatry ; 28(3): 1293-1302, 2023 03.
Article in English | MEDLINE | ID: mdl-36543923

ABSTRACT

While genome-wide association studies (GWASs) of Alzheimer's disease (AD) in European (EUR) ancestry cohorts have identified approximately 83 potentially independent AD risk loci, progress in non-European populations has lagged. In this study, data from the Million Veteran Program (MVP), a biobank that includes genetic data from more than 650,000 US Veteran participants, were used to examine dementia genetics in an African descent (AFR) cohort. A GWAS of Alzheimer's disease and related dementias (ADRD), an expanded AD phenotype that includes dementias such as vascular and non-specific dementia, was performed in AFR MVP participants aged 60+ (4012 cases and 18,435 controls). A proxy dementia GWAS based on survey-reported parental AD or dementia (n = 4385 maternal cases, 2256 paternal cases, and 45,970 controls) was also performed. These two GWASs were meta-analyzed, and the results were then compared and meta-analyzed with those from a previous AFR AD GWAS from the Alzheimer's Disease Genetics Consortium (ADGC). A meta-analysis of common variants across the MVP ADRD and proxy GWASs yielded GWAS-significant associations in the region of APOE (p = 2.48 × 10⁻¹⁰¹), in ROBO1 (rs11919682, p = 1.63 × 10⁻⁸), and RNA RP11-340A13.2 (rs148433063, p = 8.56 × 10⁻⁹). The MVP/ADGC meta-analysis yielded additional significant SNPs near known AD risk genes TREM2 (rs73427293, p = 2.95 × 10⁻⁹), CD2AP (rs7738720, p = 1.14 × 10⁻⁹), and ABCA7 (rs73505251, p = 3.26 × 10⁻¹⁰), although the peak variants observed in these genes differed from those previously reported in EUR and AFR cohorts. Of the genes in or near suggestive or genome-wide significant associated variants, nine (CDA, SH2D5, DCBLD1, EML6, GOPC, ABCA7, ROS1, TMCO4, and TREM2) were differentially expressed in the brains of AD cases and controls. This represents the largest AFR GWAS of AD and dementia, finding non-APOE GWAS-significant common SNPs associated with dementia. Increasing representation of AFR participants is an important priority in genetic studies and may lead to increased insight into AD pathophysiology and reduce health disparities.
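
The abstract does not say which meta-analysis scheme was used. A common choice for combining per-variant GWAS summary statistics is fixed-effect inverse-variance weighting, sketched here under that assumption. The function name and the effect sizes are hypothetical and are not taken from the study.

```python
import math

def inverse_variance_meta(betas, ses):
    """Fixed-effect inverse-variance meta-analysis for one variant.

    betas: per-study effect estimates; ses: their standard errors.
    Returns the pooled effect, its standard error, the z score, and
    a two-sided p-value under a normal approximation.
    """
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return beta, se, z, p

# Hypothetical example: two studies with equal precision.
beta, se, z, p = inverse_variance_meta([0.10, 0.20], [0.05, 0.05])
print(beta, se, z, p)
```

With equal standard errors the pooled effect is the plain average of the two estimates, and the pooled standard error shrinks by a factor of the square root of two.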


Subject(s)
Alzheimer Disease , Black or African American , Military Personnel , Aged , Humans , Middle Aged , Alzheimer Disease/epidemiology , Alzheimer Disease/ethnology , Alzheimer Disease/genetics , Black or African American/genetics , Black or African American/statistics & numerical data , Databases, Genetic/statistics & numerical data , Dementia/epidemiology , Dementia/ethnology , Dementia/genetics , Gene Expression Profiling , Genome-Wide Association Study , Genotype , Military Personnel/statistics & numerical data , Polymorphism, Genetic , United States/epidemiology , Genetic Predisposition to Disease/epidemiology , Genetic Predisposition to Disease/ethnology , Genetic Predisposition to Disease/genetics
3.
RNA ; 27(12): 1471-1481, 2021 12.
Article in English | MEDLINE | ID: mdl-34531327

ABSTRACT

Type I toxin-antitoxin (T1TA) systems constitute a large class of genetic modules with antisense RNA (asRNA)-mediated regulation of gene expression. They are widespread in bacteria and consist of an mRNA coding for a toxic protein and a noncoding asRNA that acts as an antitoxin, preventing the synthesis of the toxin by directly base-pairing to its cognate mRNA. The co- and post-transcriptional regulation of T1TA systems is intimately linked to RNA sequence and structure; therefore, an accurate annotation of the mRNA and asRNA molecules is essential to understand this regulation. However, most T1TA systems have been identified by means of bioinformatic analyses based solely on the toxin protein sequences, and there is no central repository of information on their specific RNA features. Here we present the first database dedicated to type I TA systems, named T1TAdb. It is an open-access web database (https://d-lab.arna.cnrs.fr/t1tadb) with a collection of ∼1900 loci in ∼500 bacterial strains in which a toxin-coding sequence has been previously identified. RNA molecules were annotated with a bioinformatic procedure based on key determinants of the mRNA structure and the genetic organization of the T1TA loci. Besides RNA and protein secondary structure predictions, T1TAdb also identifies promoter, ribosome-binding, and mRNA-asRNA interaction sites. It also includes tools for comparative analysis, such as sequence similarity search and computation of structural multiple alignments, which are annotated with covariation information. To our knowledge, T1TAdb represents the largest collection of features, sequences, and structural annotations on this class of genetic modules.


Subject(s)
Antitoxins/genetics , Bacterial Proteins/genetics , Computational Biology/methods , Databases, Genetic/statistics & numerical data , RNA, Antisense/genetics , Toxin-Antitoxin Systems/genetics , Gene Expression Regulation, Bacterial
4.
Nucleic Acids Res ; 49(14): 7995-8006, 2021 08 20.
Article in English | MEDLINE | ID: mdl-34244789

ABSTRACT

Although single-cell RNA sequencing (scRNA-seq) technologies are well developed, acquiring large-scale single-cell expression data can still be costly. Single-cell expression profiles are inherently sparse, which makes them compressible and thus offers opportunities for solutions. Here, using computational simulation as well as an experiment on 54 single cells, we propose that expression profiles can be compressed along the sample dimension by assigning each cell to multiple overlapping pools. We further demonstrate that expression profiles can be inferred from these pooled expression data with an overlapped pooling design and a compressed sensing strategy. We also show that, when combined with plate-based scRNA-seq measurement, this approach maintains the advantages of that measurement in gene detection sensitivity and individual identity and recovers the expression profile with high precision, while saving about half of the library cost. This method may inspire new ideas for improving the measurement, storage, or computation of other compressible signals in many biological areas.


Subject(s)
Algorithms , Computer Simulation , Gene Expression Profiling/methods , Models, Theoretical , Sequence Analysis, RNA/methods , Single-Cell Analysis/methods , Animals , Databases, Genetic/statistics & numerical data , Gene Library , Humans , Reproducibility of Results
5.
Nucleic Acids Res ; 49(D1): D183-D191, 2021 01 08.
Article in English | MEDLINE | ID: mdl-33068412

ABSTRACT

RNA molecules fold into complex structures that are important across many biological processes. Recent technological developments have enabled transcriptome-wide probing of RNA secondary structure using nucleases and chemical modifiers. These approaches have been widely applied to capture RNA secondary structure in many studies, but gathering and presenting such data from very different technologies in a comprehensive and accessible way has been challenging. Existing RNA structure probing databases usually focus on low-throughput or very specific datasets. Here, we present a comprehensive RNA structure probing database called RASP (RNA Atlas of Structure Probing), built by collecting 161 deduplicated transcriptome-wide RNA secondary structure probing datasets from 38 papers. RASP covers 18 species across animals, plants, bacteria, fungi, and viruses, and categorizes 18 experimental methods, including DMS-seq, SHAPE-Seq, SHAPE-MaP, and icSHAPE. In particular, RASP curates up-to-date datasets from several RNA secondary structure probing studies of the RNA genome of SARS-CoV-2, the RNA virus that caused the ongoing COVID-19 pandemic. RASP also provides a user-friendly interface to query, browse, and visualize RNA structure profiles, offering a shortcut to accessing RNA secondary structures grounded in experimental data. The database is freely available at http://rasp.zhanglab.net.


Subject(s)
Computational Biology/statistics & numerical data , Databases, Genetic/statistics & numerical data , High-Throughput Nucleotide Sequencing/statistics & numerical data , Nucleic Acid Conformation , RNA/chemistry , Transcriptome , Animals , COVID-19/epidemiology , COVID-19/prevention & control , COVID-19/virology , Computational Biology/methods , Genome, Viral/genetics , High-Throughput Nucleotide Sequencing/methods , Humans , Pandemics , RNA/genetics , RNA Probes/genetics , RNA, Bacterial/chemistry , RNA, Bacterial/genetics , RNA, Fungal/chemistry , RNA, Fungal/genetics , RNA, Plant/chemistry , RNA, Plant/genetics , RNA, Viral/chemistry , RNA, Viral/genetics , SARS-CoV-2/genetics , SARS-CoV-2/physiology
6.
PLoS Genet ; 16(6): e1008862, 2020 06.
Article in English | MEDLINE | ID: mdl-32569262

ABSTRACT

A major challenge emerging in genomic medicine is how best to assess disease risk from rare or novel variants found in disease-related genes. The expanding volume of data generated by very large phenotyping efforts coupled to DNA sequence data presents an opportunity to reinterpret genetic liability of disease risk. Here we propose a framework to estimate the probability of disease given the presence of a genetic variant, conditioned on features of that variant. We refer to this as the penetrance, the fraction of all variant heterozygotes that will present with disease. We demonstrate this methodology using a well-established disease-gene pair, the cardiac sodium channel gene SCN5A and the heart arrhythmia Brugada syndrome. From a review of 756 publications, we developed a pattern mixture algorithm, based on a Bayesian Beta-Binomial model, to generate SCN5A penetrance probabilities for Brugada syndrome conditioned on variant-specific attributes. These probabilities are determined from variant-specific features (e.g. function, structural context, and sequence conservation) and from observations of affected and unaffected heterozygotes. Variant functional perturbation and structural context prove most predictive of Brugada syndrome penetrance.
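
A minimal sketch of the conjugate Beta-Binomial update that underlies a penetrance estimate of this kind. It assumes a Beta prior whose parameters would, in the framework described above, be informed by variant-specific features. Here the prior values, the counts, and the function name are hypothetical, and the pattern mixture step is not shown.

```python
def penetrance_posterior(alpha_prior, beta_prior, affected, unaffected):
    """Update a Beta(alpha, beta) prior with heterozygote counts.

    With a Binomial likelihood the posterior is
    Beta(alpha + affected, beta + unaffected); its mean is the
    penetrance point estimate returned here.
    """
    alpha_post = alpha_prior + affected
    beta_post = beta_prior + unaffected
    mean = alpha_post / (alpha_post + beta_post)
    return alpha_post, beta_post, mean

# Hypothetical variant: the prior suggests 10% penetrance (Beta(1, 9)),
# then 3 affected and 7 unaffected heterozygotes are observed.
alpha_post, beta_post, mean = penetrance_posterior(1.0, 9.0, affected=3, unaffected=7)
print(alpha_post, beta_post, mean)
```

The prior acts like a set of pseudo-observations, so variants with few observed carriers stay close to the feature-based prior while well-observed variants are dominated by their counts.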


Subject(s)
Brugada Syndrome/genetics , Models, Genetic , NAV1.5 Voltage-Gated Sodium Channel/genetics , Penetrance , Polymorphism, Single Nucleotide , Algorithms , Bayes Theorem , Binomial Distribution , Brugada Syndrome/therapy , Databases, Genetic/statistics & numerical data , Datasets as Topic , Humans , Precision Medicine/methods
7.
PLoS Genet ; 16(6): e1008855, 2020 06.
Article in English | MEDLINE | ID: mdl-32542026

ABSTRACT

Traditional univariate genome-wide association studies generate false positives and negatives due to difficulties distinguishing associated variants from variants with spurious nonzero effects that do not directly influence the trait. Recent efforts have been directed at identifying genes or signaling pathways enriched for mutations in quantitative traits or case-control studies, but these can be computationally costly and hampered by strict model assumptions. Here, we present gene-ε, a new approach for identifying statistical associations between sets of variants and quantitative traits. Our key insight is that enrichment studies on the gene-level are improved when we reformulate the genome-wide SNP-level null hypothesis to identify spurious small-to-intermediate SNP effects and classify them as non-causal. gene-ε efficiently identifies enriched genes under a variety of simulated genetic architectures, achieving greater than a 90% true positive rate at 1% false positive rate for polygenic traits. Lastly, we apply gene-ε to summary statistics derived from six quantitative traits using European-ancestry individuals in the UK Biobank, and identify enriched genes that are in biologically relevant pathways.


Subject(s)
Genome-Wide Association Study/statistics & numerical data , Models, Genetic , Multifactorial Inheritance/genetics , Polymorphism, Single Nucleotide , Quantitative Trait Loci/genetics , Data Interpretation, Statistical , Databases, Genetic/statistics & numerical data , Humans , United Kingdom , White People/genetics
8.
PLoS Genet ; 16(2): e1008572, 2020 02.
Article in English | MEDLINE | ID: mdl-32012149

ABSTRACT

Cancer genomes with mutations in the exonuclease domain of Polymerase Epsilon (POLE) present with an extraordinarily high somatic mutation burden. In vitro studies have shown that distinct POLE mutants exhibit different polymerase activity. Yet genome-wide mutation patterns and driver mutation formation arising from different POLE mutants remain unclear. Here, we curated somatic mutation calls from 7,345 colorectal cancer samples from published studies and publicly available databases. These include 44 POLE mutant samples, 9 of which have whole genome sequencing data available. The POLE mutant samples were categorized based on the specific POLE mutation present. Mutation spectrum, associations of somatic mutations with epigenomic features, and co-occurrence with specific driver mutations were examined across different POLE mutants. We found that different POLE mutants exhibit distinct mutation spectra, with a significantly higher relative frequency of C>T mutations in POLE V411L mutants. Our analysis showed that this increased frequency of C>T mutations is not dependent on DNA methylation and not associated with other genomic features, and is thus specifically due to DNA sequence context alone. Notably, we found a strong association of the TP53 R213* mutation specifically with POLE P286R mutants. This truncation mutation occurs within the TT[C>T]GA context. For C>T mutations, this sequence context is significantly more likely to be mutated in POLE P286R mutants than in other POLE exonuclease domain mutants. This study refines our understanding of DNA polymerase fidelity and underscores the genome-wide mutation spectrum and specific cancer driver mutation formation observed in POLE mutant cancers.


Subject(s)
Carcinogenesis/genetics , Colorectal Neoplasms/genetics , DNA Polymerase II/metabolism , Poly-ADP-Ribose Binding Proteins/metabolism , Protein Domains/genetics , Tumor Suppressor Protein p53/genetics , CpG Islands/genetics , Cytosine/metabolism , DNA Methylation/genetics , DNA Mutational Analysis/statistics & numerical data , DNA Polymerase II/genetics , Databases, Genetic/statistics & numerical data , Datasets as Topic , Epigenesis, Genetic , Humans , Mutation , Poly-ADP-Ribose Binding Proteins/genetics , Whole Genome Sequencing/statistics & numerical data
9.
Hum Genet ; 141(2): 273-281, 2022 Feb.
Article in English | MEDLINE | ID: mdl-35048190

ABSTRACT

Recombination is a major force that shapes genetic diversity. Determination of the recombination rate is important and can theoretically be improved by increasing the sample size. However, it is nearly impossible to estimate recombination rates using traditional population genetics methods when the sample size is large, because these methods are highly computationally demanding. In this study, we used a refined machine learning approach to estimate the recombination rate of the human genome using the UK10K human genomic dataset with 7,562 genomic sequences and three of its subsets with 200, 400 and 2,000 genomic sequences. The estimation was performed under the human Out-of-Africa demographic model. We not only obtained an accurate human genetic map, but also found that the fluctuation of the estimated recombination rate along the human genome is reduced when the sample size increases. The estimated UK10K recombination rate heterogeneity is less than that estimated from its subsets. Our results demonstrate how the sample size affects the estimated recombination rate, and that analyses of a larger number of genomes result in a more precise estimation of the recombination rate. The accurate genetic map based on the UK10K dataset is also expected to benefit other research in human biology.


Subject(s)
Chromosome Mapping/methods , Genome, Human , Chromosome Mapping/statistics & numerical data , Databases, Genetic/statistics & numerical data , Genetics, Population , Humans , Machine Learning , Models, Genetic , Recombination, Genetic , Sample Size , Software , United Kingdom
10.
BMC Cancer ; 22(1): 138, 2022 Feb 03.
Article in English | MEDLINE | ID: mdl-35114976

ABSTRACT

BACKGROUND: Colorectal cancer (CRC) is a major cause of cancer-related death. The aim of this study was to identify differentially expressed and differentially methylated genes in order to explore the molecular mechanism of CRC. METHODS: First, gene transcriptome and genome-wide DNA methylation data were downloaded from the Gene Expression Omnibus database. Second, functional analysis of differentially expressed and differentially methylated genes was performed, followed by protein-protein interaction (PPI) analysis. Third, The Cancer Genome Atlas (TCGA) dataset and an in vitro experiment were used to validate the expression of selected differentially expressed and differentially methylated genes. Finally, diagnostic and prognostic analysis of the selected differentially expressed and differentially methylated genes was performed. RESULTS: Up to 1958 differentially expressed (1025 up-regulated and 993 down-regulated) genes and 858 differentially methylated (800 hypermethylated and 58 hypomethylated) genes were identified. Interestingly, some genes, such as GFRA2 and MDFI, were both differentially expressed and differentially methylated. Purine metabolism (involving IMPDH1), cell adhesion molecules, and the PI3K-Akt signaling pathway were significantly enriched signaling pathways. GFRA2, FOXQ1, CDH3, CLDN1, SCGN, BEST4, CXCL12, CA7, SHMT2, TRIP13, MDFI and IMPDH1 had diagnostic value for CRC. In addition, BEST4, SHMT2 and TRIP13 were significantly associated with patient survival. CONCLUSIONS: The identified altered genes may be involved in tumorigenesis of CRC. In addition, BEST4, SHMT2 and TRIP13 may be considered as diagnostic and prognostic biomarkers for CRC patients.


Subject(s)
Biomarkers, Tumor/genetics , Carcinogenesis/genetics , Colorectal Neoplasms/genetics , DNA Methylation , Gene Expression Regulation, Neoplastic , Colorectal Neoplasms/diagnosis , Colorectal Neoplasms/pathology , Databases, Genetic/statistics & numerical data , Datasets as Topic , Female , Gene Expression Profiling , Humans , Male , Middle Aged , Prognosis , Signal Transduction , Transcriptome
11.
PLoS Comput Biol ; 17(8): e1009224, 2021 08.
Article in English | MEDLINE | ID: mdl-34383739

ABSTRACT

Computational integrative analysis has become a significant approach in the data-driven exploration of biological problems. Many integration methods for cancer subtyping have been proposed, but evaluating these methods has become a complicated problem due to the lack of gold standards. Moreover, questions of practical importance remain to be addressed regarding the impact of selecting appropriate data types and combinations on the performance of integrative studies. Here, we constructed three classes of benchmarking datasets of nine cancers in TCGA by considering all the eleven combinations of four multi-omics data types. Using these datasets, we conducted a comprehensive evaluation of ten representative integration methods for cancer subtyping in terms of accuracy measured by combining both clustering accuracy and clinical significance, robustness, and computational efficiency. We subsequently investigated the influence of different omics data on cancer subtyping and the effectiveness of their combinations. Refuting the widely held intuition that incorporating more types of omics data always produces better results, our analyses showed that there are situations where integrating more omics data negatively impacts the performance of integration methods. Our analyses also suggested several effective combinations for most cancers under our studies, which may be of particular interest to researchers in omics data analysis.


Subject(s)
Computational Biology/methods , Neoplasms/classification , Neoplasms/genetics , Algorithms , Biomarkers, Tumor/genetics , Data Interpretation, Statistical , Databases, Genetic/statistics & numerical data , Deep Learning , Female , Genomics/statistics & numerical data , Humans , Male , Unsupervised Machine Learning
12.
PLoS Comput Biol ; 17(6): e1009085, 2021 06.
Article in English | MEDLINE | ID: mdl-34143767

ABSTRACT

The genetic alterations that underlie cancer development are highly tissue-specific, with the majority of driving alterations occurring in only a few cancer types and with alterations common to multiple cancer types often showing a tissue-specific functional impact. This tissue-specificity means that the biology of normal tissues carries important information regarding the pathophysiology of the associated cancers, information that can be leveraged to improve the power and accuracy of cancer genomic analyses. Research exploring the use of normal tissue data for the analysis of cancer genomics has primarily focused on the paired analysis of tumor and adjacent normal samples. Efforts to leverage the general characteristics of normal tissue for cancer analysis have received less attention, with most investigations focusing on understanding the tissue-specific factors that lead to individual genomic alterations or dysregulated pathways within a single cancer type. To address this gap and support scenarios where adjacent normal tissue samples are not available, we explored the genome-wide association between the transcriptomes of 21 solid human cancers and their associated normal tissues as profiled in healthy individuals. While the average gene expression profiles of normal and cancerous tissue may appear distinct, with normal tissues more similar to other normal tissues than to the associated cancer types, when transformed into relative expression values, i.e., the ratio of expression in one tissue or cancer relative to the mean in other tissues or cancers, the close association between gene activity in normal tissues and related cancers is revealed. As we demonstrate through an analysis of tumor data from The Cancer Genome Atlas and normal tissue data from the Human Protein Atlas, this association between tissue-specific and cancer-specific expression values can be leveraged to improve the prognostic modeling of cancer, the comparative analysis of different cancer types, and the analysis of cancer and normal tissue pairs.
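
The relative expression transform described in the abstract (expression in one tissue divided by the mean expression in the other tissues) can be sketched as follows. This is an illustration under simple assumptions: one gene at a time, strictly positive values, and no log transform or normalisation. The function name and the expression values are hypothetical.

```python
def relative_expression(values):
    """Map each tissue to its expression divided by the mean of the others.

    values: dict of tissue name -> expression level for a single gene.
    """
    if len(values) < 2:
        raise ValueError("need at least two tissues to form a ratio")
    result = {}
    for tissue, level in values.items():
        others = [v for name, v in values.items() if name != tissue]
        mean_others = sum(others) / len(others)
        if mean_others == 0:
            raise ValueError("mean expression of the other tissues is zero")
        result[tissue] = level / mean_others
    return result

# Hypothetical gene that is mostly expressed in liver.
rel = relative_expression({"liver": 8.0, "lung": 2.0, "kidney": 2.0})
print(rel)
```

A ratio well above 1 marks a gene as specific to that tissue, which is the kind of signal the abstract compares between normal tissues and their associated cancers.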


Subject(s)
Neoplasms/genetics , Computational Biology , Databases, Genetic/statistics & numerical data , Female , Gene Expression , Gene Expression Profiling/statistics & numerical data , Humans , Male , Organ Specificity/genetics , Principal Component Analysis , RNA-Seq , Reference Values , Survival Analysis
13.
PLoS Comput Biol ; 17(8): e1009250, 2021 08.
Article in English | MEDLINE | ID: mdl-34464378

ABSTRACT

Effective and powerful survival mediation models are currently lacking. To partly fill this knowledge gap, we focus on mediation analysis that includes multiple DNA methylations acting as exposures, one gene expression as the mediator, and one survival time as the outcome. We proposed IUSMMT (intersection-union survival mixture-adjusted mediation test) to effectively examine the existence of a mediation effect by fitting an empirical three-component mixture null distribution. With extensive simulation studies, we demonstrated the advantage of IUSMMT over existing methods. We applied IUSMMT to ten TCGA cancers and identified multiple genes that exhibited mediating effects. We further revealed that most of the identified regions, in which genes behaved as active mediators, were cancer type-specific and exhibited a full mediation from DNA methylation CpG sites to the survival risk of various types of cancers. Overall, IUSMMT represents an effective and powerful alternative for survival mediation analysis; our results also provide new insights into the functional role of DNA methylation and gene expression in cancer progression/prognosis and demonstrate potential therapeutic targets for future clinical practice.


Subject(s)
DNA Methylation , Gene Expression , Mediation Analysis , Neoplasms/genetics , Computational Biology , Computer Simulation , CpG Islands , Databases, Genetic/statistics & numerical data , Female , Gene Expression Regulation, Neoplastic , Gene Ontology , Genetic Techniques , Humans , Linear Models , Male , Models, Genetic , Prognosis , Proportional Hazards Models , Survival Analysis
14.
PLoS Comput Biol ; 17(7): e1009229, 2021 07.
Article in English | MEDLINE | ID: mdl-34280186

ABSTRACT

Graphs such as de Bruijn graphs and OLC (overlap-layout-consensus) graphs have been widely adopted for the de novo assembly of genomic short reads. This work studies another important problem in the field: how graphs can be used for high-performance compression of large-scale sequencing data. We present a novel graph definition named the Hamming-Shifting graph to address this problem. The definition originates from the technological characteristics of next-generation sequencing machines, aiming to link all pairs of distinct reads that have a small Hamming distance or a small shifting offset or both. We compute multiple lexicographically minimal k-mers to index the reads for an efficient search of the lightest-weight edges, and we prove a very high probability of successfully detecting these edges. The resulting graph creates a full mutual reference of the reads to cascade a code-minimized transfer of every child-read for an optimal compression. We conducted compression experiments on the minimum spanning forest of this extremely sparse graph, and achieved a 10-30% greater file size reduction compared to the best compression results using existing algorithms. As future work, the separation and connectivity degrees of these giant graphs can be used as economical measurements or protocols for quick quality assessment of wet-lab machines, for sufficiency control of genomic library preparation, and for accurate de novo genome assembly.
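
The three ingredients named in the abstract (Hamming distance, shifting offset, and indexing by a lexicographically minimal k-mer) can be sketched for equal-length reads as below. This is an illustration only: the abstract does not give the edge weights or the exact indexing scheme, so the function names, the single-minimizer index, and the example reads are assumptions.

```python
from collections import defaultdict

def hamming(a, b):
    """Number of mismatching positions between two equal-length reads."""
    if len(a) != len(b):
        raise ValueError("reads must have equal length")
    return sum(x != y for x, y in zip(a, b))

def shift_offset(a, b, max_shift):
    """Smallest s in 1..max_shift such that b overlaps a shifted by s."""
    if len(a) != len(b):
        raise ValueError("reads must have equal length")
    for s in range(1, max_shift + 1):
        if a[s:] == b[:len(b) - s]:
            return s
    return None

def min_kmer(read, k):
    """Lexicographically smallest k-mer of a read."""
    return min(read[i:i + k] for i in range(len(read) - k + 1))

def index_by_min_kmer(reads, k):
    """Bucket reads by minimal k-mer so candidate neighbours share a bucket."""
    buckets = defaultdict(list)
    for read in reads:
        buckets[min_kmer(read, k)].append(read)
    return dict(buckets)

reads = ["ACGTAC", "ACGTAA", "CGTACG"]  # hypothetical reads
buckets = index_by_min_kmer(reads, k=3)
print(buckets)
print(hamming(reads[0], reads[1]), shift_offset(reads[0], reads[2], max_shift=2))
```

Only reads that land in the same bucket need to be compared pairwise, which is what makes the search for close pairs tractable on large read sets.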


Subject(s)
Algorithms , Data Compression/methods , Genomics/methods , High-Throughput Nucleotide Sequencing/methods , Animals , Computational Biology , Computer Graphics , Data Compression/statistics & numerical data , Databases, Genetic/statistics & numerical data , Genomics/statistics & numerical data , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans
15.
PLoS Comput Biol ; 17(2): e1008322, 2021 02.
Article in English | MEDLINE | ID: mdl-33529184

ABSTRACT

Relaxed clock models enable estimation of molecular substitution rates across lineages and are widely used in phylogenetics for dating evolutionary divergence times. Under the (uncorrelated) relaxed clock model, tree branches are associated with molecular substitution rates which are independently and identically distributed. In this article we delved into the internal complexities of the relaxed clock model in order to develop efficient MCMC operators for Bayesian phylogenetic inference. We compared three substitution rate parameterisations, introduced an adaptive operator which learns the weights of other operators during MCMC, and we explored how relaxed clock model estimation can benefit from two cutting-edge proposal kernels: the AVMVN and Bactrian kernels. This work has produced an operator scheme that is up to 65 times more efficient at exploring continuous relaxed clock parameters compared with previous setups, depending on the dataset. Finally, we explored variants of the standard narrow exchange operator which are specifically designed for the relaxed clock model. In the most extreme case, this new operator traversed tree space 40% more efficiently than narrow exchange. The methodologies introduced are adaptive and highly effective on short as well as long alignments. The results are available via the open source optimised relaxed clock (ORC) package for BEAST 2 under a GNU licence (https://github.com/jordandouglas/ORC).


Subject(s)
Evolution, Molecular , Models, Genetic , Phylogeny , Algorithms , Animals , Bayes Theorem , Computational Biology , Computer Simulation , Databases, Genetic/statistics & numerical data , Likelihood Functions , Markov Chains , Monte Carlo Method , Mutation Rate , Software , Time Factors
16.
PLoS Comput Biol ; 17(11): e1009550, 2021 11.
Article in English | MEDLINE | ID: mdl-34748537

ABSTRACT

Metabolic network models are increasingly being used in health care and industry. As a consequence, many tools have been released to automate their de novo reconstruction. In order to enable gene deletion simulations and integration of gene expression data, these networks must include gene-protein-reaction (GPR) rules, which describe with Boolean logic the relationships between the gene products (e.g., enzyme isoforms or subunits) associated with the catalysis of a given reaction. Nevertheless, the reconstruction of GPRs remains a largely manual and time-consuming process. Aiming at fully automating the reconstruction of GPRs for any organism, we propose the open-source Python-based framework GPRuler. By mining text and data from 9 different biological databases, GPRuler can reconstruct GPRs starting either from just the name of the target organism or from an existing metabolic model. The performance of the developed tool is evaluated at small scale for a manually curated metabolic model, and at genome scale for three metabolic models related to Homo sapiens and Saccharomyces cerevisiae. Using these models as benchmarks, the proposed tool showed its ability to reproduce the original GPR rules with a high level of accuracy. In all the tested scenarios, after a manual investigation of the mismatches between the rules proposed by GPRuler and the original ones, the proposed approach proved in many cases to be more accurate than the original models. By complementing existing tools for metabolic network reconstruction with the possibility of reconstructing GPRs quickly and with few resources, GPRuler paves the way to the study of context-specific metabolic networks, representing the active portion of the complete network in given conditions, for organisms of industrial or biomedical interest that have not yet been characterized metabolically.
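
A minimal sketch of how a GPR rule supports a gene deletion simulation, following the usual convention that AND joins subunits of one enzyme complex and OR joins alternative isoforms. The nested-tuple representation, the function name, and the gene identifiers are hypothetical and are not GPRuler's data format or API.

```python
def reaction_active(rule, deleted_genes):
    """Evaluate a GPR rule given a set of deleted genes.

    rule is either a gene identifier (str) or a tuple
    ("and" | "or", operand, operand, ...), where operands are rules.
    """
    if isinstance(rule, str):
        return rule not in deleted_genes
    op, *operands = rule
    results = [reaction_active(r, deleted_genes) for r in operands]
    if op == "and":  # all subunits of a complex are required
        return all(results)
    if op == "or":   # any one isoform is sufficient
        return any(results)
    raise ValueError(f"unknown operator: {op!r}")

# Hypothetical rule: (G1 and G2) or G3
rule = ("or", ("and", "G1", "G2"), "G3")

print(reaction_active(rule, {"G1"}))        # the isoform G3 still catalyses the reaction
print(reaction_active(rule, {"G1", "G3"}))  # neither the complex nor the isoform remains
```

Applying such a check to every reaction in a model gives the set of reactions lost under a given knockout, which is the kind of simulation the abstract says GPR rules are needed for.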


Subject(s)
Metabolic Networks and Pathways/genetics , Models, Biological , Software , Computational Biology , Computer Simulation , Databases, Genetic/statistics & numerical data , Databases, Protein/statistics & numerical data , Humans , Models, Genetic , Molecular Sequence Annotation , Protein Interaction Maps/genetics , Protein Structure, Quaternary , Saccharomyces cerevisiae/genetics , Saccharomyces cerevisiae/metabolism , Saccharomyces cerevisiae Proteins/genetics , Saccharomyces cerevisiae Proteins/metabolism
17.
PLoS Comput Biol ; 17(11): e1009581, 2021 Nov.
Article in English | MEDLINE | ID: mdl-34748542

ABSTRACT

Nucleotide sequence and taxonomy reference databases are critical resources for widespread applications including marker-gene and metagenome sequencing for microbiome analysis, diet metabarcoding, and environmental DNA (eDNA) surveys. Reproducibly generating, managing, using, and evaluating nucleotide sequence and taxonomy reference databases creates a significant bottleneck for researchers aiming to generate custom sequence databases. Furthermore, database composition drastically influences results, and lack of standardization limits cross-study comparisons. To address these challenges, we developed RESCRIPt, a Python 3 software package and QIIME 2 plugin for reproducible generation and management of reference sequence taxonomy databases, including dedicated functions that streamline creating databases from popular sources, and functions for evaluating, comparing, and interactively exploring qualitative and quantitative characteristics across reference databases. To highlight the breadth and capabilities of RESCRIPt, we provide several examples for working with popular databases for microbiome profiling (SILVA, Greengenes, NCBI-RefSeq, GTDB), eDNA and diet metabarcoding surveys (BOLD, GenBank), as well as for genome comparison. We show that bigger is not always better, and reference databases with standardized taxonomies and those that focus on type strains have quantitative advantages, though they may not be appropriate for all use cases. Most databases appear to benefit from some curation (quality filtering), though sequence clustering appears detrimental to database quality. Finally, we demonstrate the breadth and extensibility of RESCRIPt for reproducible workflows with a comparison of global hepatitis genomes. RESCRIPt provides tools to democratize the process of reference database acquisition and management, enabling researchers to reproducibly and transparently create reference materials for diverse research applications.
RESCRIPt is released under a permissive BSD-3 license at https://github.com/bokulich-lab/RESCRIPt.


Subject(s)
Database Management Systems , Databases, Genetic/statistics & numerical data , Software , Animals , Classification , Computational Biology , DNA Barcoding, Taxonomic , Databases, Nucleic Acid , Genomics , Humans , Metagenome , Metagenomics , Microbiota/genetics , Phylogeny , RNA, Ribosomal, 16S/genetics , Sequence Analysis
18.
PLoS Comput Biol ; 17(11): e1009160, 2021 Nov.
Article in English | MEDLINE | ID: mdl-34788279

ABSTRACT

Gene expression analysis is becoming increasingly utilized in neuro-immunology research, and there is a growing need for non-programming scientists to be able to analyze their own genomic data. MGEnrichment is a web application developed both to disseminate to the community our curated database of microglia-relevant gene lists, and to allow non-programming scientists to easily conduct statistical enrichment analysis on their gene expression data. Users can upload their own gene IDs to assess the relevance of their expression data against gene lists from other studies. We include example datasets of differentially expressed genes (DEGs) from human postmortem brain samples from Autism Spectrum Disorder (ASD) and matched controls. We demonstrate how MGEnrichment can be used to expand the interpretations of these DEG lists in terms of regulation of microglial gene expression and provide novel insights into how ASD DEGs may be implicated specifically in microglial development, microbiome responses and relationships to other neuropsychiatric disorders. This tool will be particularly useful for those working in microglia, autism spectrum disorders, and neuro-immune activation research. MGEnrichment is available at https://ciernialab.shinyapps.io/MGEnrichmentApp/ and further online documentation and datasets can be found at https://github.com/ciernialab/MGEnrichmentApp. The app is released under the GNU GPLv3 open source license.
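
The abstract does not state which statistical test MGEnrichment applies. As a hedged illustration of what enrichment of an uploaded gene set against a curated gene list can mean, the sketch below computes an upper-tail hypergeometric p-value, one common choice for overlap tests. The function name `enrichment_p_value` and all numbers are assumptions made for this example, not values from the tool.

```python
from math import comb


def enrichment_p_value(overlap, query_size, list_size, background_size):
    """Upper-tail hypergeometric probability P(X >= overlap).

    Hypothetical helper: X is the number of genes shared between a query
    set of query_size genes and a curated list of list_size genes, both
    drawn from a background of background_size genes.
    """
    # Number of ways to draw the query set from the background.
    total_draws = comb(background_size, query_size)
    # The overlap cannot exceed either set size.
    largest_overlap = min(query_size, list_size)
    tail = sum(
        comb(list_size, i) * comb(background_size - list_size, query_size - i)
        for i in range(overlap, largest_overlap + 1)
    )
    return tail / total_draws


# Made-up example: 200 uploaded genes, a 500-gene curated list,
# 20000 background genes, 30 genes in common.
p = enrichment_p_value(overlap=30, query_size=200, list_size=500,
                       background_size=20000)
print(f"p = {p:.3g}")
```

With these made-up numbers the expected overlap under chance is 5 genes (200 × 500 / 20000), so an overlap of 30 gives a very small p-value. In practice the result would also be corrected for testing against many gene lists.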


Subject(s)
Gene Expression Profiling/statistics & numerical data , Microglia/metabolism , Software , Animals , Autism Spectrum Disorder/genetics , Autism Spectrum Disorder/immunology , Brain/immunology , Brain/metabolism , Computational Biology , Databases, Genetic/statistics & numerical data , Internet , Mice , Microglia/immunology , Models, Genetic , Neuroimmunomodulation
19.
PLoS Comput Biol ; 17(11): e1009449, 2021 Nov.
Article in English | MEDLINE | ID: mdl-34780468

ABSTRACT

The cost of sequencing a genome is dropping much faster than the cost of assembling and finishing it. The use of lightly sampled genomes (genome-skims) could be transformative for genomic ecology, and results using k-mers have shown the advantage of this approach in identification and phylogenetic placement of eukaryotic species. Here, we revisit the basic question of estimating genomic parameters such as genome length, coverage, and repeat structure, focusing specifically on estimating the k-mer repeat spectrum. We show using a mix of theoretical and empirical analysis that there are fundamental limitations to estimating the k-mer spectra due to ill-conditioned systems, which has implications for other genomic parameters. We get around this problem using a novel constrained optimization approach (Spline Linear Programming), where the constraints are learned empirically. On reads simulated at 1X coverage from 66 genomes, our method, REPeat SPECTra Estimation (RESPECT), had 2.2% error in length estimation compared to 27% error previously achieved. In shotgun sequenced read samples with contaminants, RESPECT length estimates had median error 4%, in contrast to other methods that had median error 80%. Together, the results suggest that low-pass genomic sequencing can yield reliable estimates of the length and repeat content of the genome. The RESPECT software will be publicly available at https://github.com/shahab-sarmashghi/RESPECT.git.
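
For readers unfamiliar with the term, the sketch below shows what a k-mer repeat spectrum is when the genome sequence is fully known: the number of distinct k-mers that occur exactly j times, for each j. This is only the quantity being estimated, not the RESPECT method, which infers it from low-coverage reads. The function name `kmer_repeat_spectrum` and the toy sequence are assumptions, and the sketch ignores the reverse-complement strand.

```python
from collections import Counter


def kmer_repeat_spectrum(sequence, k):
    """Map each multiplicity j to the number of distinct k-mers seen j times.

    Hypothetical helper operating on a single known sequence.
    """
    # Count every overlapping k-mer in the sequence.
    kmer_counts = Counter(
        sequence[i:i + k] for i in range(len(sequence) - k + 1)
    )
    # Count how many distinct k-mers share each multiplicity.
    spectrum = Counter(kmer_counts.values())
    return dict(sorted(spectrum.items()))


spectrum = kmer_repeat_spectrum("ACGTACGTTT", k=3)
print(spectrum)  # {1: 4, 2: 2}

# The total number of k-mer positions is recovered as sum(j * r_j).
print(sum(j * r for j, r in spectrum.items()))  # 8
```

Here four 3-mers occur once and two (ACG and CGT) occur twice, and the weighted sum recovers the 8 k-mer positions of the 10-base toy sequence, which is why the repeat spectrum is tied to genome length.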


Subject(s)
Algorithms , Genome , Genomics/statistics & numerical data , Repetitive Sequences, Nucleic Acid , Software , Animals , Computational Biology , Computer Simulation , Databases, Genetic/statistics & numerical data , Humans , Invertebrates/classification , Invertebrates/genetics , Least-Squares Analysis , Linear Models , Mammals/classification , Mammals/genetics , Models, Genetic , Phylogeny , Plants/classification , Plants/genetics , Vertebrates/classification , Vertebrates/genetics
20.
PLoS Comput Biol ; 17(6): e1009119, 2021 Jun.
Article in English | MEDLINE | ID: mdl-34181655

ABSTRACT

Cancer is the result of mutagenic processes that can be inferred from tumor genomes by analyzing rate spectra of point mutations, or "mutational signatures". Here we present SparseSignatures, a novel framework to extract signatures from somatic point mutation data. Our approach incorporates a user-specified background signature, employs regularization to reduce noise in non-background signatures, uses cross-validation to identify the number of signatures, and is scalable to large datasets. We show that SparseSignatures outperforms current state-of-the-art methods on simulated data using a variety of standard metrics. We then apply SparseSignatures to whole genome sequences of pancreatic and breast tumors, discovering well-differentiated signatures that are linked to known mutagenic mechanisms and are strongly associated with patient clinical features.


Subject(s)
DNA Mutational Analysis/statistics & numerical data , Neoplasms/genetics , Point Mutation , Algorithms , Biomarkers, Tumor/genetics , Breast Neoplasms/classification , Breast Neoplasms/genetics , Computational Biology , Computer Simulation , Databases, Genetic/statistics & numerical data , Female , Genes, BRCA1 , Genes, BRCA2 , Genome, Human , Humans , Pancreatic Neoplasms/classification , Pancreatic Neoplasms/genetics , Software