Results 1 - 11 of 11
1.
Bioinformatics ; 39(1)2023 01 01.
Article in English | MEDLINE | ID: mdl-36648314

ABSTRACT

MOTIVATION: Timetrees depict evolutionary relationships between species and the geological times of their divergence. Hundreds of research articles containing timetrees are published in scientific journals every year. The TimeTree (TT) project has been manually locating, curating and synthesizing timetrees from these articles for almost two decades into a TimeTree of Life, delivered through a unique, user-friendly web interface (timetree.org). The manual process of finding articles containing timetrees is becoming increasingly expensive and time-consuming. So, we have explored the effectiveness of text-mining approaches and developed optimizations to find research articles containing timetrees automatically. RESULTS: We have developed an optimized machine learning system to determine if a research article contains an evolutionary timetree appropriate for inclusion in the TT resource. We found that BERT classification fine-tuned on whole-text articles achieved an F1 score of 0.67, which we increased to 0.88 by text-mining article excerpts surrounding the mentioning of figures. The new method is implemented in the TimeTreeFinder (TTF) tool, which automatically processes millions of articles to discover timetree-containing articles. We estimate that the TTF tool would produce twice as many timetree-containing articles as those discovered manually, whose inclusion in the TT database would potentially double the knowledge accessible to a wider community. Manual inspection showed that the precision on out-of-distribution recently published articles is 87%. This automation will speed up the collection and curation of timetrees with much lower human and time costs. AVAILABILITY AND IMPLEMENTATION: https://github.com/marija-stanojevic/time-tree-classification. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
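The step of text-mining excerpts around figure mentions can be illustrated with a minimal Python sketch. It is not the TimeTreeFinder implementation: the regular expression, the function name `figure_excerpts`, and the character-window size are all assumptions made for illustration, and the abstract does not state how excerpts are actually delimited.

```python
import re

# Hypothetical pattern for figure mentions such as "Fig. 2", "Figure 3" or "figs 4".
FIGURE_MENTION = re.compile(r"\bfig(?:ure|s|\.)?\s*\d+", re.IGNORECASE)


def figure_excerpts(text, context_chars=200):
    """Return the text surrounding each figure mention.

    context_chars is an assumed window size: the number of characters
    kept on each side of the mention.
    """
    excerpts = []
    for match in FIGURE_MENTION.finditer(text):
        start = max(0, match.start() - context_chars)
        end = min(len(text), match.end() + context_chars)
        excerpts.append(text[start:end])
    return excerpts


if __name__ == "__main__":
    sample = "Divergence times are summarized in Fig. 2 for all sampled lineages."
    for excerpt in figure_excerpts(sample, context_chars=30):
        print(excerpt)
```

Excerpts of this kind, rather than the whole article text, would then be passed to a classifier; the abstract reports that this change raised the F1 score from 0.67 to 0.88.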


Subject(s)
Biological Evolution, Data Mining, Humans, Phylogeny, Factual Databases, Machine Learning
2.
BMC Bioinformatics ; 22(1): 224, 2021 May 01.
Article in English | MEDLINE | ID: mdl-33932985

ABSTRACT

BACKGROUND: RNA sequencing (RNA-seq) is a common and widespread biological assay, and an increasing amount of data is generated with it. In practice, there are a large number of individual steps a researcher must perform before raw RNA-seq reads yield directly valuable information, such as differential gene expression data. Existing software tools are typically specialized, only performing one step (such as alignment of reads to a reference genome) of a larger workflow. The demand for a more comprehensive and reproducible workflow has led to the production of a number of publicly available RNA-seq pipelines. However, we have found that most require computational expertise to set up or share among several users, are not actively maintained, or lack features we have found to be important in our own analyses. RESULTS: In response to these concerns, we have developed a Scalable Pipeline for Expression Analysis and Quantification (SPEAQeasy), which is easy to install and share, and provides a bridge towards R/Bioconductor downstream analysis solutions. SPEAQeasy is portable across computational frameworks (SGE, SLURM, local, docker integration) and different configuration files are provided (http://research.libd.org/SPEAQeasy/). CONCLUSIONS: SPEAQeasy is user-friendly and lowers the computational-domain entry barrier for biologists and clinicians to RNA-seq data processing as the main input file is a table with sample names and their corresponding FASTQ files. The goal is to provide a flexible pipeline that is immediately usable by researchers, regardless of their technical background or computing environment.


Subject(s)
High-Throughput Nucleotide Sequencing, Software, RNA-Seq, RNA Sequence Analysis, Workflow
3.
Mol Biol Evol ; 37(6): 1819-1831, 2020 06 01.
Article in English | MEDLINE | ID: mdl-32119075

ABSTRACT

The conventional wisdom in molecular evolution is to apply parameter-rich models of nucleotide and amino acid substitutions for estimating divergence times. However, the actual extent of the difference between time estimates produced by highly complex models compared with those from simple models is yet to be quantified for contemporary data sets that frequently contain sequences from many species and genes. In a reanalysis of many large multispecies alignments from diverse groups of taxa, we found that the use of the simplest models can produce divergence time estimates and credibility intervals similar to those obtained from the complex models applied in the original studies. This result is surprising because the use of simple models underestimates sequence divergence for all the data sets analyzed. We found three fundamental reasons for the observed robustness of time estimates to model complexity in many practical data sets. First, the estimates of branch lengths and node-to-tip distances under the simplest model show an approximately linear relationship with those produced by using the most complex models applied on data sets with many sequences. Second, relaxed clock methods automatically adjust rates on branches that experience considerable underestimation of sequence divergences, resulting in time estimates that are similar to those from complex models. And, third, the inclusion of even a few good calibrations in an analysis can reduce the difference in time estimates from simple and complex models. The robustness of time estimates to model complexity in these empirical data analyses is encouraging, because all phylogenomics studies use statistical models that are oversimplified descriptions of actual evolutionary substitution processes.


Subject(s)
Molecular Evolution, Genomics/methods, Genetic Models, Phylogeny, Plants/genetics
4.
PLoS Comput Biol ; 16(1): e1007046, 2020 01.
Article in English | MEDLINE | ID: mdl-31951607

ABSTRACT

Pathogen timetrees are phylogenies scaled to time. They reveal the temporal history of a pathogen spread through the populations as captured in the evolutionary history of strains. These timetrees are inferred by using molecular sequences of pathogenic strains sampled at different times. That is, temporally sampled sequences enable the inference of sequence divergence times. Here, we present a new approach (RelTime with Dated Tips [RTDT]) to estimating pathogen timetrees based on a relative rate framework underlying the RelTime approach that is algebraic in nature and distinct from all other current methods. RTDT does not require many of the priors demanded by Bayesian approaches, and it has light computing requirements. In analyses of an extensive collection of computer-simulated datasets, we found the accuracy of RTDT time estimates and the coverage probabilities of their confidence intervals (CIs) to be excellent. In analyses of empirical datasets, RTDT produced dates that were similar to those reported in the literature. In comparative benchmarking with Bayesian and non-Bayesian methods (LSD, TreeTime, and treedater), we found that no method performed the best in every scenario. So, we provide a brief guideline for users to select the most appropriate method in empirical data analysis. RTDT is implemented for use via a graphical user interface and in high-throughput settings in the newest release of cross-platform MEGA X software, freely available from http://www.megasoftware.net.


Subject(s)
Computational Biology/methods, Molecular Evolution, Phylogeny, Sequence Alignment/methods, DNA Sequence Analysis/methods, Algorithms, Animals, Humans, Software, Virus Diseases/virology, Viruses/classification, Viruses/genetics
5.
Bioinformatics ; 34(23): 4017-4026, 2018 12 01.
Article in English | MEDLINE | ID: mdl-29931046

ABSTRACT

Motivation: Analyses of data generated from bulk sequencing of tumors have revealed extensive genomic heterogeneity within patients. Many computational methods have been developed to enable the inference of genotypes of tumor cell populations (clones) from bulk sequencing data. However, the relative and absolute accuracy of available computational methods in estimating clone counts and clone genotypes is not yet known. Results: We have assessed the performance of nine methods, including eight previously-published and one new method (CloneFinder), by analyzing computer simulated datasets. CloneFinder, LICHeE, CITUP and cloneHD inferred clone genotypes with low error (<5% per clone) for a majority of datasets in which the tumor samples contained evolutionarily-related clones. Computational methods did not perform well for datasets in which tumor samples contained mixtures of clones from different clonal lineages. Generally, the number of clones was underestimated by cloneHD and overestimated by PhyloWGS, and BayClone2, Canopy and Clomial required prior information regarding the number of clones. AncesTree and Canopy did not produce results for a large number of datasets. Overall, the deconvolution of clone genotypes from single nucleotide variant (SNV) frequency differences among tumor samples remains challenging, so there is a need to develop more accurate computational methods and robust software for clone genotype inference. Availability and implementation: CloneFinder is implemented in Python and is available from https://github.com/gstecher/CloneFinderAPI. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
Clone Cells, Genotype, Neoplasms/genetics, Software, Computational Biology, Computer Simulation, Humans, Single Nucleotide Polymorphism
6.
Bioinformatics ; 34(17): i917-i926, 2018 09 01.
Article in English | MEDLINE | ID: mdl-30423071

ABSTRACT

Motivation: Tumor sequencing has entered an exciting phase with the advent of single-cell techniques that are revolutionizing the assessment of single nucleotide variation (SNV) at the highest cellular resolution. However, state-of-the-art single-cell sequencing technologies produce data with many missing bases (MBs) and incorrect base designations that lead to false-positive (FP) and false-negative (FN) detection of somatic mutations. While computational methods are available to make biological inferences in the presence of these errors, the accuracy of the imputed MBs and corrected FPs and FNs remains unknown. Results: Using computer simulated datasets, we assessed the robustness performance of four existing methods (OncoNEM, SCG, SCITE and SiFit) and one new method (BEAM). BEAM is a Bayesian evolution-aware method that improves the quality of single-cell sequences by using the intrinsic evolutionary information in the single-cell data in a molecular phylogenetic framework. Overall, BEAM and SCITE performed the best. Most of the methods imputed MBs with high accuracy, but effective detection and correction of FPs and FNs is a challenge, especially for small datasets. Analysis of an empirical dataset shows that computational methods can improve both the quality of tumor single-cell sequences and their utility for biological inference. In conclusion, tumor cells descend from pre-existing cells, which creates evolutionary continuity in single-cell sequencing datasets. This information enables BEAM and other methods to correctly impute missing data and incorrect base assignments, but correction of FPs and FNs remains challenging when the number of SNVs sampled is small relative to the number of cells sequenced. Availability and implementation: BEAM is available on the web at https://github.com/SayakaMiura/BEAM.


Subject(s)
High-Throughput Nucleotide Sequencing/methods, Neoplasms/genetics, Single-Cell Analysis/methods, Bayes Theorem, Humans, Phylogeny
7.
BMC Cancer ; 18(1): 85, 2018 01 18.
Article in English | MEDLINE | ID: mdl-29347918

ABSTRACT

BACKGROUND: A unified analysis of DNA sequences from hundreds of tumors concluded that the driver mutations primarily occur in the earliest stages of cancer formation, with relatively few driver mutation events detected in the late-arising subclones. However, emerging evidence from the sequencing of multiple tumors and tumor regions per individual suggests that late-arising subclones with additional driver mutations are underestimated in single-sample analyses. METHODS: To test whether driver mutations generally map to early tumor development, we examined multi-regional tumor sequencing data from 101 individuals reported in 11 published studies. Following previous studies, we annotated mutations as early-arising when all tumors/regions had those mutations (ubiquitous). We then inferred the fraction of mutations occurring early and compared it with late-arising mutations that were found in only single tumors/regions. RESULTS: While a large fraction of driver mutations in tumors occurred relatively early in cancers, later driver mutations occurred at least as frequently as the early drivers in a substantial number of patients. This result was robust to many different approaches to annotate driver mutations. The relative frequency of early and late driver mutations varied among patients of the same cancer type and in different cancer types. We found that previous reports of the preponderance of early driver mutations were primarily informed by analysis of single tumor variant allele profiles, with which it is challenging to clearly distinguish between early and late drivers. CONCLUSIONS: The origin and preponderance of new driver mutations are not limited to early stages of tumor evolution, with different tumors and regions showing distinct driver mutations and, consequently, distinct characteristics. Therefore, tumors with extensive intratumor heterogeneity appear to have many newly acquired drivers.
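The annotation rule described in METHODS (a mutation is early-arising when it is present in all tumors/regions of an individual, and late-arising when it is found in only a single tumor/region) can be sketched in Python as follows. This is an illustrative sketch, not the authors' code: the function names, the "shared" label for mutations found in some but not all regions, and the example mutation and region names are all hypothetical.

```python
def classify_mutations(presence, n_regions):
    """Label each mutation by how many sampled regions carry it.

    presence maps a mutation name to the set of regions in which it was
    detected; n_regions is the number of regions sampled for the individual.
    """
    labels = {}
    for mutation, regions in presence.items():
        if len(regions) == n_regions:
            labels[mutation] = "early"   # ubiquitous across all regions
        elif len(regions) == 1:
            labels[mutation] = "late"    # private to a single region
        else:
            labels[mutation] = "shared"  # assumed label; neither ubiquitous nor private
    return labels


def early_late_fractions(labels):
    """Return the fractions of mutations labeled early and late."""
    total = len(labels)
    if total == 0:
        return 0.0, 0.0
    values = list(labels.values())
    return values.count("early") / total, values.count("late") / total


if __name__ == "__main__":
    # Hypothetical individual with three sampled regions (R1 to R3).
    presence = {
        "mutation_A": {"R1", "R2", "R3"},
        "mutation_B": {"R2"},
        "mutation_C": {"R1", "R3"},
    }
    labels = classify_mutations(presence, n_regions=3)
    print(labels)
    print(early_late_fractions(labels))
```

Note that with a single sampled region every mutation is both ubiquitous and private, so the sketch labels it "early"; this reflects the abstract's point that single-sample data cannot clearly separate early from late drivers.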


Subject(s)
Base Sequence/genetics, Carcinogenesis/genetics, Clonal Evolution/genetics, Neoplasms/genetics, Genetic Heterogeneity, Humans, Mutation/genetics, Neoplasms/pathology, DNA Sequence Analysis
9.
Science ; 384(6698): eadh3707, 2024 May 24.
Article in English | MEDLINE | ID: mdl-38781393

ABSTRACT

The molecular pathology of stress-related disorders remains elusive. Our brain multiregion, multiomic study of posttraumatic stress disorder (PTSD) and major depressive disorder (MDD) included the central nucleus of the amygdala, hippocampal dentate gyrus, and medial prefrontal cortex (mPFC). Genes and exons within the mPFC carried most disease signals replicated across two independent cohorts. Pathways pointed to immune function, neuronal and synaptic regulation, and stress hormones. Multiomic factor and gene network analyses provided the underlying genomic structure. Single nucleus RNA sequencing in dorsolateral PFC revealed dysregulated (stress-related) signals in neuronal and non-neuronal cell types. Analyses of brain-blood intersections in >50,000 UK Biobank participants were conducted along with fine-mapping of the results of PTSD and MDD genome-wide association studies to distinguish risk from disease processes. Our data suggest shared and distinct molecular pathology in both disorders and propose potential therapeutic targets and biomarkers.


Subject(s)
Brain, Major Depressive Disorder, Genetic Loci, Post-Traumatic Stress Disorders, Female, Humans, Male, Amygdala/metabolism, Biomarkers/metabolism, Brain/metabolism, Major Depressive Disorder/genetics, Gene Regulatory Networks, Genome-Wide Association Study, Neurons/metabolism, Prefrontal Cortex/metabolism, Post-Traumatic Stress Disorders/genetics, Systems Biology, Single-Cell Gene Expression Analysis, Chromosome Mapping
10.
Am J Psychiatry ; 179(3): 226-241, 2022 Mar.
Article in English | MEDLINE | ID: mdl-35236118

ABSTRACT

OBJECTIVE: The authors sought to study the transcriptomic and genomic features of completed suicide by parsing the method chosen, to capture molecular correlates of the distinctive frame of mind of individuals who die by suicide, while reducing heterogeneity. METHODS: The authors analyzed gene expression (RNA sequencing) from postmortem dorsolateral prefrontal cortex of patients who died by suicide with violent compared with nonviolent means, nonsuicide patients with the same psychiatric disorders, and a neurotypical group (total N=329). They then examined genomic risk scores (GRSs) for each psychiatric disorder included, and GRSs for cognition (IQ) and for suicide attempt, testing how they predict diagnosis or traits (total N=888). RESULTS: Patients who died by suicide by violent means showed a transcriptomic pattern remarkably divergent from each of the other patient groups but less from the neurotypical group; consistently, their genomic profile of risk was relatively low for their diagnosed illness as well as for suicide attempt, and relatively high for IQ: they were more similar to the neurotypical group than to other patients. Differentially expressed genes (DEGs) associated with patients who died by suicide by violent means pointed to purinergic signaling in microglia, showing similarities to a genome-wide association study of Drosophila aggression. Weighted gene coexpression network analysis revealed that these DEGs were coexpressed in a context of mitochondrial metabolic activation unique to suicide by violent means. CONCLUSIONS: These findings suggest that patients who die by suicide by violent means are in part biologically separable from other patients with the same diagnoses, and their behavioral outcome may be less dependent on genetic risk for conventional psychiatric disorders and be associated with an alteration of purinergic signaling and mitochondrial metabolism.


Subject(s)
Completed Suicide, Brain, Genome-Wide Association Study, Humans, Transcriptome/genetics, Violence/psychology
11.
Neuron ; 109(19): 3088-3103.e5, 2021 10 06.
Article in English | MEDLINE | ID: mdl-34582785

ABSTRACT

Single-cell gene expression technologies are powerful tools to study cell types in the human brain, but efforts have largely focused on cortical brain regions. We therefore created a single-nucleus RNA-sequencing resource of 70,615 high-quality nuclei to generate a molecular taxonomy of cell types across five human brain regions that serve as key nodes of the human brain reward circuitry: nucleus accumbens, amygdala, subgenual anterior cingulate cortex, hippocampus, and dorsolateral prefrontal cortex. We first identified novel subpopulations of interneurons and medium spiny neurons (MSNs) in the nucleus accumbens and further characterized robust GABAergic inhibitory cell populations in the amygdala. Joint analyses across the 107 reported cell classes revealed cell-type substructure and unique patterns of transcriptomic dynamics. We identified discrete subpopulations of D1- and D2-expressing MSNs in the nucleus accumbens to which we mapped cell-type-specific enrichment for genetic risk associated with both psychiatric disease and addiction.


Subject(s)
Brain/physiology, Cell Nucleus/genetics, Cell Nucleus/physiology, Gene Expression Profiling, Nerve Net/physiology, Reward, Brain Mapping, Genome-Wide Association Study, High-Throughput Nucleotide Sequencing, Humans, Interneurons/physiology, Mental Disorders/genetics, Neurons/physiology, RNA Sequence Analysis, Substance-Related Disorders/genetics, gamma-Aminobutyric Acid/physiology