Results 1 - 20 of 77
1.
bioRxiv ; 2024 Mar 13.
Article in English | MEDLINE | ID: mdl-38559266

ABSTRACT

Tens of thousands of RNA-sequencing experiments comprising hundreds of thousands of individual samples have now been performed. These data represent a broad range of experimental conditions, sequencing technologies, and hypotheses under study. The Recount project has aggregated and uniformly processed hundreds of thousands of publicly available RNA-seq samples. Most of these samples only include RNA expression measurements; genotype data for these same samples would enable a wide range of analyses including variant prioritization, eQTL analysis, and studies of allele-specific expression. Here, we developed a statistical model based on the existing reference and alternative read counts from the RNA-seq experiments available through Recount3 to predict genotypes at autosomal biallelic loci in coding regions. We demonstrate the accuracy of our model using large-scale studies that measured both gene expression and genotype genome-wide. We show that our predictive model is highly accurate with 99.5% overall accuracy, 99.6% major allele accuracy, and 90.4% minor allele accuracy. Our model is robust to tissue and study effects, provided the coverage is high enough. We applied this model to genotype all the samples in Recount3 and provide the largest ready-to-use expression repository containing genotype information. We illustrate that the predicted genotype from RNA-seq data is sufficient to unravel the underlying population structure of samples in Recount3 using Principal Component Analysis.
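To make the idea of calling a genotype from reference and alternative read counts concrete, here is a minimal sketch in Python. It is not the model described in the abstract: it assumes a plain binomial likelihood with a hypothetical per-read error rate (`error_rate`), and the function name `call_genotype` is invented for illustration.

```python
import math


def call_genotype(ref_count, alt_count, error_rate=0.01):
    """Pick the most likely number of alternative alleles (0, 1, or 2).

    Illustrative binomial model only (an assumption, not the published
    method). Returns None when the locus has no coverage.
    """
    if ref_count < 0 or alt_count < 0:
        raise ValueError("read counts must be non-negative")
    if not 0.0 < error_rate < 0.5:
        raise ValueError("error_rate must lie strictly between 0 and 0.5")
    if ref_count + alt_count == 0:
        return None

    # Probability that a single read shows the alternative allele,
    # given 0, 1, or 2 copies of that allele.
    alt_prob = {0: error_rate, 1: 0.5, 2: 1.0 - error_rate}

    def log_likelihood(p):
        # The binomial coefficient is the same for every genotype,
        # so it can be dropped when comparing genotypes.
        return alt_count * math.log(p) + ref_count * math.log(1.0 - p)

    return max(alt_prob, key=lambda g: log_likelihood(alt_prob[g]))
```

Under this toy model a locus with very few reads gives nearly equal likelihoods for neighbouring genotypes, which is one way to see why accuracy depends on coverage.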

3.
ArXiv ; 2023 Jun 05.
Article in English | MEDLINE | ID: mdl-37332562

ABSTRACT

Software is vital for the advancement of biology and medicine. Through analysis of usage and impact metrics of software, developers can help determine user and community engagement. These metrics can be used to justify additional funding, encourage additional use, and identify unanticipated use cases. Such analyses can help define improvement areas and assist with managing project resources. However, there are challenges associated with assessing usage and impact, many of which vary widely depending on the type of software being evaluated. These challenges involve issues of distorted, exaggerated, understated, or misleading metrics, as well as ethical and security concerns. More attention to the nuances, challenges, and considerations involved in capturing impact across the diverse spectrum of biological software is needed. Furthermore, some tools may be especially beneficial to a small audience, yet may not have comparatively compelling metrics of high usage. Although some principles are generally applicable, there is not a single perfect metric or approach to effectively evaluate a software tool's impact, as this depends on aspects unique to each tool, how it is used, and how one wishes to evaluate engagement. We propose more broadly applicable guidelines (such as infrastructure that supports the usage of software and the collection of metrics about usage), as well as strategies for various types of software and resources. We also highlight outstanding issues in the field regarding how communities measure or evaluate software impact. To gain a deeper understanding of the issues hindering software evaluations, as well as to determine what appears to be helpful, we performed a survey of participants involved with scientific software projects for the Informatics Technology for Cancer Research (ITCR) program funded by the National Cancer Institute (NCI). 
We also investigated software among this scientific community and others to assess how often infrastructure that supports such evaluations is implemented and how this impacts rates of papers describing usage of the software. We find that although developers recognize the utility of analyzing data related to the impact or usage of their software, they struggle to find the time or funding to support such analyses. We also find that infrastructure such as social media presence, more in-depth documentation, the presence of software health metrics, and clear information on how to contact developers seem to be associated with increased usage rates. Our findings can help scientific software developers make the most out of the evaluations of their software so that they can more fully benefit from such assessments.

4.
J Stat Data Sci Educ ; 31(1): 57-65, 2023.
Article in English | MEDLINE | ID: mdl-37207236

ABSTRACT

Data science and informatics tools are developing at a blistering rate, but their users often lack the educational background or resources to efficiently apply the methods to their research. Training resources and vignettes that accompany these tools often become outdated because funding does not prioritize their maintenance, leaving teams little time to devote to such endeavors. Our group has developed Open-source Tools for Training Resources (OTTR) to offer greater efficiency and flexibility for creating and maintaining these training resources. OTTR empowers creators to customize their work and allows for a simple workflow to publish using multiple platforms. OTTR allows content creators to publish training material to multiple massive online learner communities using familiar rendering mechanics. OTTR allows the incorporation of pedagogical practices like formative and summative assessments in the form of multiple-choice questions and fill-in-the-blank problems that are automatically graded. No local installation of any software is required to begin creating content with OTTR. Thus far, 15 training courses have been created with the OTTR repository template. By using the OTTR system, the maintenance workload for updating these courses across platforms has been drastically reduced. For more information about OTTR and how to get started, go to ottrproject.org.

5.
F1000Res ; 12: 1240, 2023.
Article in English | MEDLINE | ID: mdl-38764793

ABSTRACT

Data science education provides tremendous opportunities but remains inaccessible to many communities. Increasing the accessibility of data science to these communities not only benefits the individuals entering data science, but also increases the field's innovation and potential impact as a whole. Education is the most scalable solution to meet these needs, but many data science educators lack formal training in education. Our group has led education efforts for a variety of audiences: from professional scientists to high school students to lay audiences. These experiences have helped form our teaching philosophy, which we have summarized into three main ideals: 1) motivation, 2) inclusivity, and 3) realism. We also aim to iteratively update our teaching approaches and curriculum as we find ways to better reach these ideals. In this manuscript we discuss these ideals as well as practical ideas for how to implement these philosophies in the classroom.


Subject(s)
Data Science , Motivation , Humans , Data Science/education , Curriculum , Teaching
6.
Cell Genom ; 2(1), 2022 Jan 12.
Article in English | MEDLINE | ID: mdl-35199087

ABSTRACT

The NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL; https://anvilproject.org) was developed to address a widespread community need for a unified computing environment for genomics data storage, management, and analysis. In this perspective, we present AnVIL, describe its ecosystem and interoperability with other platforms, and highlight how this platform and associated initiatives contribute to improved genomic data sharing efforts. The AnVIL is a federated cloud platform designed to manage and store genomics and related data, enable population-scale analysis, and facilitate collaboration through the sharing of data, code, and analysis results. By inverting the traditional model of data sharing, the AnVIL eliminates the need for data movement, adds security measures for active threat detection and monitoring, and provides scalable, shared computing resources for any researcher. We describe the core data management and analysis components of the AnVIL, which currently consist of Terra, Gen3, Galaxy, RStudio/Bioconductor, Dockstore, and Jupyter, and describe several flagship genomics datasets available within the AnVIL. We continue to extend and innovate the AnVIL ecosystem by implementing new capabilities, including mechanisms for interoperability and responsible data sharing, while streamlining access management. The AnVIL opens many new opportunities for analysis, collaboration, and data sharing that are needed to drive research and to make discoveries through the joint analysis of hundreds of thousands to millions of genomes along with associated clinical and molecular data types.

7.
Genome Biol ; 22(1): 323, 2021 11 29.
Article in English | MEDLINE | ID: mdl-34844637

ABSTRACT

We present recount3, a resource consisting of over 750,000 publicly available human and mouse RNA sequencing (RNA-seq) samples uniformly processed by our new Monorail analysis pipeline. To facilitate access to the data, we provide the recount3 and snapcount R/Bioconductor packages as well as complementary web resources. Using these tools, data can be downloaded as study-level summaries or queried for specific exon-exon junctions, genes, samples, or other features. Monorail can be used to process local and/or private data, allowing results to be directly compared to any study in recount3. Taken together, our tools help biologists maximize the utility of publicly available RNA-seq data, especially to improve their understanding of newly collected data. recount3 is available from http://rna.recount.bio.


Subject(s)
RNA Splicing , RNA-Seq/methods , RNA/genetics , Animals , Base Sequence , Computational Biology/methods , Exons , Gene Expression Regulation , High-Throughput Nucleotide Sequencing , Humans , Mice , Sequence Analysis, RNA/methods , Software
8.
J Thromb Haemost ; 19(7): 1783-1799, 2021 07.
Article in English | MEDLINE | ID: mdl-33829634

ABSTRACT

BACKGROUND: There is interest in deriving megakaryocytes (MKs) from induced pluripotent stem cells (iPSCs) for biological studies. We previously found that genomic structural integrity and genotype concordance are maintained in iPSC-derived MKs. OBJECTIVE: To establish a comprehensive dataset of genes and proteins expressed in iPSC-derived MKs. METHODS: iPSCs were reprogrammed from peripheral blood mononuclear cells (MNCs) and MKs were derived from the iPSCs in 194 healthy European American and African American subjects. mRNA was isolated and gene expression measured by RNA sequencing. Protein expression was measured in 62 of the subjects using mass spectrometry. RESULTS AND CONCLUSIONS: MKs expressed genes and proteins known to be important in MK and platelet function and demonstrated good agreement with previous studies in human MKs derived from CD34+ progenitor cells. The percent of cells expressing the MK markers CD41 and CD42a was consistent in biological replicates, but variable across subjects, suggesting that unidentified subject-specific factors determine differentiation of MKs from iPSCs. Gene and protein sets important in platelet function were associated with increasing expression of CD41/42a, while those related to more basic cellular functions were associated with lower CD41/42a expression. There was differential gene expression by the sex and race (but not age) of the subject. Numerous genes and proteins were highly expressed in MKs but not known to play a role in MK or platelet function; these represent excellent candidates for future study of hematopoiesis, platelet formation, and/or platelet function.


Subject(s)
Induced Pluripotent Stem Cells , Blood Platelets , Cell Differentiation , Genomics , Humans , Leukocytes, Mononuclear , Megakaryocytes
9.
Blood ; 137(7): 959-968, 2021 02 18.
Article in English | MEDLINE | ID: mdl-33094331

ABSTRACT

Genome-wide association studies have identified common variants associated with platelet-related phenotypes, but because these variants are largely intronic or intergenic, their link to platelet biology is unclear. In 290 normal subjects from the GeneSTAR Research Study (110 African Americans [AAs] and 180 European Americans [EAs]), we generated whole-genome sequence data from whole blood and RNA sequence data from extracted nonribosomal RNA from 185 induced pluripotent stem cell-derived megakaryocyte (MK) cell lines (platelet precursor cells) and 290 blood platelet samples from these subjects. Using eigenMT software to select the peak single-nucleotide polymorphism (SNP) for each expressed gene, and meta-analyzing the results of AAs and EAs, we identify (q-value < 0.05) 946 cis-expression quantitative trait loci (eQTLs) in derived MKs and 1830 cis-eQTLs in blood platelets. Among the 57 eQTLs shared between the 2 tissues, the estimated directions of effect are very consistent (98.2% concordance). A high proportion of detected cis-eQTLs (74.9% in MKs and 84.3% in platelets) are unique to MKs and platelets compared with peak-associated SNP-expressed gene pairs of 48 other tissue types that are reported in version V7 of the Genotype-Tissue Expression Project. The locations of our identified eQTLs are significantly enriched for overlap with several annotation tracks highlighting genomic regions with specific functionality in MKs, including MK-specific DNAse hotspots, H3K27-acetylation marks, H3K4-methylation marks, enhancers, and superenhancers. These results offer insights into the regulatory signature of MKs and platelets, with significant overlap in genes expressed, eQTLs detected, and enrichment within known superenhancers relevant to platelet biology.


Subject(s)
Blood Platelets/metabolism , Induced Pluripotent Stem Cells/cytology , Megakaryocytes/metabolism , RNA/genetics , Transcriptome , Adult , Black People/genetics , Blood Platelets/cytology , Cells, Cultured , Female , Gene Ontology , Genome-Wide Association Study , Humans , Male , Megakaryocytes/cytology , Organ Specificity , Polymorphism, Single Nucleotide , Quantitative Trait Loci , RNA/biosynthesis , RNA-Seq , White People/genetics , Whole Genome Sequencing
10.
Proc Natl Acad Sci U S A ; 117(48): 30266-30275, 2020 12 01.
Article in English | MEDLINE | ID: mdl-33208538

ABSTRACT

Many modern problems in medicine and public health leverage machine-learning methods to predict outcomes based on observable covariates. In a wide array of settings, predicted outcomes are used in subsequent statistical analysis, often without accounting for the distinction between observed and predicted outcomes. We call inference with predicted outcomes postprediction inference. In this paper, we develop methods for correcting statistical inference using outcomes predicted with arbitrarily complicated machine-learning models including random forests and deep neural nets. Rather than trying to derive the correction from first principles for each machine-learning algorithm, we observe that there is typically a low-dimensional and easily modeled representation of the relationship between the observed and predicted outcomes. We build an approach for postprediction inference that naturally fits into the standard machine-learning framework where the data are divided into training, testing, and validation sets. We train the prediction model in the training set, estimate the relationship between the observed and predicted outcomes in the testing set, and use that relationship to correct subsequent inference in the validation set. We show our postprediction inference (postpi) approach can correct bias and improve variance estimation and subsequent statistical inference with predicted outcomes. To show the broad range of applicability of our approach, we show postpi can improve inference in two distinct fields: modeling predicted phenotypes in repurposed gene expression data and modeling predicted causes of death in verbal autopsy data. Our method is available through an open-source R package: https://github.com/leekgroup/postpi.
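The three-step workflow in this abstract (train a predictor, model observed versus predicted outcomes in the testing set, correct inference in the validation set) can be sketched for a slope estimate as follows. This is a simplified illustration under stated assumptions, not the postpi package's API: it assumes a linear relationship between observed and predicted outcomes, covers only the point estimate (no standard errors), and the function names are hypothetical.

```python
import numpy as np


def fit_line(x, y):
    """Ordinary least-squares fit of y on x; returns (slope, intercept)."""
    slope, intercept = np.polyfit(np.asarray(x, dtype=float),
                                  np.asarray(y, dtype=float), 1)
    return slope, intercept


def corrected_slope(y_test, yhat_test, x_val, yhat_val):
    """Correct a slope estimated from predicted outcomes (illustrative).

    Testing set: model observed outcomes as a linear function of the
    predictions. Validation set: regress predictions on the covariate.
    Under the linearity assumption, the product of the two slopes
    estimates the slope of the observed outcome on the covariate.
    """
    relationship_slope, _ = fit_line(yhat_test, y_test)  # observed ~ predicted
    predicted_slope, _ = fit_line(x_val, yhat_val)       # predicted ~ covariate
    return relationship_slope * predicted_slope
```

For example, if predictions systematically shrink the true outcome by half, the naive slope in the validation set is half the true slope, and multiplying by the testing-set relationship slope of 2 undoes that shrinkage.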


Subject(s)
Machine Learning , Cause of Death , Computer Simulation , Humans , Organ Specificity
12.
Genome Res ; 30(7): 1073-1081, 2020 07.
Article in English | MEDLINE | ID: mdl-32079618

ABSTRACT

Long noncoding RNAs (lncRNAs) have emerged as key coordinators of biological and cellular processes. Characterizing lncRNA expression across cells and tissues is key to understanding their role in determining phenotypes, including human diseases. We present here FC-R2, a comprehensive expression atlas across a broadly defined human transcriptome, inclusive of over 109,000 coding and noncoding genes, as described in the FANTOM CAGE-Associated Transcriptome (FANTOM-CAT) study. This atlas greatly extends the gene annotation used in the original recount2 resource. We demonstrate the utility of the FC-R2 atlas by reproducing key findings from published large studies and by generating new results across normal and diseased human samples. In particular, we (a) identify tissue-specific transcription profiles for distinct classes of coding and noncoding genes, (b) perform differential expression analysis across thirteen cancer types, identifying novel noncoding genes potentially involved in tumor pathogenesis and progression, and (c) confirm the prognostic value of the expression of several enhancer lncRNAs in cancer. Our resource is instrumental for the systematic molecular characterization of lncRNA by the FANTOM6 Consortium. In conclusion, comprising over 70,000 samples, the FC-R2 atlas will empower other researchers to investigate functions and biological roles of both known coding genes and novel lncRNAs.


Subject(s)
Transcriptome , Databases, Genetic , Enhancer Elements, Genetic , Gene Expression Profiling , Genome, Human , Humans , Neoplasms/genetics , Organ Specificity , Prognosis , RNA, Long Noncoding/genetics , RNA, Long Noncoding/metabolism , RNA, Messenger/metabolism
13.
J Stat Educ ; 28(1): 98-108, 2020.
Article in English | MEDLINE | ID: mdl-33762806

ABSTRACT

We performed an empirical study of the perceived quality of scientific graphics produced by beginning R users in two plotting systems: the base graphics package ("base R") and the ggplot2 add-on package. In our experiment, students taking a data science course on the Coursera platform were randomized to complete identical plotting exercises using either base R or ggplot2. This exercise involved creating two plots: one bivariate scatterplot and one plot of a multivariate relationship that necessitated using color or panels. Students evaluated their peers on visual characteristics key to clear scientific communication, including plot clarity and sufficient labeling. We observed that graphics created with the two systems rated similarly on many characteristics. However, ggplot2 graphics were generally perceived by students to be slightly clearer overall with respect to presentation of a scientific relationship. This increase was more pronounced for the multivariate relationship. Through expert analysis of submissions, we also find that certain concrete plot features (e.g., trend lines, axis labels, legends, panels, and color) tend to be used more commonly in one system than the other. These observations may help educators emphasize the use of certain plot features targeted to correct common student mistakes.

14.
Nat Hum Behav ; 3(8): 886, 2019 Aug.
Article in English | MEDLINE | ID: mdl-31358976

ABSTRACT

An amendment to this paper has been published and can be accessed via a link at the top of the paper.

16.
Genome Biol ; 20(1): 94, 2019 05 16.
Article in English | MEDLINE | ID: mdl-31097038

ABSTRACT

Gene co-expression networks capture biological relationships between genes and are important tools in predicting gene function and understanding disease mechanisms. We show that technical and biological artifacts in gene expression data confound commonly used network reconstruction algorithms. We demonstrate theoretically, in simulation, and empirically, that principal component correction of gene expression measurements prior to network inference can reduce false discoveries. Using data from the GTEx project in multiple tissues, we show that this approach reduces false discoveries beyond correcting only for known confounders.
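As a rough illustration of principal component correction before network inference, the sketch below removes the top principal components from a samples-by-genes expression matrix and then computes gene-gene correlations on the residuals. The function names, the use of plain Pearson correlation as the "network", and the choice of how many components to remove (`n_pcs`) are assumptions for illustration; the abstract does not specify these details.

```python
import numpy as np


def remove_top_pcs(expr, n_pcs):
    """Remove the top n_pcs principal components from expr.

    expr is a samples x genes matrix. Each gene is centered, the matrix
    is decomposed with an SVD, and the leading components are zeroed out
    before the matrix is rebuilt.
    """
    expr = np.asarray(expr, dtype=float)
    centered = expr - expr.mean(axis=0, keepdims=True)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    s_resid = s.copy()
    s_resid[:n_pcs] = 0.0  # drop the leading components
    return (u * s_resid) @ vt


def coexpression_after_pc_removal(expr, n_pcs):
    """Gene-gene Pearson correlation matrix of the PC-corrected data."""
    residuals = remove_top_pcs(expr, n_pcs)
    return np.corrcoef(residuals, rowvar=False)
```

The intuition is that broad technical or biological artifacts tend to load on the leading components and inflate correlations between many unrelated gene pairs; removing them before building the network is meant to reduce such spurious edges.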


Subject(s)
Gene Regulatory Networks , Genetic Techniques , Artifacts , Humans
17.
PeerJ ; 6: e6035, 2018.
Article in English | MEDLINE | ID: mdl-30581661

ABSTRACT

Modern scientific studies from many diverse areas of research abound with multiple hypothesis testing concerns. The false discovery rate (FDR) is one of the most commonly used approaches for measuring and controlling error rates when performing multiple tests. Adaptive FDRs rely on an estimate of the proportion of null hypotheses among all the hypotheses being tested. This proportion is typically estimated once for each collection of hypotheses. Here, we propose a regression framework to estimate the proportion of null hypotheses conditional on observed covariates. This may then be used as a multiplication factor with the Benjamini-Hochberg adjusted p-values, leading to a plug-in FDR estimator. We apply our method to a genome-wide association meta-analysis for body mass index. In our framework, we are able to use the sample sizes for the individual genomic loci and the minor allele frequencies as covariates. We further evaluate our approach via a number of simulation scenarios. We provide an implementation of this novel method for estimating the proportion of null hypotheses in a regression framework as part of the Bioconductor package swfdr.
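The plug-in step described here, multiplying Benjamini-Hochberg adjusted p-values by a per-test estimate of the null proportion, can be sketched as below. This is not the swfdr implementation: the function names are hypothetical, and the per-test null proportions (`pi0`) are assumed to have been estimated elsewhere, for instance by the regression on covariates that the abstract proposes.

```python
import numpy as np


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    # p_(i) * m / i for the sorted p-values.
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity by taking running minima from the largest p-value down.
    monotone = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(monotone, 0.0, 1.0)
    return adjusted


def plugin_fdr(pvalues, pi0):
    """Multiply BH-adjusted p-values by per-test null proportions.

    pi0 holds one estimated null proportion per test (assumed given).
    """
    pi0 = np.asarray(pi0, dtype=float)
    return pi0 * bh_adjust(pvalues)
```

A test whose covariates suggest a low null proportion therefore receives a smaller estimated FDR than another test with the same p-value but a null proportion near 1.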

18.
PeerJ ; 6: e5597, 2018.
Article in English | MEDLINE | ID: mdl-30225177

ABSTRACT

Most researchers do not deliberately claim causal results in an observational study. But do we lead our readers to draw a causal conclusion unintentionally by explaining why significant correlations and relationships may exist? Here we perform a randomized controlled experiment in a massive open online course run in 2013 that teaches data analysis concepts to test the hypothesis that explaining an analysis will lead readers to interpret an inferential analysis as causal. We test this hypothesis with a single example of an observational study on the relationship between smoking and cancer. We show that adding an explanation to the description of an inferential analysis leads to a 15.2% increase in readers interpreting the analysis as causal (95% confidence interval for difference in two proportions: 12.8%-17.5%). We then replicate this finding in a second large scale massive open online course. Nearly every scientific study, regardless of the study design, includes an explanation for observed effects. Our results suggest that these explanations may be misleading to the audience of these data analyses and that qualification of explanations could be a useful avenue of exploration in future research to counteract the problem. Our results invite many opportunities for further research to broaden the scope of these findings beyond the single smoking-cancer example examined here.
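For readers unfamiliar with a confidence interval for a difference in two proportions, a minimal sketch follows. The abstract does not state which interval the authors used or the underlying counts; the Wald (normal approximation) interval and the function name below are assumptions, and any counts passed in are the caller's own.

```python
import math


def diff_in_proportions_ci(successes_a, n_a, successes_b, n_b, z=1.96):
    """Wald interval for p_a - p_b; returns (difference, lower, upper).

    z = 1.96 corresponds to an approximate 95% confidence level.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("group sizes must be positive")
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    diff = p_a - p_b
    # Standard error of the difference between two independent proportions.
    se = math.sqrt(p_a * (1.0 - p_a) / n_a + p_b * (1.0 - p_b) / n_b)
    return diff, diff - z * se, diff + z * se
```

With group sizes in the thousands, as is plausible for a massive open online course, the standard error shrinks enough that an interval a few percentage points wide, like the one reported, becomes attainable.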

19.
Nat Neurosci ; 21(8): 1117-1125, 2018 08.
Article in English | MEDLINE | ID: mdl-30050107

ABSTRACT

Genome-wide association studies have identified 108 schizophrenia risk loci, but biological mechanisms for individual loci are largely unknown. Using developmental, genetic and illness-based RNA sequencing expression analysis in human brain, we characterized the human brain transcriptome around these loci and found enrichment for developmentally regulated genes with novel examples of shifting isoform usage across pre- and postnatal life. We found widespread expression quantitative trait loci (eQTLs), including many with transcript specificity and previously unannotated sequence that were independently replicated. We leveraged this general eQTL database to show that 48.1% of risk variants for schizophrenia associate with nearby expression. Lastly, we found 237 genes significantly differentially expressed between patients and controls, which replicated in an independent dataset, implicated synaptic processes, and were strongly regulated in early development. These findings together offer genetics- and diagnosis-related targets for better modeling of schizophrenia risk. This resource is publicly available at http://eqtl.brainseq.org/phase1.


Subject(s)
Gene Expression Regulation, Developmental/genetics , Prefrontal Cortex/growth & development , Prefrontal Cortex/physiopathology , Schizophrenia/genetics , Schizophrenia/physiopathology , Transcriptome/genetics , Adolescent , Adult , Autopsy , Child , Child, Preschool , Chronic Disease , Databases, Genetic , Female , Genetic Predisposition to Disease/genetics , Genetic Variation , Genotype , Humans , Infant , Male , Polymorphism, Single Nucleotide , Pregnancy , Sequence Analysis, RNA
20.
Nucleic Acids Res ; 46(9): e54, 2018 05 18.
Article in English | MEDLINE | ID: mdl-29514223

ABSTRACT

Publicly available genomic data are a valuable resource for studying normal human variation and disease, but these data are often not well labeled or annotated. The lack of phenotype information for public genomic data severely limits their utility for addressing targeted biological questions. We develop an in silico phenotyping approach for predicting critical missing annotation directly from genomic measurements using well-annotated genomic and phenotypic data produced by consortia like TCGA and GTEx as training data. We apply in silico phenotyping to a set of 70,000 RNA-seq samples we recently processed on a common pipeline as part of the recount2 project. We use gene expression data to build and evaluate predictors for both biological phenotypes (sex, tissue, sample source) and experimental conditions (sequencing strategy). We demonstrate how these predictions can be used to study cross-sample properties of public genomic data, select genomic projects with specific characteristics, and perform downstream analyses using predicted phenotypes. The methods to perform phenotype prediction are available in the phenopredict R package and the predictions for recount2 are available from the recount R package. With data and phenotype information available for 70,000 human samples, expression data are available for use on a scale that was not previously feasible.


Subject(s)
Gene Expression Profiling , Phenotype , Sequence Analysis, RNA , Computer Simulation , Female , Humans , Male , Software