Results 1 - 20 of 75
2.
Genome Res ; 30(7): 1073-1081, 2020 07.
Article in English | MEDLINE | ID: mdl-32079618

ABSTRACT

Long noncoding RNAs (lncRNAs) have emerged as key coordinators of biological and cellular processes. Characterizing lncRNA expression across cells and tissues is key to understanding their role in determining phenotypes, including human diseases. We present here FC-R2, a comprehensive expression atlas across a broadly defined human transcriptome, inclusive of over 109,000 coding and noncoding genes, as described in the FANTOM CAGE-Associated Transcriptome (FANTOM-CAT) study. This atlas greatly extends the gene annotation used in the original recount2 resource. We demonstrate the utility of the FC-R2 atlas by reproducing key findings from published large studies and by generating new results across normal and diseased human samples. In particular, we (a) identify tissue-specific transcription profiles for distinct classes of coding and noncoding genes, (b) perform differential expression analysis across thirteen cancer types, identifying novel noncoding genes potentially involved in tumor pathogenesis and progression, and (c) confirm the prognostic value for several enhancer lncRNAs expression in cancer. Our resource is instrumental for the systematic molecular characterization of lncRNA by the FANTOM6 Consortium. In conclusion, comprised of over 70,000 samples, the FC-R2 atlas will empower other researchers to investigate functions and biological roles of both known coding genes and novel lncRNAs.


Subject(s)
Transcriptome , Databases, Genetic , Enhancer Elements, Genetic , Gene Expression Profiling , Genome, Human , Humans , Neoplasms/genetics , Organ Specificity , Prognosis , RNA, Long Noncoding/genetics , RNA, Long Noncoding/metabolism , RNA, Messenger/metabolism
3.
Blood ; 137(7): 959-968, 2021 02 18.
Article in English | MEDLINE | ID: mdl-33094331

ABSTRACT

Genome-wide association studies have identified common variants associated with platelet-related phenotypes, but because these variants are largely intronic or intergenic, their link to platelet biology is unclear. In 290 normal subjects from the GeneSTAR Research Study (110 African Americans [AAs] and 180 European Americans [EAs]), we generated whole-genome sequence data from whole blood and RNA sequence data from extracted nonribosomal RNA from 185 induced pluripotent stem cell-derived megakaryocyte (MK) cell lines (platelet precursor cells) and 290 blood platelet samples from these subjects. Using eigenMT software to select the peak single-nucleotide polymorphism (SNP) for each expressed gene, and meta-analyzing the results of AAs and EAs, we identify (q-value < 0.05) 946 cis-expression quantitative trait loci (eQTLs) in derived MKs and 1830 cis-eQTLs in blood platelets. Among the 57 eQTLs shared between the 2 tissues, the estimated directions of effect are very consistent (98.2% concordance). A high proportion of detected cis-eQTLs (74.9% in MKs and 84.3% in platelets) are unique to MKs and platelets compared with peak-associated SNP-expressed gene pairs of 48 other tissue types that are reported in version V7 of the Genotype-Tissue Expression Project. The locations of our identified eQTLs are significantly enriched for overlap with several annotation tracks highlighting genomic regions with specific functionality in MKs, including MK-specific DNAse hotspots, H3K27-acetylation marks, H3K4-methylation marks, enhancers, and superenhancers. These results offer insights into the regulatory signature of MKs and platelets, with significant overlap in genes expressed, eQTLs detected, and enrichment within known superenhancers relevant to platelet biology.


Subject(s)
Blood Platelets/metabolism , Induced Pluripotent Stem Cells/cytology , Megakaryocytes/metabolism , RNA/genetics , Transcriptome , Adult , Black People/genetics , Blood Platelets/cytology , Cells, Cultured , Female , Gene Ontology , Genome-Wide Association Study , Humans , Male , Megakaryocytes/cytology , Organ Specificity , Polymorphism, Single Nucleotide , Quantitative Trait Loci , RNA/biosynthesis , RNA-Seq , White People/genetics , Whole Genome Sequencing
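
For readers unfamiliar with the cis-eQTL testing mentioned in this entry's abstract, the sketch below shows one common per-pair computation: an ordinary least-squares regression of a gene's expression on SNP allele dosage (0, 1, or 2). It is a minimal Python illustration on simulated data, not the study's pipeline. The abstract does not state the regression model used, so the additive linear model, the function name `cis_eqtl_slope`, and the simulated effect size are all assumptions; the eigenMT peak-SNP selection and the AA/EA meta-analysis are not reproduced.

```python
import numpy as np

def cis_eqtl_slope(dosage, expression):
    """Least-squares slope and t-statistic for expression ~ dosage (hypothetical helper)."""
    x = np.asarray(dosage, dtype=float)
    y = np.asarray(expression, dtype=float)
    n = x.size
    xc = x - x.mean()
    sxx = np.sum(xc ** 2)
    slope = np.sum(xc * (y - y.mean())) / sxx
    intercept = y.mean() - slope * x.mean()
    resid = y - (intercept + slope * x)
    sigma2 = np.sum(resid ** 2) / (n - 2)   # residual variance
    se = np.sqrt(sigma2 / sxx)              # standard error of the slope
    return slope, slope / se

# Simulated example: 290 subjects, assumed true effect of 0.5 per allele.
rng = np.random.default_rng(0)
dosage = rng.integers(0, 3, size=290)
expression = 0.5 * dosage + rng.normal(0.0, 1.0, size=290)
slope, t_stat = cis_eqtl_slope(dosage, expression)
print(f"slope={slope:.3f}, t={t_stat:.2f}")
```

In a genome-wide setting this test would be repeated for every SNP near every expressed gene, which is why a multiple-testing step such as the q-value threshold quoted in the abstract is needed.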
4.
Proc Natl Acad Sci U S A ; 117(48): 30266-30275, 2020 12 01.
Article in English | MEDLINE | ID: mdl-33208538

ABSTRACT

Many modern problems in medicine and public health leverage machine-learning methods to predict outcomes based on observable covariates. In a wide array of settings, predicted outcomes are used in subsequent statistical analysis, often without accounting for the distinction between observed and predicted outcomes. We call inference with predicted outcomes postprediction inference. In this paper, we develop methods for correcting statistical inference using outcomes predicted with arbitrarily complicated machine-learning models including random forests and deep neural nets. Rather than trying to derive the correction from first principles for each machine-learning algorithm, we observe that there is typically a low-dimensional and easily modeled representation of the relationship between the observed and predicted outcomes. We build an approach for postprediction inference that naturally fits into the standard machine-learning framework where the data are divided into training, testing, and validation sets. We train the prediction model in the training set, estimate the relationship between the observed and predicted outcomes in the testing set, and use that relationship to correct subsequent inference in the validation set. We show our postprediction inference (postpi) approach can correct bias and improve variance estimation and subsequent statistical inference with predicted outcomes. To show the broad range of applicability of our approach, we show postpi can improve inference in two distinct fields: modeling predicted phenotypes in repurposed gene expression data and modeling predicted causes of death in verbal autopsy data. Our method is available through an open-source R package: https://github.com/leekgroup/postpi.


Subject(s)
Machine Learning , Cause of Death , Computer Simulation , Humans , Organ Specificity
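
The three-way split described in this entry's abstract (train the predictor, learn the observed-versus-predicted relationship on the testing set, correct inference on the validation set) can be sketched as follows. This is a simplified linear illustration in Python, not the postpi package's implementation: the shrunken least-squares "predictor", the linear relationship model, and the slope correction `g1 * naive_slope` are assumptions chosen to keep the example short.

```python
import numpy as np

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y ~ x."""
    slope, intercept = np.polyfit(x, y, 1)
    return intercept, slope

rng = np.random.default_rng(1)
n = 3000
x = rng.normal(size=n)                 # covariate of interest
z = rng.normal(size=n)                 # extra covariate seen by the predictor
y = 2.0 * x + z + rng.normal(scale=0.5, size=n)

train, test, valid = np.split(rng.permutation(n), [1000, 2000])

# 1) Training set: fit a deliberately over-shrunk predictor (assumption).
X = np.column_stack([x, z])
coef, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
y_hat = X @ (0.5 * coef)

# 2) Testing set: model observed outcomes as a function of predictions.
g0, g1 = fit_line(y_hat[test], y[test])

# 3) Validation set: naive inference on predictions, then corrected.
_, naive_slope = fit_line(x[valid], y_hat[valid])
corrected_slope = g1 * naive_slope

print(f"naive={naive_slope:.2f}, corrected={corrected_slope:.2f}, truth=2.00")
```

The naive slope inherits the predictor's shrinkage, while the corrected slope rescales it using the relationship learned on held-out data. The approach in the abstract also corrects variance estimates, which this sketch does not attempt.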
6.
Nucleic Acids Res ; 46(9): e54, 2018 05 18.
Article in English | MEDLINE | ID: mdl-29514223

ABSTRACT

Publicly available genomic data are a valuable resource for studying normal human variation and disease, but these data are often not well labeled or annotated. The lack of phenotype information for public genomic data severely limits their utility for addressing targeted biological questions. We develop an in silico phenotyping approach for predicting critical missing annotation directly from genomic measurements using well-annotated genomic and phenotypic data produced by consortia like TCGA and GTEx as training data. We apply in silico phenotyping to a set of 70 000 RNA-seq samples we recently processed on a common pipeline as part of the recount2 project. We use gene expression data to build and evaluate predictors for both biological phenotypes (sex, tissue, sample source) and experimental conditions (sequencing strategy). We demonstrate how these predictions can be used to study cross-sample properties of public genomic data, select genomic projects with specific characteristics, and perform downstream analyses using predicted phenotypes. The methods to perform phenotype prediction are available in the phenopredict R package and the predictions for recount2 are available from the recount R package. With data and phenotype information available for 70,000 human samples, expression data is available for use on a scale that was not previously feasible.


Subject(s)
Gene Expression Profiling , Phenotype , Sequence Analysis, RNA , Computer Simulation , Female , Humans , Male , Software
7.
Proc Natl Acad Sci U S A ; 114(27): 7130-7135, 2017 07 03.
Article in English | MEDLINE | ID: mdl-28634288

ABSTRACT

RNA sequencing (RNA-seq) is a powerful approach for measuring gene expression levels in cells and tissues, but it relies on high-quality RNA. We demonstrate here that statistical adjustment using existing quality measures largely fails to remove the effects of RNA degradation when RNA quality associates with the outcome of interest. Using RNA-seq data from molecular degradation experiments of human primary tissues, we introduce a method, quality surrogate variable analysis (qSVA), as a framework for estimating and removing the confounding effect of RNA quality in differential expression analysis. We show that this approach results in greatly improved replication rates (>3×) across two large independent postmortem human brain studies of schizophrenia and also removes potential RNA quality biases in earlier published work that compared expression levels of different brain regions and other diagnostic groups. Our approach can therefore improve the interpretation of differential expression analysis of transcriptomic data from human tissue.


Subject(s)
RNA/analysis , Sequence Analysis, RNA/methods , Algorithms , Animals , Computational Biology , DNA Replication , Gene Expression Profiling , Gene Expression Regulation , Genotype , Gray Matter , High-Throughput Nucleotide Sequencing , Humans , Oligonucleotide Array Sequence Analysis , RNA/genetics , Schizophrenia/genetics , Schizophrenia/metabolism , Transcriptome
8.
Nucleic Acids Res ; 45(2): e9, 2017 01 25.
Article in English | MEDLINE | ID: mdl-27694310

ABSTRACT

Differential expression analysis of RNA sequencing (RNA-seq) data typically relies on reconstructing transcripts or counting reads that overlap known gene structures. We previously introduced an intermediate statistical approach called differentially expressed region (DER) finder that seeks to identify contiguous regions of the genome showing differential expression signal at single base resolution without relying on existing annotation or potentially inaccurate transcript assembly. We present the derfinder software that improves our annotation-agnostic approach to RNA-seq analysis by: (i) implementing a computationally efficient bump-hunting approach to identify DERs that permits genome-scale analyses in a large number of samples, (ii) introducing a flexible statistical modeling framework, including multi-group and time-course analyses and (iii) introducing a new set of data visualizations for expressed region analysis. We apply this approach to public RNA-seq data from the Genotype-Tissue Expression (GTEx) project and BrainSpan project to show that derfinder permits the analysis of hundreds of samples at base resolution in R, identifies expression outside of known gene boundaries and can be used to visualize expressed regions at base-resolution. In simulations, our base resolution approaches enable discovery in the presence of incomplete annotation and are nearly as powerful as feature-level methods when the annotation is complete. derfinder analysis using expressed region-level and single base-level approaches provides a compromise between full transcript reconstruction and feature-level analysis. The package is available from Bioconductor at www.bioconductor.org/packages/derfinder.


Subject(s)
Gene Expression Profiling/methods , Software , Gene Expression Regulation , Genomics/methods , High-Throughput Nucleotide Sequencing , Molecular Sequence Annotation , Organ Specificity/genetics , Transcriptome , Web Browser
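
The expressed-region idea in this entry's abstract, finding contiguous stretches of the genome with expression signal at base resolution and without annotation, can be illustrated with a small sketch. The Python below is not derfinder's code (derfinder is an R/Bioconductor package). The function name `expressed_regions`, the use of mean coverage across samples, and the cutoff value are assumptions made for illustration.

```python
import numpy as np

def expressed_regions(coverage, cutoff):
    """Return half-open (start, end) intervals where coverage exceeds cutoff."""
    above = np.asarray(coverage, dtype=float) > cutoff
    # Pad with False so regions touching either end still produce two edges.
    padded = np.concatenate(([False], above, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    # Edges alternate: region start, region end, region start, ...
    return [(int(s), int(e)) for s, e in zip(edges[0::2], edges[1::2])]

# Toy base-level coverage: 2 samples x 10 bases.
coverage = np.array([
    [0, 0, 4, 6, 8, 0, 0, 3, 5, 0],
    [0, 1, 6, 6, 6, 1, 0, 3, 3, 0],
])
mean_coverage = coverage.mean(axis=0)
regions = expressed_regions(mean_coverage, cutoff=2.0)
print(regions)  # [(2, 5), (7, 9)]
```

Each resulting interval can then be summarized per sample and tested for differential expression, which is the region-level analysis the abstract contrasts with single base-level testing.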
9.
Hum Mol Genet ; 25(22): 4962-4982, 2016 11 15.
Article in English | MEDLINE | ID: mdl-28171598

ABSTRACT

We performed a thorough characterization of expressed repetitive element loci (RE) in the human orbitofrontal cortex (OFC) using directional RNA sequencing data. Considering only sequencing reads that map uniquely onto the human genome, we discovered that the overwhelming majority of intronic and exonic RE are expressed in the same orientation as the gene in which they reside. Our mapping approach enabled the identification of novel differentially expressed RE transcripts between the OFC and peripheral blood lymphocytes. Further analysis revealed that RE are extensively spliced into coding regions of gene transcripts yielding thousands of novel mRNA variants with altered coding potential. Lower frequency splicing of RE into untranslated regions of gene transcripts was also observed. The same pattern of RE splicing in the brain was also detected for Drosophila, zebrafish, mouse, rat, dog and rabbit. RE splicing occurs largely at canonical GT-AG splice junctions with LINE and SINE elements forming the most RE splice junctions in the human OFC. This type of splicing usually gives rise to a minor splice variant of the endogenous gene and in silico analysis suggests that RE splicing has the potential to introduce novel open reading frames. Reanalysis of previously published sequencing data performed in the mouse cerebellum revealed that thousands of RE splice variants are associated with translating ribosomes. Our results demonstrate that RE expression is more complex than previously envisioned and raise the possibility that RE splicing might generate functional protein isoforms.


Subject(s)
Interspersed Repetitive Sequences/genetics , RNA Splice Sites/genetics , RNA Splicing/genetics , Alternative Splicing/genetics , Animals , Base Sequence , Brain/metabolism , DNA/genetics , Exons , Gene Expression Profiling/methods , Genome/genetics , Humans , Introns , Open Reading Frames/genetics , Prefrontal Cortex/metabolism , Protein Isoforms/genetics , RNA, Messenger/genetics , Repetitive Sequences, Nucleic Acid/genetics , Sequence Analysis, RNA , Untranslated Regions/genetics
10.
Bioinformatics ; 33(24): 4033-4040, 2017 Dec 15.
Article in English | MEDLINE | ID: mdl-27592709

ABSTRACT

MOTIVATION: RNA sequencing (RNA-seq) experiments now span hundreds to thousands of samples. Current spliced alignment software is designed to analyze each sample separately. Consequently, no information is gained from analyzing multiple samples together, and it requires extra work to obtain analysis products that incorporate data from across samples. RESULTS: We describe Rail-RNA, a cloud-enabled spliced aligner that analyzes many samples at once. Rail-RNA eliminates redundant work across samples, making it more efficient as samples are added. For many samples, Rail-RNA is more accurate than annotation-assisted aligners. We use Rail-RNA to align 667 RNA-seq samples from the GEUVADIS project on Amazon Web Services in under 16 h for US$0.91 per sample. Rail-RNA outputs alignments in SAM/BAM format; but it also outputs (i) base-level coverage bigWigs for each sample; (ii) coverage bigWigs encoding normalized mean and median coverages at each base across samples analyzed; and (iii) exon-exon splice junctions and indels (features) in columnar formats that juxtapose coverages in samples in which a given feature is found. Supplementary outputs are ready for use with downstream packages for reproducible statistical analysis. We use Rail-RNA to identify expressed regions in the GEUVADIS samples and show that both annotated and unannotated (novel) expressed regions exhibit consistent patterns of variation across populations and with respect to known confounding variables. AVAILABILITY AND IMPLEMENTATION: Rail-RNA is open-source software available at http://rail.bio. CONTACTS: anellore@gmail.com or langmea@cs.jhu.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
RNA Splicing , Sequence Alignment/methods , Sequence Analysis, RNA/methods , Software , Exons , Gene Expression Profiling
11.
Bioinformatics ; 32(16): 2551-3, 2016 08 15.
Article in English | MEDLINE | ID: mdl-27153614

ABSTRACT

MOTIVATION: Public archives contain thousands of trillions of bases of valuable sequencing data. More than 40% of the Sequence Read Archive is human data protected by provisions such as dbGaP. To analyse dbGaP-protected data, researchers must typically work with IT administrators and signing officials to ensure all levels of security are implemented at their institution. This is a major obstacle, impeding reproducibility and reducing the utility of archived data. RESULTS: We present a protocol and software tool for analyzing protected data in a commercial cloud. The protocol, Rail-dbGaP, is applicable to any tool running on Amazon Web Services Elastic MapReduce. The tool, Rail-RNA v0.2, is a spliced aligner for RNA-seq data, which we demonstrate by running on 9662 samples from the dbGaP-protected GTEx consortium dataset. The Rail-dbGaP protocol makes explicit for the first time the steps an investigator must take to develop Elastic MapReduce pipelines that analyse dbGaP-protected data in a manner compliant with NIH guidelines. Rail-RNA automates implementation of the protocol, making it easy for typical biomedical investigators to study protected RNA-seq data, regardless of their local IT resources or expertise. AVAILABILITY AND IMPLEMENTATION: Rail-RNA is available from http://rail.bio. Technical details on the Rail-dbGaP protocol as well as an implementation walkthrough are available at https://github.com/nellore/rail-dbgap. Detailed instructions on running Rail-RNA on dbGaP-protected data using Amazon Web Services are available at http://docs.rail.bio/dbgap/. CONTACTS: anellore@gmail.com or langmea@cs.jhu.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Computational Biology , Databases, Genetic , Software , Algorithms , High-Throughput Nucleotide Sequencing , Humans , RNA , Reproducibility of Results
12.
Bioinformatics ; 32(24): 3836-3838, 2016 12 15.
Article in English | MEDLINE | ID: mdl-27540268

ABSTRACT

Sequencing and microarray samples often are collected or processed in multiple batches or at different times. This often produces technical biases that can lead to incorrect results in the downstream analysis. There are several existing batch adjustment tools for '-omics' data, but they do not indicate a priori whether adjustment needs to be conducted or how correction should be applied. We present a software pipeline, BatchQC, which addresses these issues using interactive visualizations and statistics that evaluate the impact of batch effects in a genomic dataset. BatchQC can also apply existing adjustment tools and allow users to evaluate their benefits interactively. We used the BatchQC pipeline on both simulated and real data to demonstrate the effectiveness of this software toolkit. AVAILABILITY AND IMPLEMENTATION: BatchQC is available through Bioconductor: http://bioconductor.org/packages/BatchQC and GitHub: https://github.com/mani2012/BatchQC. CONTACT: wej@bu.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Computational Biology/methods , Genomics/methods , Software , Genome , Humans , User-Computer Interface
13.
Nature ; 478(7370): 519-23, 2011 Oct 26.
Article in English | MEDLINE | ID: mdl-22031444

ABSTRACT

Previous investigations have combined transcriptional and genetic analyses in human cell lines, but few have applied these techniques to human neural tissue. To gain a global molecular perspective on the role of the human genome in cortical development, function and ageing, we explore the temporal dynamics and genetic control of transcription in human prefrontal cortex in an extensive series of post-mortem brains from fetal development through ageing. We discover a wave of gene expression changes occurring during fetal development which are reversed in early postnatal life. One half-century later in life, this pattern of reversals is mirrored in ageing and in neurodegeneration. Although we identify thousands of robust associations of individual genetic polymorphisms with gene expression, we also demonstrate that there is no association between the total extent of genetic differences between subjects and the global similarity of their transcriptional profiles. Hence, the human genome produces a consistent molecular architecture in the prefrontal cortex, despite millions of genetic differences across individuals and races. To enable further discovery, this entire data set is freely available (from Gene Expression Omnibus: accession GSE30272; and dbGaP: accession phs000417.v1.p1) and can also be interrogated via a biologist-friendly stand-alone application (http://www.libd.org/braincloud).


Subject(s)
Aging/genetics , Gene Expression Profiling , Gene Expression Regulation, Developmental/genetics , Prefrontal Cortex/growth & development , Prefrontal Cortex/metabolism , Transcriptome/genetics , Autopsy , Fetus/metabolism , Genome, Human/genetics , Humans , Polymorphism, Single Nucleotide/genetics , Prefrontal Cortex/embryology , Racial Groups/genetics , Time Factors
14.
Bioinformatics ; 31(17): 2778-84, 2015 Sep 01.
Article in English | MEDLINE | ID: mdl-25926345

ABSTRACT

MOTIVATION: Statistical methods development for differential expression analysis of RNA sequencing (RNA-seq) requires software tools to assess accuracy and error rate control. Since true differential expression status is often unknown in experimental datasets, artificially constructed datasets must be utilized, either by generating costly spike-in experiments or by simulating RNA-seq data. RESULTS: Polyester is an R package designed to simulate RNA-seq data, beginning with an experimental design and ending with collections of RNA-seq reads. Its main advantage is the ability to simulate reads indicating isoform-level differential expression across biological replicates for a variety of experimental designs. Data generated by Polyester is a reasonable approximation to real RNA-seq data and standard differential expression workflows can recover differential expression set in the simulation by the user. AVAILABILITY AND IMPLEMENTATION: Polyester is freely available from Bioconductor (http://bioconductor.org/). CONTACT: jtleek@gmail.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Chromosomes, Human, Pair 22/genetics , Computational Biology/methods , Gene Expression Profiling/methods , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, RNA/methods , Software , Algorithms , Binomial Distribution , Europe , Gene Expression Regulation , Genetics, Population , Haplotypes/genetics , Humans , Protein Isoforms , RNA/genetics
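
To make concrete what "simulating differential expression across biological replicates" in this entry's abstract can look like, the sketch below draws replicate read counts for one transcript in two groups with a chosen fold change. It is a Python illustration, not Polyester itself (an R package). The negative binomial count model, the function name `simulate_counts`, and all parameter values are assumptions, and no sequencing reads are generated here, only counts.

```python
import numpy as np

def simulate_counts(base_mean, fold_change, n_per_group, size, rng):
    """Draw replicate counts for two groups from a negative binomial.

    `size` is the dispersion-like parameter: variance = mean + mean**2 / size.
    """
    means = np.array([base_mean, base_mean * fold_change], dtype=float)
    # NumPy parameterization: mean = n * (1 - p) / p with n = size.
    p = size / (size + means)
    return [rng.negative_binomial(size, pi, n_per_group) for pi in p]

rng = np.random.default_rng(3)
counts_a, counts_b = simulate_counts(
    base_mean=100, fold_change=2.0, n_per_group=5000, size=10, rng=rng
)
ratio = counts_b.mean() / counts_a.mean()
print(f"observed fold change: {ratio:.2f}")
```

Because the true fold change is set by the user, a differential expression workflow run on such data can be scored against known truth, which is the use case the abstract describes.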
15.
Bioinformatics ; 31(14): 2318-23, 2015 Jul 15.
Article in English | MEDLINE | ID: mdl-25788628

ABSTRACT

MOTIVATION: Prior to applying genomic predictors to clinical samples, the genomic data must be properly normalized to ensure that the test set data are comparable to the data upon which the predictor was trained. The most effective normalization methods depend on data from multiple patients. From a biomedical perspective, this implies that predictions for a single patient may change depending on which other patient samples they are normalized with. This test set bias will occur when any cross-sample normalization is used before clinical prediction. RESULTS: We demonstrate that results from existing gene signatures which rely on normalizing test data may be irreproducible when the patient population changes composition or size using a set of curated, publicly available breast cancer microarray experiments. As an alternative, we examine the use of gene signatures that rely on ranks from the data and show why signatures using rank-based features can avoid test set bias while maintaining highly accurate classification, even across platforms. AVAILABILITY AND IMPLEMENTATION: The code, data and instructions necessary to reproduce our entire analysis are available at https://github.com/prpatil/testsetbias.


Subject(s)
Biomarkers, Tumor/genetics , Breast Neoplasms/genetics , Gene Expression Profiling/methods , Genomics/methods , Models, Statistical , Oligonucleotide Array Sequence Analysis/methods , Breast Neoplasms/mortality , Breast Neoplasms/pathology , Female , Gene Expression Regulation, Neoplastic , Humans , Middle Aged , Neoplasm Grading , Neoplasm Staging , Prognosis , Receptor, ErbB-2/metabolism , Receptors, Estrogen/metabolism , Reproducibility of Results , Survival Rate
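
The test set bias described in this entry's abstract can be seen in a toy example: the same patient, normalized together with two different cohorts, gets different normalized values, while a within-sample rank comparison stays the same. The Python sketch below is illustrative only. Per-gene mean centering stands in for cross-sample normalization, and the helper names `gene_center` and `pair_feature` and all expression values are invented for the example.

```python
import numpy as np

def gene_center(batch):
    """Cross-sample normalization: subtract each gene's mean over the batch."""
    return batch - batch.mean(axis=0, keepdims=True)

def pair_feature(sample, i, j):
    """Rank-based feature: is gene i expressed above gene j in this sample?"""
    return bool(sample[i] > sample[j])

patient = np.array([5.0, 7.0, 6.0])                       # 3 genes
cohort_a = np.array([[4.0, 8.0, 6.0], [6.0, 9.0, 5.0]])   # other patients
cohort_b = np.array([[9.0, 2.0, 7.0], [8.0, 3.0, 6.0]])

# The patient's normalized profile depends on who else is in the batch.
norm_a = gene_center(np.vstack([patient, cohort_a]))[0]
norm_b = gene_center(np.vstack([patient, cohort_b]))[0]

# A threshold rule on normalized gene 1 flips between cohorts ...
call_a = bool(norm_a[1] > 0)
call_b = bool(norm_b[1] > 0)
# ... while the rank-based rule uses only the patient's own values.
rank_call = pair_feature(patient, 1, 0)

print(call_a, call_b, rank_call)  # False True True
```

The rank-based call never touches other patients' data, so it cannot change when the composition or size of the test set changes.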
16.
Nat Rev Genet ; 11(10): 733-9, 2010 10.
Article in English | MEDLINE | ID: mdl-20838408

ABSTRACT

High-throughput technologies are widely used, for example to assay genetic variants, gene and protein expression, and epigenetic modifications. One often overlooked complication with such studies is batch effects, which occur because measurements are affected by laboratory conditions, reagent lots and personnel differences. This becomes a major problem when batch effects are correlated with an outcome of interest and lead to incorrect conclusions. Using both published studies and our own analyses, we argue that batch effects (as well as other technical and biological artefacts) are widespread and critical to address. We review experimental and computational approaches for doing so.


Subject(s)
Biotechnology/methods , Genomics/methods , Oligonucleotide Array Sequence Analysis/methods , Sequence Analysis, DNA/methods , Biotechnology/standards , Biotechnology/statistics & numerical data , Computational Biology/methods , Genomics/standards , Genomics/statistics & numerical data , Oligonucleotide Array Sequence Analysis/standards , Oligonucleotide Array Sequence Analysis/statistics & numerical data , Periodicals as Topic/standards , Research Design/standards , Research Design/statistics & numerical data , Sequence Analysis, DNA/standards , Sequence Analysis, DNA/statistics & numerical data
18.
Nucleic Acids Res ; 42(21), 2014 Dec 01.
Article in English | MEDLINE | ID: mdl-25294822

ABSTRACT

It is now known that unwanted noise and unmodeled artifacts such as batch effects can dramatically reduce the accuracy of statistical inference in genomic experiments. These sources of noise must be modeled and removed to accurately measure biological variability and to obtain correct statistical inference when performing high-throughput genomic analysis. We introduced surrogate variable analysis (sva) for estimating these artifacts by (i) identifying the part of the genomic data only affected by artifacts and (ii) estimating the artifacts with principal components or singular vectors of the subset of the data matrix. The resulting estimates of artifacts can be used in subsequent analyses as adjustment factors to correct analyses. Here I describe a version of the sva approach specifically created for count data or FPKMs from sequencing experiments based on appropriate data transformation. I also describe the addition of supervised sva (ssva) for using control probes to identify the part of the genomic data only affected by artifacts. I present a comparison between these versions of sva and other methods for batch effect estimation on simulated data, real count-based data and FPKM-based data. These updates are available through the sva Bioconductor package and I have made fully reproducible analysis using these methods available from: https://github.com/jtleek/svaseq.


Subject(s)
Artifacts , Gene Expression Profiling/methods , High-Throughput Nucleotide Sequencing/methods , Software , Algorithms , Animals , Genomics/methods , Zebrafish/genetics
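
The two steps named in this entry's abstract, (i) picking the part of the data affected only by artifacts and (ii) estimating the artifacts from its singular vectors, can be sketched for the supervised case, where control features are known in advance. The Python below is a simplified illustration, not the sva package's algorithm. The `log1p` transform, the function name `supervised_surrogates`, and the simulated batch structure are assumptions.

```python
import numpy as np

def supervised_surrogates(counts, control_rows, n_sv):
    """Estimate artifact variables from control features of a count matrix.

    counts: genes x samples; control_rows: indices of features assumed to be
    unaffected by the biology of interest (hypothetical input).
    """
    logged = np.log1p(np.asarray(counts, dtype=float))[control_rows]
    centered = logged - logged.mean(axis=1, keepdims=True)
    # Right singular vectors span the sample space.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:n_sv].T   # samples x n_sv

# Simulated counts: a biological group effect plus a hidden batch effect.
rng = np.random.default_rng(2)
n_genes, n_samples = 300, 40
group = np.tile([0.0, 1.0], n_samples // 2)     # alternating groups
batch = np.repeat([0.0, 1.0], n_samples // 2)   # first half vs second half
beta = rng.normal(0.0, 1.0, n_genes)
beta[:100] = 0.0                                # first 100 genes are controls
gamma = rng.normal(0.0, 0.5, n_genes)
log_mu = 3.0 + np.outer(beta, group) + np.outer(gamma, batch)
counts = rng.poisson(np.exp(log_mu))

sv = supervised_surrogates(counts, np.arange(100), n_sv=1)
batch_corr = abs(np.corrcoef(sv[:, 0], batch)[0, 1])
print(f"|correlation| with hidden batch: {batch_corr:.2f}")
```

The estimated column can then be added to the design matrix of a downstream differential expression model as an adjustment factor, as the abstract describes.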
19.
BMC Bioinformatics ; 16: 372, 2015 Nov 06.
Article in English | MEDLINE | ID: mdl-26545828

ABSTRACT

BACKGROUND: Genomic data production is at its highest level and continues to increase, making available novel primary data and existing public data to researchers for exploration. Here we explore the consequences of "batch" correction for biological discovery in two publicly available expression datasets. We consider this to include the estimation of and adjustment for wide-spread systematic heterogeneity in genomic measurements that is unrelated to the effects under study, whether it be technical or biological in nature. METHODS: We present three illustrative data analyses using surrogate variable analysis (SVA) and describe how to perform artifact discovery in light of natural heterogeneity within biological groups, secondary biological questions of interest, and non-linear treatment effects in a dataset profiling differentiating pluripotent cells (GSE32923) and another from human brain tissue (GSE30272). RESULTS: Careful specification of biological effects of interest is very important to factor-based approaches like SVA. We demonstrate greatly sharpened global and gene-specific differential expression across treatment groups in stem cell systems. Similarly, we demonstrate how to preserve major non-linear effects of age across the lifespan in the brain dataset. However, the gains in precisely defining known effects of interest come at the cost of much other information in the "cleaned" data, including sex, common copy number effects and sample or cell line-specific molecular behavior. CONCLUSIONS: Our analyses indicate that data "cleaning" can be an important component of high-throughput genomic data analysis when interrogating explicitly defined effects in the context of data affected by robust technical artifacts. However, caution should be exercised to avoid removing biological signal of interest. 
It is also important to note that open data exploration is not possible after such supervised "cleaning", because effects beyond those stipulated by the researcher may have been removed. With the goal of making these statistical algorithms more powerful and transparent to researchers in the biological sciences, we provide exploratory plots and accompanying R code for identifying and guiding the "cleaning" process (https://github.com/andrewejaffe/StemCellSVA). The impact of these methods is significant enough that we have made newly processed data available for the brain data set at http://braincloud.jhmi.edu/plots/ and GSE30272.


Subject(s)
Algorithms , Brain/metabolism , Computational Biology/methods , Genome, Human , Genomics/methods , High-Throughput Nucleotide Sequencing/methods , Pluripotent Stem Cells/metabolism , Artifacts , Cell Differentiation , Gene Expression Profiling , Humans , Pluripotent Stem Cells/cytology , Regression Analysis
20.
Biostatistics ; 15(1): 1-12, 2014 Jan.
Article in English | MEDLINE | ID: mdl-24068246

ABSTRACT

The accuracy of published medical research is critical for scientists, physicians and patients who rely on these results. However, the fundamental belief in the medical literature was called into serious question by a paper suggesting that most published medical research is false. Here we adapt estimation methods from the genomics community to the problem of estimating the rate of false discoveries in the medical literature using reported P-values as the data. We then collect P-values from the abstracts of all 77 430 papers published in The Lancet, The Journal of the American Medical Association, The New England Journal of Medicine, The British Medical Journal, and The American Journal of Epidemiology between 2000 and 2010. Among these papers, we found 5322 reported P-values. We estimate that the overall rate of false discoveries among reported results is 14% (s.d. 1%), contrary to previous claims. We also found that there is no significant increase in the estimated rate of reported false discovery results over time (0.5% more false positives (FP) per year, P = 0.18) or with respect to journal submissions (0.5% more FP per 100 submissions, P = 0.12). Statistical analysis must allow for false discoveries in order to make claims on the basis of noisy data. But our analysis suggests that the medical literature remains a reliable record of scientific progress.


Subject(s)
Biomedical Research/standards , Data Interpretation, Statistical , False Positive Reactions , Publications/standards , Algorithms , Computer Simulation , Humans , Software , United Kingdom , United States