Results 1 - 20 of 78
1.
Cell ; 187(17): 4449-4457, 2024 Aug 22.
Article in English | MEDLINE | ID: mdl-39178828

ABSTRACT

Computational data-centric research techniques play a prevalent and multi-disciplinary role in life science research. In the past, scientists in wet labs generated the data, and computational researchers focused on creating tools for the analysis of those data. Computational researchers are now becoming more independent and taking leadership roles within biomedical projects, leveraging the increased availability of public data. We are now able to generate vast amounts of data, and the challenge has shifted from data generation to data analysis. Here we discuss the pitfalls, challenges, and opportunities facing the field of data-centric research in biology. We discuss the evolving perception of computational data-driven research and its rise as an independent domain in biomedical research while also addressing the significant collaborative opportunities that arise from integrating computational research with experimental and translational biology. Additionally, we discuss the future of data-centric research and its applications across various areas of the biomedical field.


Subjects
Biomedical Research , Computational Biology , Computational Biology/methods , Humans
2.
Bioinformatics ; 40(8)2024 08 02.
Article in English | MEDLINE | ID: mdl-39067017

ABSTRACT

MOTIVATION: Software is vital for the advancement of biology and medicine. Impact evaluations of scientific software have primarily emphasized traditional citation metrics of associated papers, despite these metrics inadequately capturing the dynamic picture of impact and despite challenges with improper citation. RESULTS: To understand how software developers evaluate their tools, we conducted a survey of participants in the Informatics Technology for Cancer Research (ITCR) program funded by the National Cancer Institute (NCI). We found that although developers realize the value of more extensive metric collection, they are hindered by a lack of funding and time. We also investigated software among this community for how often infrastructure that supports more nontraditional metrics was implemented and how this affected rates of papers describing usage of the software. We found that infrastructure such as social media presence, more in-depth documentation, the presence of software health metrics, and clear information on how to contact developers seemed to be associated with increased mention rates. Analysing more diverse metrics can enable developers to better understand user engagement, justify continued funding, identify novel use cases, pinpoint improvement areas, and ultimately amplify their software's impact. There are associated challenges, including distorted or misleading metrics, as well as ethical and security concerns. More attention to nuances involved in capturing impact across the spectrum of biomedical software is needed. For funders and developers, we outline guidance based on experience from our community. By considering how we evaluate software, we can empower developers to create tools that more effectively accelerate biological and medical research progress. AVAILABILITY AND IMPLEMENTATION: More information about the analysis, as well as access to data and code, is available at https://github.com/fhdsl/ITCR_Metrics_manuscript_website.


Subjects
Biomedical Research , Software , Biomedical Research/methods , Humans , United States , Computational Biology/methods
4.
Genome Res ; 30(7): 1073-1081, 2020 07.
Article in English | MEDLINE | ID: mdl-32079618

ABSTRACT

Long noncoding RNAs (lncRNAs) have emerged as key coordinators of biological and cellular processes. Characterizing lncRNA expression across cells and tissues is key to understanding their role in determining phenotypes, including human diseases. We present here FC-R2, a comprehensive expression atlas across a broadly defined human transcriptome, inclusive of over 109,000 coding and noncoding genes, as described in the FANTOM CAGE-Associated Transcriptome (FANTOM-CAT) study. This atlas greatly extends the gene annotation used in the original recount2 resource. We demonstrate the utility of the FC-R2 atlas by reproducing key findings from published large studies and by generating new results across normal and diseased human samples. In particular, we (a) identify tissue-specific transcription profiles for distinct classes of coding and noncoding genes, (b) perform differential expression analysis across thirteen cancer types, identifying novel noncoding genes potentially involved in tumor pathogenesis and progression, and (c) confirm the prognostic value for several enhancer lncRNAs expression in cancer. Our resource is instrumental for the systematic molecular characterization of lncRNA by the FANTOM6 Consortium. In conclusion, comprised of over 70,000 samples, the FC-R2 atlas will empower other researchers to investigate functions and biological roles of both known coding genes and novel lncRNAs.


Subjects
Transcriptome , Databases, Genetic , Enhancer Elements, Genetic , Gene Expression Profiling , Genome, Human , Humans , Neoplasms/genetics , Organ Specificity , Prognosis , RNA, Long Noncoding/genetics , RNA, Long Noncoding/metabolism , RNA, Messenger/metabolism
5.
Blood ; 137(7): 959-968, 2021 02 18.
Article in English | MEDLINE | ID: mdl-33094331

ABSTRACT

Genome-wide association studies have identified common variants associated with platelet-related phenotypes, but because these variants are largely intronic or intergenic, their link to platelet biology is unclear. In 290 normal subjects from the GeneSTAR Research Study (110 African Americans [AAs] and 180 European Americans [EAs]), we generated whole-genome sequence data from whole blood and RNA sequence data from extracted nonribosomal RNA from 185 induced pluripotent stem cell-derived megakaryocyte (MK) cell lines (platelet precursor cells) and 290 blood platelet samples from these subjects. Using eigenMT software to select the peak single-nucleotide polymorphism (SNP) for each expressed gene, and meta-analyzing the results of AAs and EAs, we identify (q-value < 0.05) 946 cis-expression quantitative trait loci (eQTLs) in derived MKs and 1830 cis-eQTLs in blood platelets. Among the 57 eQTLs shared between the 2 tissues, the estimated directions of effect are very consistent (98.2% concordance). A high proportion of detected cis-eQTLs (74.9% in MKs and 84.3% in platelets) are unique to MKs and platelets compared with peak-associated SNP-expressed gene pairs of 48 other tissue types that are reported in version V7 of the Genotype-Tissue Expression Project. The locations of our identified eQTLs are significantly enriched for overlap with several annotation tracks highlighting genomic regions with specific functionality in MKs, including MK-specific DNAse hotspots, H3K27-acetylation marks, H3K4-methylation marks, enhancers, and superenhancers. These results offer insights into the regulatory signature of MKs and platelets, with significant overlap in genes expressed, eQTLs detected, and enrichment within known superenhancers relevant to platelet biology.


Subjects
Blood Platelets/metabolism , Induced Pluripotent Stem Cells/cytology , Megakaryocytes/metabolism , RNA/genetics , Transcriptome , Adult , Black People/genetics , Blood Platelets/cytology , Cells, Cultured , Female , Gene Ontology , Genome-Wide Association Study , Humans , Male , Megakaryocytes/cytology , Organ Specificity , Polymorphism, Single Nucleotide , Quantitative Trait Loci , RNA/biosynthesis , RNA-Seq , White People/genetics , Whole Genome Sequencing
6.
Proc Natl Acad Sci U S A ; 117(48): 30266-30275, 2020 12 01.
Article in English | MEDLINE | ID: mdl-33208538

ABSTRACT

Many modern problems in medicine and public health leverage machine-learning methods to predict outcomes based on observable covariates. In a wide array of settings, predicted outcomes are used in subsequent statistical analysis, often without accounting for the distinction between observed and predicted outcomes. We call inference with predicted outcomes postprediction inference. In this paper, we develop methods for correcting statistical inference using outcomes predicted with arbitrarily complicated machine-learning models including random forests and deep neural nets. Rather than trying to derive the correction from first principles for each machine-learning algorithm, we observe that there is typically a low-dimensional and easily modeled representation of the relationship between the observed and predicted outcomes. We build an approach for postprediction inference that naturally fits into the standard machine-learning framework where the data are divided into training, testing, and validation sets. We train the prediction model in the training set, estimate the relationship between the observed and predicted outcomes in the testing set, and use that relationship to correct subsequent inference in the validation set. We show our postprediction inference (postpi) approach can correct bias and improve variance estimation and subsequent statistical inference with predicted outcomes. To show the broad range of applicability of our approach, we show postpi can improve inference in two distinct fields: modeling predicted phenotypes in repurposed gene expression data and modeling predicted causes of death in verbal autopsy data. Our method is available through an open-source R package: https://github.com/leekgroup/postpi.


Subjects
Machine Learning , Cause of Death , Computer Simulation , Humans , Organ Specificity
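
The three-step procedure described in the postpi abstract above (train a predictor, model the relationship between observed and predicted outcomes in a testing set, then correct inference in a validation set) can be illustrated with a simplified Python sketch. This is not the postpi R package and does not reproduce its algorithm: the simulated data, the deliberately shrunken `ridge_predictor` standing in for a black-box model, and the choice to rescale the naive slope by the relationship slope are all assumptions made for illustration, and the sketch ignores the variance correction the abstract mentions.

```python
import random


def ols(x, y):
    """Return (intercept, slope) of the least-squares line y ~ x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def ridge_predictor(x, y, penalty):
    """Train a shrunken linear predictor (hypothetical stand-in for a
    black-box machine-learning model whose predictions are biased)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / (sxx + penalty)
    return lambda v: my + slope * (v - mx)


def simulate(n, rng, true_slope=2.0):
    """Simulate a covariate x and an outcome y = true_slope * x + noise."""
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [true_slope * a + rng.gauss(0, 1) for a in x]
    return x, y


rng = random.Random(0)
x_train, y_train = simulate(200, rng)
x_test, y_test = simulate(200, rng)
x_val, _ = simulate(200, rng)  # outcomes treated as unobserved here

# Step 1 (training set): fit the prediction model.
predict = ridge_predictor(x_train, y_train, penalty=200.0)

# Step 2 (testing set): model observed outcomes as a function of predictions.
pred_test = [predict(v) for v in x_test]
gamma0, gamma1 = ols(pred_test, y_test)

# Step 3 (validation set): inference on predicted outcomes, then correction.
pred_val = [predict(v) for v in x_val]
_, naive_slope = ols(x_val, pred_val)      # biased toward zero by shrinkage
corrected_slope = gamma1 * naive_slope     # rescaled by the relationship model

print(f"naive slope:     {naive_slope:.3f}")
print(f"corrected slope: {corrected_slope:.3f}")
```

In this toy setting the corrected slope should land much closer to the simulated value of 2.0 than the naive slope does, because the testing set reveals how strongly the predictor shrinks its outputs.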
8.
Nucleic Acids Res ; 46(9): e54, 2018 05 18.
Article in English | MEDLINE | ID: mdl-29514223

ABSTRACT

Publicly available genomic data are a valuable resource for studying normal human variation and disease, but these data are often not well labeled or annotated. The lack of phenotype information for public genomic data severely limits their utility for addressing targeted biological questions. We develop an in silico phenotyping approach for predicting critical missing annotation directly from genomic measurements using well-annotated genomic and phenotypic data produced by consortia like TCGA and GTEx as training data. We apply in silico phenotyping to a set of 70,000 RNA-seq samples we recently processed on a common pipeline as part of the recount2 project. We use gene expression data to build and evaluate predictors for both biological phenotypes (sex, tissue, sample source) and experimental conditions (sequencing strategy). We demonstrate how these predictions can be used to study cross-sample properties of public genomic data, select genomic projects with specific characteristics, and perform downstream analyses using predicted phenotypes. The methods to perform phenotype prediction are available in the phenopredict R package and the predictions for recount2 are available from the recount R package. With data and phenotype information available for 70,000 human samples, expression data are available for use on a scale that was not previously feasible.


Subjects
Gene Expression Profiling , Phenotype , Sequence Analysis, RNA , Computer Simulation , Female , Humans , Male , Software
9.
Proc Natl Acad Sci U S A ; 114(27): 7130-7135, 2017 07 03.
Article in English | MEDLINE | ID: mdl-28634288

ABSTRACT

RNA sequencing (RNA-seq) is a powerful approach for measuring gene expression levels in cells and tissues, but it relies on high-quality RNA. We demonstrate here that statistical adjustment using existing quality measures largely fails to remove the effects of RNA degradation when RNA quality associates with the outcome of interest. Using RNA-seq data from molecular degradation experiments of human primary tissues, we introduce a method, quality surrogate variable analysis (qSVA), as a framework for estimating and removing the confounding effect of RNA quality in differential expression analysis. We show that this approach results in greatly improved replication rates (>3×) across two large independent postmortem human brain studies of schizophrenia and also removes potential RNA quality biases in earlier published work that compared expression levels of different brain regions and other diagnostic groups. Our approach can therefore improve the interpretation of differential expression analysis of transcriptomic data from human tissue.


Subjects
RNA/analysis , Sequence Analysis, RNA/methods , Algorithms , Animals , Computational Biology , DNA Replication , Gene Expression Profiling , Gene Expression Regulation , Genotype , Gray Matter , High-Throughput Nucleotide Sequencing , Humans , Oligonucleotide Array Sequence Analysis , RNA/genetics , Schizophrenia/genetics , Schizophrenia/metabolism , Transcriptome
10.
Nucleic Acids Res ; 45(2): e9, 2017 01 25.
Article in English | MEDLINE | ID: mdl-27694310

ABSTRACT

Differential expression analysis of RNA sequencing (RNA-seq) data typically relies on reconstructing transcripts or counting reads that overlap known gene structures. We previously introduced an intermediate statistical approach called differentially expressed region (DER) finder that seeks to identify contiguous regions of the genome showing differential expression signal at single base resolution without relying on existing annotation or potentially inaccurate transcript assembly. We present the derfinder software that improves our annotation-agnostic approach to RNA-seq analysis by: (i) implementing a computationally efficient bump-hunting approach to identify DERs that permits genome-scale analyses in a large number of samples, (ii) introducing a flexible statistical modeling framework, including multi-group and time-course analyses and (iii) introducing a new set of data visualizations for expressed region analysis. We apply this approach to public RNA-seq data from the Genotype-Tissue Expression (GTEx) project and BrainSpan project to show that derfinder permits the analysis of hundreds of samples at base resolution in R, identifies expression outside of known gene boundaries and can be used to visualize expressed regions at base-resolution. In simulations, our base resolution approaches enable discovery in the presence of incomplete annotation and are nearly as powerful as feature-level methods when the annotation is complete. derfinder analysis using expressed region-level and single base-level approaches provides a compromise between full transcript reconstruction and feature-level analysis. The package is available from Bioconductor at www.bioconductor.org/packages/derfinder.


Subjects
Gene Expression Profiling/methods , Software , Gene Expression Regulation , Genomics/methods , High-Throughput Nucleotide Sequencing , Molecular Sequence Annotation , Organ Specificity/genetics , Transcriptome , Web Browser
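
The core idea in the derfinder abstract above, finding contiguous stretches of bases whose signal exceeds a cutoff without consulting any annotation, can be illustrated with a minimal Python sketch. It is not the derfinder API (derfinder is an R/Bioconductor package); the function name `find_regions`, the toy coverage vector, and the cutoff value are assumptions made for illustration, and the statistical modeling that derfinder applies to the regions is not shown.

```python
def find_regions(coverage, cutoff):
    """Return (start, end) index pairs, 0-based and inclusive, for each
    maximal run of consecutive positions whose value exceeds the cutoff."""
    regions = []
    start = None
    for pos, value in enumerate(coverage):
        if value > cutoff:
            if start is None:
                start = pos  # a new region opens at this base
        elif start is not None:
            regions.append((start, pos - 1))  # region ended at previous base
            start = None
    if start is not None:  # region extends to the end of the vector
        regions.append((start, len(coverage) - 1))
    return regions


# Hypothetical mean coverage across samples at ten consecutive bases.
mean_coverage = [0, 0, 5, 6, 7, 0, 0, 3, 4, 0]
regions = find_regions(mean_coverage, cutoff=2)
print(regions)
```

The same scan works on any per-base statistic, so a vector of test statistics could replace the coverage vector if the goal were candidate differentially expressed regions instead of expressed regions.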
11.
Hum Mol Genet ; 25(22): 4962-4982, 2016 11 15.
Article in English | MEDLINE | ID: mdl-28171598

ABSTRACT

We performed a thorough characterization of expressed repetitive element loci (RE) in the human orbitofrontal cortex (OFC) using directional RNA sequencing data. Considering only sequencing reads that map uniquely onto the human genome, we discovered that the overwhelming majority of intronic and exonic RE are expressed in the same orientation as the gene in which they reside. Our mapping approach enabled the identification of novel differentially expressed RE transcripts between the OFC and peripheral blood lymphocytes. Further analysis revealed that RE are extensively spliced into coding regions of gene transcripts yielding thousands of novel mRNA variants with altered coding potential. Lower frequency splicing of RE into untranslated regions of gene transcripts was also observed. The same pattern of RE splicing in the brain was also detected for Drosophila, zebrafish, mouse, rat, dog and rabbit. RE splicing occurs largely at canonical GT-AG splice junctions with LINE and SINE elements forming the most RE splice junctions in the human OFC. This type of splicing usually gives rise to a minor splice variant of the endogenous gene and in silico analysis suggests that RE splicing has the potential to introduce novel open reading frames. Reanalysis of previously published sequencing data performed in the mouse cerebellum revealed that thousands of RE splice variants are associated with translating ribosomes. Our results demonstrate that RE expression is more complex than previously envisioned and raise the possibility that RE splicing might generate functional protein isoforms.


Subjects
Interspersed Repetitive Sequences/genetics , RNA Splice Sites/genetics , RNA Splicing/genetics , Alternative Splicing/genetics , Animals , Base Sequence , Brain/metabolism , DNA/genetics , Exons , Gene Expression Profiling/methods , Genome/genetics , Humans , Introns , Open Reading Frames/genetics , Prefrontal Cortex/metabolism , Protein Isoforms/genetics , RNA, Messenger/genetics , Repetitive Sequences, Nucleic Acid/genetics , Sequence Analysis, RNA , Untranslated Regions/genetics
12.
Bioinformatics ; 33(24): 4033-4040, 2017 Dec 15.
Article in English | MEDLINE | ID: mdl-27592709

ABSTRACT

MOTIVATION: RNA sequencing (RNA-seq) experiments now span hundreds to thousands of samples. Current spliced alignment software is designed to analyze each sample separately. Consequently, no information is gained from analyzing multiple samples together, and it requires extra work to obtain analysis products that incorporate data from across samples. RESULTS: We describe Rail-RNA, a cloud-enabled spliced aligner that analyzes many samples at once. Rail-RNA eliminates redundant work across samples, making it more efficient as samples are added. For many samples, Rail-RNA is more accurate than annotation-assisted aligners. We use Rail-RNA to align 667 RNA-seq samples from the GEUVADIS project on Amazon Web Services in under 16 h for US$0.91 per sample. Rail-RNA outputs alignments in SAM/BAM format; but it also outputs (i) base-level coverage bigWigs for each sample; (ii) coverage bigWigs encoding normalized mean and median coverages at each base across samples analyzed; and (iii) exon-exon splice junctions and indels (features) in columnar formats that juxtapose coverages in samples in which a given feature is found. Supplementary outputs are ready for use with downstream packages for reproducible statistical analysis. We use Rail-RNA to identify expressed regions in the GEUVADIS samples and show that both annotated and unannotated (novel) expressed regions exhibit consistent patterns of variation across populations and with respect to known confounding variables. AVAILABILITY AND IMPLEMENTATION: Rail-RNA is open-source software available at http://rail.bio. CONTACTS: anellore@gmail.com or langmea@cs.jhu.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
RNA Splicing , Sequence Alignment/methods , Sequence Analysis, RNA/methods , Software , Exons , Gene Expression Profiling
13.
Bioinformatics ; 32(16): 2551-3, 2016 08 15.
Article in English | MEDLINE | ID: mdl-27153614

ABSTRACT

MOTIVATION: Public archives contain thousands of trillions of bases of valuable sequencing data. More than 40% of the Sequence Read Archive is human data protected by provisions such as dbGaP. To analyse dbGaP-protected data, researchers must typically work with IT administrators and signing officials to ensure all levels of security are implemented at their institution. This is a major obstacle, impeding reproducibility and reducing the utility of archived data. RESULTS: We present a protocol and software tool for analyzing protected data in a commercial cloud. The protocol, Rail-dbGaP, is applicable to any tool running on Amazon Web Services Elastic MapReduce. The tool, Rail-RNA v0.2, is a spliced aligner for RNA-seq data, which we demonstrate by running on 9662 samples from the dbGaP-protected GTEx consortium dataset. The Rail-dbGaP protocol makes explicit for the first time the steps an investigator must take to develop Elastic MapReduce pipelines that analyse dbGaP-protected data in a manner compliant with NIH guidelines. Rail-RNA automates implementation of the protocol, making it easy for typical biomedical investigators to study protected RNA-seq data, regardless of their local IT resources or expertise. AVAILABILITY AND IMPLEMENTATION: Rail-RNA is available from http://rail.bio . Technical details on the Rail-dbGaP protocol as well as an implementation walkthrough are available at https://github.com/nellore/rail-dbgap . Detailed instructions on running Rail-RNA on dbGaP-protected data using Amazon Web Services are available at http://docs.rail.bio/dbgap/ . CONTACTS: anellore@gmail.com or langmea@cs.jhu.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Computational Biology , Databases, Genetic , Software , Algorithms , High-Throughput Nucleotide Sequencing , Humans , RNA , Reproducibility of Results
14.
Bioinformatics ; 32(24): 3836-3838, 2016 12 15.
Article in English | MEDLINE | ID: mdl-27540268

ABSTRACT

Sequencing and microarray samples often are collected or processed in multiple batches or at different times. This often produces technical biases that can lead to incorrect results in the downstream analysis. There are several existing batch adjustment tools for '-omics' data, but they do not indicate a priori whether adjustment needs to be conducted or how correction should be applied. We present a software pipeline, BatchQC, which addresses these issues using interactive visualizations and statistics that evaluate the impact of batch effects in a genomic dataset. BatchQC can also apply existing adjustment tools and allow users to evaluate their benefits interactively. We used the BatchQC pipeline on both simulated and real data to demonstrate the effectiveness of this software toolkit. AVAILABILITY AND IMPLEMENTATION: BatchQC is available through Bioconductor: http://bioconductor.org/packages/BatchQC and GitHub: https://github.com/mani2012/BatchQC . CONTACT: wej@bu.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Computational Biology/methods , Genomics/methods , Software , Genome , Humans , User-Computer Interface
15.
Nature ; 478(7370): 519-23, 2011 Oct 26.
Article in English | MEDLINE | ID: mdl-22031444

ABSTRACT

Previous investigations have combined transcriptional and genetic analyses in human cell lines, but few have applied these techniques to human neural tissue. To gain a global molecular perspective on the role of the human genome in cortical development, function and ageing, we explore the temporal dynamics and genetic control of transcription in human prefrontal cortex in an extensive series of post-mortem brains from fetal development through ageing. We discover a wave of gene expression changes occurring during fetal development which are reversed in early postnatal life. One half-century later in life, this pattern of reversals is mirrored in ageing and in neurodegeneration. Although we identify thousands of robust associations of individual genetic polymorphisms with gene expression, we also demonstrate that there is no association between the total extent of genetic differences between subjects and the global similarity of their transcriptional profiles. Hence, the human genome produces a consistent molecular architecture in the prefrontal cortex, despite millions of genetic differences across individuals and races. To enable further discovery, this entire data set is freely available (from Gene Expression Omnibus: accession GSE30272; and dbGaP: accession phs000417.v1.p1) and can also be interrogated via a biologist-friendly stand-alone application (http://www.libd.org/braincloud).


Subjects
Aging/genetics , Gene Expression Profiling , Gene Expression Regulation, Developmental/genetics , Prefrontal Cortex/growth & development , Prefrontal Cortex/metabolism , Transcriptome/genetics , Autopsy , Fetus/metabolism , Genome, Human/genetics , Humans , Polymorphism, Single Nucleotide/genetics , Prefrontal Cortex/embryology , Racial Groups/genetics , Time Factors
16.
Bioinformatics ; 31(17): 2778-84, 2015 Sep 01.
Article in English | MEDLINE | ID: mdl-25926345

ABSTRACT

MOTIVATION: Statistical methods development for differential expression analysis of RNA sequencing (RNA-seq) requires software tools to assess accuracy and error rate control. Since true differential expression status is often unknown in experimental datasets, artificially constructed datasets must be utilized, either by generating costly spike-in experiments or by simulating RNA-seq data. RESULTS: Polyester is an R package designed to simulate RNA-seq data, beginning with an experimental design and ending with collections of RNA-seq reads. Its main advantage is the ability to simulate reads indicating isoform-level differential expression across biological replicates for a variety of experimental designs. Data generated by Polyester is a reasonable approximation to real RNA-seq data and standard differential expression workflows can recover differential expression set in the simulation by the user. AVAILABILITY AND IMPLEMENTATION: Polyester is freely available from Bioconductor (http://bioconductor.org/). CONTACT: jtleek@gmail.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Chromosomes, Human, Pair 22/genetics , Computational Biology/methods , Gene Expression Profiling/methods , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, RNA/methods , Software , Algorithms , Binomial Distribution , Europe , Gene Expression Regulation , Genetics, Population , Haplotypes/genetics , Humans , Protein Isoforms , RNA/genetics
17.
Bioinformatics ; 31(14): 2318-23, 2015 Jul 15.
Article in English | MEDLINE | ID: mdl-25788628

ABSTRACT

MOTIVATION: Prior to applying genomic predictors to clinical samples, the genomic data must be properly normalized to ensure that the test set data are comparable to the data upon which the predictor was trained. The most effective normalization methods depend on data from multiple patients. From a biomedical perspective, this implies that predictions for a single patient may change depending on which other patient samples they are normalized with. This test set bias will occur when any cross-sample normalization is used before clinical prediction. RESULTS: Using a set of curated, publicly available breast cancer microarray experiments, we demonstrate that results from existing gene signatures which rely on normalizing test data may be irreproducible when the patient population changes composition or size. As an alternative, we examine the use of gene signatures that rely on ranks from the data and show why signatures using rank-based features can avoid test set bias while maintaining highly accurate classification, even across platforms. AVAILABILITY AND IMPLEMENTATION: The code, data and instructions necessary to reproduce our entire analysis are available at https://github.com/prpatil/testsetbias.


Subjects
Biomarkers, Tumor/genetics , Breast Neoplasms/genetics , Gene Expression Profiling/methods , Genomics/methods , Models, Statistical , Oligonucleotide Array Sequence Analysis/methods , Breast Neoplasms/mortality , Breast Neoplasms/pathology , Female , Gene Expression Regulation, Neoplastic , Humans , Middle Aged , Neoplasm Grading , Neoplasm Staging , Prognosis , Receptor, ErbB-2/metabolism , Receptors, Estrogen/metabolism , Reproducibility of Results , Survival Rate
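
The test set bias described in the abstract above can be shown with a small Python sketch: a call that depends on cross-sample normalization flips for the same patient when the surrounding cohort changes, while a within-patient rank comparison does not. The gene names, expression values, and both toy "signatures" are hypothetical and are not taken from the paper or its repository.

```python
def zscore_in_cohort(cohort, index, gene):
    """Z-score of one patient's gene value relative to the whole cohort."""
    values = [patient[gene] for patient in cohort]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return (cohort[index][gene] - mean) / variance ** 0.5


def normalized_call(cohort, index):
    """Toy signature: high risk if cohort-normalized GENE_A is above zero."""
    return zscore_in_cohort(cohort, index, "GENE_A") > 0


def rank_call(patient):
    """Toy rank-based signature: high risk if GENE_A exceeds GENE_B
    within the same patient; no other samples are consulted."""
    return patient["GENE_A"] > patient["GENE_B"]


patient = {"GENE_A": 6.0, "GENE_B": 5.0}

# The same patient placed in two different hypothetical cohorts.
cohort_low = [patient,
              {"GENE_A": 4.0, "GENE_B": 5.5},
              {"GENE_A": 5.0, "GENE_B": 4.5}]
cohort_high = [patient,
               {"GENE_A": 8.0, "GENE_B": 5.5},
               {"GENE_A": 9.0, "GENE_B": 4.5}]

call_in_low = normalized_call(cohort_low, 0)    # patient is above the cohort mean
call_in_high = normalized_call(cohort_high, 0)  # patient is below the cohort mean
call_by_rank = rank_call(patient)               # identical in either cohort

print(call_in_low, call_in_high, call_by_rank)
```

The normalized call changes because the cohort mean moves from 5.0 to about 7.67 while the patient's own measurements stay fixed; the rank-based call never looks at the cohort, so it cannot change.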
18.
Nat Rev Genet ; 11(10): 733-9, 2010 10.
Article in English | MEDLINE | ID: mdl-20838408

ABSTRACT

High-throughput technologies are widely used, for example to assay genetic variants, gene and protein expression, and epigenetic modifications. One often overlooked complication with such studies is batch effects, which occur because measurements are affected by laboratory conditions, reagent lots and personnel differences. This becomes a major problem when batch effects are correlated with an outcome of interest and lead to incorrect conclusions. Using both published studies and our own analyses, we argue that batch effects (as well as other technical and biological artefacts) are widespread and critical to address. We review experimental and computational approaches for doing so.


Subjects
Biotechnology/methods , Genomics/methods , Oligonucleotide Array Sequence Analysis/methods , Sequence Analysis, DNA/methods , Biotechnology/standards , Biotechnology/statistics & numerical data , Computational Biology/methods , Genomics/standards , Genomics/statistics & numerical data , Oligonucleotide Array Sequence Analysis/standards , Oligonucleotide Array Sequence Analysis/statistics & numerical data , Periodicals as Topic/standards , Research Design/standards , Research Design/statistics & numerical data , Sequence Analysis, DNA/standards , Sequence Analysis, DNA/statistics & numerical data
20.
Nucleic Acids Res ; 42(21)2014 Dec 01.
Article in English | MEDLINE | ID: mdl-25294822

ABSTRACT

It is now known that unwanted noise and unmodeled artifacts such as batch effects can dramatically reduce the accuracy of statistical inference in genomic experiments. These sources of noise must be modeled and removed to accurately measure biological variability and to obtain correct statistical inference when performing high-throughput genomic analysis. We introduced surrogate variable analysis (sva) for estimating these artifacts by (i) identifying the part of the genomic data only affected by artifacts and (ii) estimating the artifacts with principal components or singular vectors of the subset of the data matrix. The resulting estimates of artifacts can be used in subsequent analyses as adjustment factors to correct analyses. Here I describe a version of the sva approach specifically created for count data or FPKMs from sequencing experiments based on appropriate data transformation. I also describe the addition of supervised sva (ssva) for using control probes to identify the part of the genomic data only affected by artifacts. I present a comparison between these versions of sva and other methods for batch effect estimation on simulated data, real count-based data and FPKM-based data. These updates are available through the sva Bioconductor package and I have made fully reproducible analysis using these methods available from: https://github.com/jtleek/svaseq.


Subjects
Artifacts , Gene Expression Profiling/methods , High-Throughput Nucleotide Sequencing/methods , Software , Algorithms , Animals , Genomics/methods , Zebrafish/genetics
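
The two steps named in the sva abstract above (isolate the part of the data affected only by artifacts, then estimate the artifact with a singular vector) can be illustrated with a heavily simplified Python sketch. It is not the sva Bioconductor package and omits that method's weighting and count-data transformation: here the known biological signal is removed by subtracting group means, and the leading right singular vector of the residual matrix is approximated by power iteration. The simulated data and all function names are assumptions made for illustration.

```python
import random


def group_residuals(row, groups):
    """Subtract each group's mean from its samples, removing the known signal."""
    means = {}
    for g in set(groups):
        vals = [v for v, gg in zip(row, groups) if gg == g]
        means[g] = sum(vals) / len(vals)
    return [v - means[g] for v, g in zip(row, groups)]


def top_singular_vector(matrix, iterations=200, seed=1):
    """Approximate the leading right singular vector of a genes-by-samples
    matrix M by power iteration on M^T M."""
    n = len(matrix[0])
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(n)]
    for _ in range(iterations):
        u = [sum(row[j] * v[j] for j in range(n)) for row in matrix]  # M v
        w = [sum(matrix[i][j] * u[i] for i in range(len(matrix)))     # M^T u
             for j in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


def correlation(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


# Simulated data: 40 genes, 20 samples, a known group and a hidden batch.
rng = random.Random(0)
groups = [0] * 10 + [1] * 10
batch = [j % 2 for j in range(20)]  # hidden artifact, balanced within groups
expression = []
for gene in range(40):
    group_effect = 1.0 if gene < 10 else 0.0
    batch_effect = rng.gauss(0, 2)
    expression.append([5 + group_effect * g + batch_effect * b + rng.gauss(0, 0.3)
                       for g, b in zip(groups, batch)])

residuals = [group_residuals(row, groups) for row in expression]
surrogate = top_singular_vector(residuals)
batch_correlation = correlation(surrogate, batch)
print(f"|correlation with hidden batch| = {abs(batch_correlation):.3f}")
```

The sign of a singular vector is arbitrary, so only the absolute correlation with the hidden batch is meaningful. In a real analysis the estimated surrogate variable would be included as an adjustment covariate in the downstream model, as the abstract describes.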