Next-generation data filtering in the genomics era.

Hemstrom, William; Grummer, Jared A; Luikart, Gordon; Christie, Mark R

Hemstrom, William; Grummer, Jared A; Luikart, Gordon; Christie, Mark R.

Afiliação

Hemstrom W; Department of Biological Sciences, Purdue University, West Lafayette, IN, USA. whemstro@purdue.edu.
Grummer JA; Flathead Lake Biological Station, Wildlife Biology Program and Division of Biological Sciences, University of Montana, Missoula, MT, USA.
Luikart G; Flathead Lake Biological Station, Wildlife Biology Program and Division of Biological Sciences, University of Montana, Missoula, MT, USA.
Christie MR; Department of Biological Sciences, Purdue University, West Lafayette, IN, USA. christ99@purdue.edu.

Nat Rev Genet ; 2024 Jun 14.

Article em En | MEDLINE | ID: mdl-38877133

ABSTRACT

ABSTRACT

Genomic data are ubiquitous across disciplines, from agriculture to biodiversity, ecology, evolution and human health. However, these datasets often contain noise or errors and are missing information that can affect the accuracy and reliability of subsequent computational analyses and conclusions. A key step in genomic data analysis is filtering - removing sequencing bases, reads, genetic variants and/or individuals from a dataset - to improve data quality for downstream analyses. Researchers are confronted with a multitude of choices when filtering genomic data; they must choose which filters to apply and select appropriate thresholds. To help usher in the next generation of genomic data filtering, we review and suggest best practices to improve the implementation, reproducibility and reporting standards for filter types and thresholds commonly applied to genomic datasets. We focus mainly on filters for minor allele frequency, missing data per individual or per locus, linkage disequilibrium and Hardy-Weinberg deviations. Using simulated and empirical datasets, we illustrate the large effects of different filtering thresholds on common population genetics statistics, such as Tajima's D value, population differentiation (FST), nucleotide diversity (π) and effective population size (Ne).

Texto completo

Imprimir

XML

PubMed Links

Buscar no Google

Texto completo: 1 Coleções: 01-internacional Base de dados: MEDLINE Idioma: En Ano de publicação: 2024 Tipo de documento: Article

Texto completo

Imprimir

XML

PubMed Links

Buscar no Google

Texto completo: 1 Coleções: 01-internacional Base de dados: MEDLINE Idioma: En Ano de publicação: 2024 Tipo de documento: Article