Optimized sample selection for cost-efficient long-read population sequencing.

Ranallo-Benavidez, T Rhyker; Lemmon, Zachary; Soyk, Sebastian; Aganezov, Sergey; Salerno, William J; McCoy, Rajiv C; Lippman, Zachary B; Schatz, Michael C; Sedlazeck, Fritz J

Affiliation

Ranallo-Benavidez TR; Johns Hopkins University, Baltimore, Maryland 21218, USA.
Lemmon Z; Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA.
Soyk S; Center for Integrative Genomics, University of Lausanne, Lausanne 1005, Switzerland.
Aganezov S; Johns Hopkins University, Baltimore, Maryland 21218, USA.
Salerno WJ; Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA.
McCoy RC; Johns Hopkins University, Baltimore, Maryland 21218, USA.
Lippman ZB; Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA.
Schatz MC; Howard Hughes Medical Institute, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA.
Sedlazeck FJ; Johns Hopkins University, Baltimore, Maryland 21218, USA.

Genome Res; 31(5): 910-918, May 2021.

Article in English | MEDLINE | ID: mdl-33811084

ABSTRACT

An increasingly important scenario in population genetics is when a large cohort has been genotyped using a low-resolution approach (e.g., microarrays, exome capture, short-read WGS), from which a few individuals are resequenced using a more comprehensive approach, especially long-read sequencing. The subset of individuals selected should ensure that the captured genetic diversity is fully representative and includes variants across all subpopulations. For example, studies of human variation have historically focused on individuals with European ancestry, but this group represents a small fraction of the overall diversity. To address this, SVCollector identifies the optimal subset of individuals for resequencing by analyzing population-level VCF files from low-resolution genotyping studies. It then computes a ranked list of samples that maximizes the total number of variants present within a subset of a given size. To solve this optimization problem, SVCollector implements a fast, greedy heuristic and an exact algorithm using integer linear programming. We apply SVCollector to simulated data, 2504 human genomes from the 1000 Genomes Project, and 3024 genomes from the 3000 Rice Genomes Project and show that the rankings it computes are more representative than alternative naive strategies. When selecting an optimal subset of 100 samples in these cohorts, SVCollector identifies individuals from every subpopulation, whereas naive methods yield an unbalanced selection. Finally, we show that the number of variants present in cohorts selected using this approach follows a power-law distribution that is naturally related to the population genetic concept of the allele frequency spectrum, allowing us to estimate the diversity present with increasing numbers of samples.
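The greedy ranking described in the abstract can be illustrated with a minimal sketch. This is not SVCollector's actual implementation: the function name `greedy_rank`, the dictionary input (sample name mapped to the set of variant IDs it carries), and the toy data are all assumptions made for illustration, and a real run would first extract carrier sets from a population-level VCF file. At each step the sketch picks the sample that adds the most variants not yet covered, which is the standard greedy heuristic for maximum-coverage problems.

```python
from typing import Dict, List, Set, Tuple


def greedy_rank(carriers: Dict[str, Set[str]], k: int) -> List[Tuple[str, int]]:
    """Rank up to k samples; each pick adds the most variants not yet covered.

    carriers maps a (hypothetical) sample name to the set of variant IDs it
    carries. Returns (sample, cumulative number of distinct variants) pairs.
    """
    covered: Set[str] = set()
    remaining = dict(carriers)  # shallow copy so the caller's dict is untouched
    ranking: List[Tuple[str, int]] = []
    while remaining and len(ranking) < k:
        # Iterate in sorted name order so ties are broken deterministically.
        best = max(sorted(remaining), key=lambda s: len(remaining[s] - covered))
        covered |= remaining.pop(best)
        ranking.append((best, len(covered)))
    return ranking


# Toy cohort (illustrative data only, not from the paper).
toy = {
    "s1": {"v1", "v2", "v3"},
    "s2": {"v1", "v2"},
    "s3": {"v4", "v5"},
    "s4": {"v3", "v6"},
}
ranking = greedy_rank(toy, 3)
print(ranking)
```

In this toy cohort, `s2` is never chosen among the first three picks because every variant it carries is already covered by `s1`. A selection that ignores overlap between samples would not account for that redundancy.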

Subject(s)

Genome, Human; Polymorphism, Single Nucleotide; Exome/genetics; Gene Frequency; Genetics, Population; Humans; Sequence Analysis, DNA/methods

Full text: 1 | Collection: 01-internacional | Database: MEDLINE | Health context: 1_ASSA2030 | Health problem: 1_financiamento_saude | Main subject: Genome, Human / Polymorphism, Single Nucleotide | Study type: Health_economic_evaluation / Prognostic_studies | Limit: Humans | Language: En | Journal: Genome Res | Journal subject: MOLECULAR BIOLOGY / GENETICS | Year: 2021 | Document type: Article | Country of affiliation: United States
