Results 1 - 20 of 46
1.
PLoS Genet ; 20(2): e1010657, 2024 Feb.
Article in English | MEDLINE | ID: mdl-38377104

ABSTRACT

A growing body of evidence suggests that gene flow between closely related species is a widespread phenomenon. Alleles that introgress from one species into a close relative are typically neutral or deleterious, but sometimes confer a significant fitness advantage. Given the potential relevance to speciation and adaptation, numerous methods have therefore been devised to identify regions of the genome that have experienced introgression. Recently, supervised machine learning approaches have been shown to be highly effective for detecting introgression. One especially promising approach is to treat population genetic inference as an image classification problem, and feed an image representation of a population genetic alignment as input to a deep neural network that distinguishes among evolutionary models (i.e. introgression or no introgression). However, if we wish to investigate the full extent and fitness effects of introgression, merely identifying genomic regions in a population genetic alignment that harbor introgressed loci is insufficient; ideally we would be able to infer precisely which individuals have introgressed material and at which positions in the genome. Here we adapt a deep learning algorithm for semantic segmentation, the task of correctly identifying the type of object to which each individual pixel in an image belongs, to the task of identifying introgressed alleles. Our trained neural network is thus able to infer, for each individual in a two-population alignment, which of that individual's alleles were introgressed from the other population. We use simulated data to show that this approach is highly accurate, and that it can be readily extended to identify alleles that are introgressed from an unsampled "ghost" population, performing comparably to a supervised learning method tailored specifically to that task.
Finally, we apply this method to data from Drosophila, showing that it is able to accurately recover introgressed haplotypes from real data. This analysis reveals that introgressed alleles are typically confined to lower frequencies within genic regions, suggestive of purifying selection, but are found at much higher frequencies in a region previously shown to be affected by adaptive introgression. Our method's success in recovering introgressed haplotypes in challenging real-world scenarios underscores the utility of deep learning approaches for making richer evolutionary inferences from genomic data.


Subjects
Population Genetics , Semantics , Humans , Alleles , Genomics , Biological Evolution
2.
Am J Hum Genet ; 109(11): 1986-1997, 2022 11 03.
Article in English | MEDLINE | ID: mdl-36198314

ABSTRACT

Whole-genome sequencing (WGS) is the gold standard for fully characterizing genetic variation but is still prohibitively expensive for large samples. To reduce costs, many studies sequence only a subset of individuals or genomic regions, and genotype imputation is used to infer genotypes for the remaining individuals or regions without sequencing data. However, not all variants can be well imputed, and the current state-of-the-art imputation quality metric, denoted as standard Rsq, is poorly calibrated for lower-frequency variants. Here, we propose MagicalRsq, a machine-learning-based method that integrates variant-level imputation and population genetics statistics, to provide a better calibrated imputation quality metric. Leveraging WGS data from the Cystic Fibrosis Genome Project (CFGP), and whole-exome sequence data from UK Biobank (UKB), we performed comprehensive experiments to evaluate the performance of MagicalRsq compared to standard Rsq for partially sequenced studies. We found that MagicalRsq aligns better with true R² than standard Rsq in almost every situation evaluated, for both European and African ancestry samples. For example, when applying models trained from 1,992 CFGP sequenced samples to an independent 3,103 samples with no sequencing but TOPMed imputation from array genotypes, MagicalRsq, compared to standard Rsq, achieved net gains of 1.4 million rare, 117k low-frequency, and 18k common variants, where the net gain is the number of additional variants correctly distinguished by MagicalRsq relative to standard Rsq. MagicalRsq can serve as an improved post-imputation quality metric and will benefit downstream analysis by better distinguishing well-imputed variants from those poorly imputed. MagicalRsq is freely available on GitHub.


Subjects
Genome-Wide Association Study , Single Nucleotide Polymorphism , Humans , Genome-Wide Association Study/methods , Single Nucleotide Polymorphism/genetics , Calibration , Genotype , Machine Learning
3.
Mol Biol Evol ; 40(4), 2023 04 04.
Article in English | MEDLINE | ID: mdl-36947126

ABSTRACT

Gene flow between previously differentiated populations during the founding of an admixed or hybrid population has the potential to introduce adaptive alleles into the new population. If the adaptive allele is common in one source population, but not the other, then as the adaptive allele rises in frequency in the admixed population, genetic ancestry from the source containing the adaptive allele will increase nearby as well. Patterns of genetic ancestry have therefore been used to identify post-admixture positive selection in humans and other animals, including examples in immunity, metabolism, and animal coloration. A common method identifies regions of the genome that have local ancestry "outliers" compared with the distribution across the rest of the genome, considering each locus independently. However, we lack theoretical models for expected distributions of ancestry under various demographic scenarios, resulting in potential false positives and false negatives. Further, ancestry patterns between distant sites are often not independent. As a result, current methods tend to infer wide genomic regions containing many genes as under selection, limiting biological interpretation. Instead, we develop a deep learning object detection method applied to images generated from local ancestry-painted genomes. This approach preserves information from the surrounding genomic context and avoids potential pitfalls of user-defined summary statistics. We find the method is robust to a variety of demographic misspecifications using simulated data. Applied to human genotype data from Cabo Verde, we localize a known adaptive locus to a single narrow region compared with multiple or long windows obtained using two other ancestry-based methods.


Subjects
Population Genetics , Genomics , Animals , Humans , Genomics/methods , Genotype , Gene Flow , Chromosomes
4.
Mol Biol Evol ; 40(4), 2023 04 04.
Article in English | MEDLINE | ID: mdl-36971242

ABSTRACT

Aedes aegypti vectors the pathogens that cause dengue, yellow fever, Zika virus, and chikungunya and is a serious threat to public health in tropical regions. Decades of work have illuminated many aspects of Ae. aegypti's biology and global population structure and have identified insecticide resistance genes; however, the size and repetitive nature of the Ae. aegypti genome have limited our ability to detect positive selection in this mosquito. Combining new whole genome sequences from Colombia with publicly available data from Africa and the Americas, we identify multiple strong candidate selective sweeps in Ae. aegypti, many of which overlap genes linked to or implicated in insecticide resistance. We examine the voltage-gated sodium channel gene in three American cohorts and find evidence for successive selective sweeps in Colombia. The most recent sweep encompasses an intermediate-frequency haplotype containing four candidate insecticide resistance mutations that are in near-perfect linkage disequilibrium with one another in the Colombian sample. We hypothesize that this haplotype may continue to rapidly increase in frequency and perhaps spread geographically in the coming years. These results extend our knowledge of how insecticide resistance has evolved in this species and add to a growing body of evidence suggesting that Ae. aegypti has an extensive genomic capacity to rapidly adapt to insecticide-based vector control.


Subjects
Aedes , Insect Genome , Insecticide Resistance , Insecticides , Animals , Aedes/genetics , Dengue , Insecticide Resistance/genetics , Insecticides/pharmacology , Mosquito Vectors/genetics , Mutation , Zika Virus , Zika Virus Infection , Insect Genome/drug effects , Insect Genome/genetics
5.
Syst Biol ; 71(3): 526-546, 2022 04 19.
Article in English | MEDLINE | ID: mdl-34324671

ABSTRACT

Introgression is an important biological process affecting at least 10% of the extant species in the animal kingdom. Introgression significantly impacts inference of phylogenetic species relationships where a strictly binary tree model cannot adequately explain reticulate net-like species relationships. Here, we use phylogenomic approaches to understand patterns of introgression along the evolutionary history of a unique, nonmodel insect system: dragonflies and damselflies (Odonata). We demonstrate that introgression is a pervasive evolutionary force across various taxonomic levels within Odonata. In particular, we show that the morphologically "intermediate" species of Anisozygoptera (one of the three primary suborders within Odonata besides Zygoptera and Anisoptera), which retain phenotypic characteristics of the other two suborders, experienced high levels of introgression likely coming from zygopteran genomes. Additionally, we find evidence for multiple cases of deep inter-superfamilial ancestral introgression. [Gene flow; Odonata; phylogenomics; reticulate evolution.]


Subjects
Odonata , Animals , Genome , Insects/anatomy & histology , Odonata/anatomy & histology , Odonata/genetics , Phylogeny
6.
Mol Biol Evol ; 38(3): 1168-1183, 2021 03 09.
Article in English | MEDLINE | ID: mdl-33022051

ABSTRACT

Identification of partial sweeps, which include both hard and soft sweeps that have not currently reached fixation, provides crucial information about ongoing evolutionary responses. To this end, we introduce partialS/HIC, a deep learning method to discover selective sweeps from population genomic data. partialS/HIC uses a convolutional neural network for image processing, which is trained with a large suite of summary statistics derived from coalescent simulations incorporating population-specific history, to distinguish between completed versus partial sweeps, hard versus soft sweeps, and regions directly affected by selection versus those merely linked to nearby selective sweeps. We perform several simulation experiments under various demographic scenarios to demonstrate partialS/HIC's performance, which exhibits excellent resolution for detecting partial sweeps. We also apply our classifier to whole genomes from eight mosquito populations sampled across sub-Saharan Africa by the Anopheles gambiae 1000 Genomes Consortium, elucidating both continent-wide patterns as well as sweeps unique to specific geographic regions. These populations have experienced intense insecticide exposure over the past two decades, and we observe a strong overrepresentation of sweeps at insecticide resistance loci. Our analysis thus provides a list of candidate adaptive loci that may be relevant to mosquito control efforts. More broadly, our supervised machine learning approach introduces a method to distinguish between completed and partial sweeps, as well as between hard and soft sweeps, under a variety of demographic scenarios. As whole-genome data rapidly accumulate for a greater diversity of organisms, partialS/HIC addresses an increasing demand for useful selection scan tools that can track in-progress evolutionary dynamics.


Subjects
Anopheles/genetics , Deep Learning , Insecticide Resistance/genetics , Genetic Selection , Animals , Insect Genome
7.
Trends Genet ; 34(4): 301-312, 2018 04.
Article in English | MEDLINE | ID: mdl-29331490

ABSTRACT

As population genomic datasets grow in size, researchers are faced with the daunting task of making sense of a flood of information. To keep pace with this explosion of data, computational methodologies for population genetic inference are rapidly being developed to best utilize genomic sequence data. In this review we discuss a new paradigm that has emerged in computational population genomics: that of supervised machine learning (ML). We review the fundamentals of ML, discuss recent applications of supervised ML to population genetics that outperform competing methods, and describe promising future directions in this area. Ultimately, we argue that supervised ML is an important and underutilized tool that has considerable potential for the world of evolutionary genomics.


Subjects
Data Mining/methods , Population Genetics , Human Genome , Supervised Machine Learning , Biological Evolution , Datasets as Topic , Humans , Genetic Selection
8.
Syst Biol ; 69(2): 221-233, 2020 03 01.
Article in English | MEDLINE | ID: mdl-31504938

ABSTRACT

Reconstructing the phylogenetic relationships between species is one of the most formidable tasks in evolutionary biology. Multiple methods exist to reconstruct phylogenetic trees, each with their own strengths and weaknesses. Both simulation and empirical studies have identified several "zones" of parameter space where accuracy of some methods can plummet, even for four-taxon trees. Further, some methods can have undesirable statistical properties such as statistical inconsistency and/or the tendency to be positively misleading (i.e. assert strong support for the incorrect tree topology). Recently, deep learning techniques have made inroads on a number of both new and longstanding problems in biological research. In this study, we designed a deep convolutional neural network (CNN) to infer quartet topologies from multiple sequence alignments. This CNN can readily be trained to make inferences using both gapped and ungapped data. We show that our approach is highly accurate on simulated data, often outperforming traditional methods, and is remarkably robust to bias-inducing regions of parameter space such as the Felsenstein zone and the Farris zone. We also demonstrate that the confidence scores produced by our CNN can more accurately assess support for the chosen topology than bootstrap and posterior probability scores from traditional methods. Although numerous practical challenges remain, these findings suggest that deep learning approaches such as ours have the potential to produce more accurate phylogenetic inferences.


Subjects
Classification/methods , Deep Learning , Phylogeny , Sequence Alignment/methods , Computer Simulation , Neural Networks (Computer)
9.
PLoS Genet ; 14(4): e1007341, 2018 04.
Article in English | MEDLINE | ID: mdl-29684059

ABSTRACT

Hybridization and gene flow between species appear to be common. Even though it is clear that hybridization is widespread across all surveyed taxonomic groups, the magnitude and consequences of introgression are still largely unknown. Thus it is crucial to develop the statistical machinery required to uncover which genomic regions have recently acquired haplotypes via introgression from a sister population. We developed a novel machine learning framework, called FILET (Finding Introgressed Loci via Extra-Trees), capable of revealing genomic introgression with far greater power than competing methods. FILET works by combining information from a number of population genetic summary statistics, including several new statistics that we introduce, that capture patterns of variation across two populations. We show that FILET is able to identify loci that have experienced gene flow between related species with high accuracy, and in most situations can correctly infer which population was the donor and which was the recipient. Here we describe a data set of outbred diploid Drosophila sechellia genomes, and combine them with data from D. simulans to examine recent introgression between these species using FILET. Although we find that these populations may have split more recently than previously appreciated, FILET confirms that there has indeed been appreciable recent introgression (some of which might have been adaptive) between these species, and reveals that this gene flow is primarily in the direction of D. simulans to D. sechellia.


Subjects
Drosophila simulans/genetics , Drosophila/genetics , Insect Genome , Supervised Machine Learning , Animals , Computer Simulation , Drosophila/classification , Drosophila simulans/classification , Molecular Evolution , Gene Flow , Genetic Speciation , Genetic Variation , Population Genetics , Haplotypes , Genetic Hybridization , Genetic Models , Software , Species Specificity , Supervised Machine Learning/statistics & numerical data
10.
Proc Natl Acad Sci U S A ; 115(19): 5028-5033, 2018 05 08.
Article in English | MEDLINE | ID: mdl-29686078

ABSTRACT

Evidence for adaptation to different climates in the model species Arabidopsis thaliana is seen in reciprocal transplant experiments, but the genetic basis of this adaptation remains poorly understood. Field-based quantitative trait locus (QTL) studies provide direct but low-resolution evidence for the genetic basis of local adaptation. Using high-resolution population genomic approaches, we examine local adaptation along previously identified genetic trade-off (GT) and conditionally neutral (CN) QTLs for fitness between locally adapted Italian and Swedish A. thaliana populations [Ågren J, et al. (2013) Proc Natl Acad Sci USA 110:21077-21082]. We find that genomic regions enriched in high F_ST SNPs colocalize with GT QTL peaks. Many of these high F_ST regions also colocalize with regions enriched for SNPs significantly correlated to climate in Eurasia and evidence of recent selective sweeps in Sweden. Examining unfolded site frequency spectra across genes containing high F_ST SNPs suggests GTs may be due to more recent adaptation in Sweden than Italy. Finally, we collapse a list of thousands of genes spanning GT QTLs to 42 genes that likely underlie the observed GTs and explore potential biological processes driving these trade-offs, from protein phosphorylation, to seed dormancy and longevity. Our analyses link population genomic analyses and field-based QTL studies of local adaptation, and emphasize that GTs play an important role in the process of local adaptation.


Subjects
Physiological Adaptation/genetics , Arabidopsis/genetics , Plant Genome , Single Nucleotide Polymorphism , Quantitative Trait Loci , Italy , Sweden
11.
Mol Biol Evol ; 36(2): 220-238, 2019 02 01.
Article in English | MEDLINE | ID: mdl-30517664

ABSTRACT

Population-scale genomic data sets have given researchers incredible amounts of information from which to infer evolutionary histories. Concomitant with this flood of data, theoretical and methodological advances have sought to extract information from genomic sequences to infer demographic events such as population size changes and gene flow among closely related populations/species, construct recombination maps, and uncover loci underlying recent adaptation. To date, most methods make use of only one or a few summaries of the input sequences and therefore ignore potentially useful information encoded in the data. The most sophisticated of these approaches involve likelihood calculations, which require theoretical advances for each new problem, and often focus on a single aspect of the data (e.g., only allele frequency information) in the interest of mathematical and computational tractability. Directly interrogating the entirety of the input sequence data in a likelihood-free manner would thus offer a fruitful alternative. Here, we accomplish this by representing DNA sequence alignments as images and using a class of deep learning methods called convolutional neural networks (CNNs) to make population genetic inferences from these images. We apply CNNs to a number of evolutionary questions and find that they frequently match or exceed the accuracy of current methods. Importantly, we show that CNNs perform accurate evolutionary model selection and parameter estimation, even on problems that have not received detailed theoretical treatments. Thus, when applied to population genetic alignments, CNNs are capable of outperforming expert-derived statistical methods and offer a new path forward in cases where no likelihood approach exists.


Subjects
Population Genetics/methods , Neural Networks (Computer) , Animals , Genetic Hybridization , Genetic Recombination , Genetic Selection
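
Note on record 11: the abstract describes representing DNA sequence alignments as images for input to a CNN. The following is only a minimal sketch of one plausible encoding, not the authors' implementation. It assumes equal-length haplotype strings and a known ancestral sequence; the function name `alignment_to_image` and the toy data are hypothetical.

```python
import numpy as np

def alignment_to_image(haplotypes, ancestral):
    """Encode aligned haplotypes as a 0/1 matrix.

    Rows are haplotypes, columns are segregating sites;
    1 marks an allele that differs from the assumed ancestral state.
    """
    seqs = np.array([list(h) for h in haplotypes])   # shape: (n_haplotypes, n_sites)
    anc = np.array(list(ancestral))                  # shape: (n_sites,)
    derived = (seqs != anc).astype(np.uint8)
    # Keep only polymorphic columns (neither all 0 nor all 1).
    segregating = derived.any(axis=0) & ~derived.all(axis=0)
    return derived[:, segregating]

# Toy example (hypothetical data): four haplotypes, five aligned sites.
haps = ["ACGTA", "ACGAA", "TCGAA", "ACGTA"]
image = alignment_to_image(haps, ancestral="ACGTA")
print(image)
```

A matrix of this kind can be treated as a single-channel image. Real pipelines typically also pad or resize to a fixed width and may sort rows, and those choices are left out of this sketch.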
12.
Genome Res ; 26(1): 60-9, 2016 Jan.
Article in English | MEDLINE | ID: mdl-26518480

ABSTRACT

Knowledge of the genome-wide rate and spectrum of mutations is necessary to understand the origin of disease and the genetic variation driving all evolutionary processes. Here, we provide a genome-wide analysis of the rate and spectrum of mutations obtained in two Daphnia pulex genotypes via separate mutation-accumulation (MA) experiments. Unlike most MA studies that utilize haploid, homozygous, or self-fertilizing lines, D. pulex can be propagated ameiotically while maintaining a naturally heterozygous, diploid genome, allowing the capture of the full spectrum of genomic changes that arise in a heterozygous state. While base-substitution mutation rates are similar to those in other multicellular eukaryotes (about 4 × 10^-9 per site per generation), we find that the rates of large-scale (>100 kb) de novo copy-number variants (CNVs) are significantly elevated relative to those seen in previous MA studies. The heterozygosity maintained in this experiment allowed for estimates of gene-conversion processes. While most of the conversion tract lengths we report are similar to those generated by meiotic processes, we also find larger tract lengths that are indicative of mitotic processes. Comparison of MA lines to natural isolates reveals that a majority of large-scale CNVs in natural populations are removed by purifying selection. The mutations observed here share similarities with disease-causing, complex, large-scale CNVs, thereby demonstrating that MA studies in D. pulex serve as a system for studying the processes leading to such alterations.


Subjects
Daphnia/genetics , Gene Deletion , Gene Duplication , Mutation Rate , Animals , DNA Copy Number Variations , Molecular Evolution , Female , Genetic Association Studies , Genetic Variation , Heterozygote , Male , DNA Sequence Analysis
13.
PLoS Genet ; 12(3): e1005928, 2016 Mar.
Article in English | MEDLINE | ID: mdl-26977894

ABSTRACT

Detecting the targets of adaptive natural selection from whole genome sequencing data is a central problem for population genetics. However, to date most methods have shown sub-optimal performance under realistic demographic scenarios. Moreover, over the past decade there has been a renewed interest in determining the importance of selection from standing variation in adaptation of natural populations, yet very few methods for inferring this model of adaptation at the genome scale have been introduced. Here we introduce a new method, S/HIC, which uses supervised machine learning to precisely infer the location of both hard and soft selective sweeps. We show that S/HIC has unrivaled accuracy for detecting sweeps under demographic histories that are relevant to human populations, and distinguishing sweeps from linked as well as neutrally evolving regions. Moreover, we show that S/HIC is uniquely robust among its competitors to model misspecification. Thus, even if the true demographic model of a population differs catastrophically from that specified by the user, S/HIC still retains impressive discriminatory power. Finally, we apply S/HIC to the case of resequencing data from human chromosome 18 in a European population sample, and demonstrate that we can reliably recover selective sweeps that have been identified earlier using less specific and sensitive methods.


Subjects
Genetic Drift , Population Genetics , Machine Learning , Genetic Selection/genetics , Human Chromosome Pair 18/genetics , Human Genome , Haplotypes/genetics , Humans
14.
Mol Biol Evol ; 34(8): 1863-1877, 2017 08 01.
Article in English | MEDLINE | ID: mdl-28482049

ABSTRACT

The degree to which adaptation in recent human evolution shapes genetic variation remains controversial. This is in part due to the limited evidence in humans for classic "hard selective sweeps", wherein a novel beneficial mutation rapidly sweeps through a population to fixation. However, positive selection may often proceed via "soft sweeps" acting on mutations already present within a population. Here, we examine recent positive selection across six human populations using a powerful machine learning approach that is sensitive to both hard and soft sweeps. We found evidence that soft sweeps are widespread and account for the vast majority of recent human adaptation. Surprisingly, our results also suggest that linked positive selection affects patterns of variation across much of the genome, and may increase the frequencies of deleterious mutations. Our results also reveal insights into the role of sexual selection, cancer risk, and central nervous system development in recent human evolution.


Subjects
Physiological Adaptation/genetics , Human Genome/genetics , Acclimatization , Biological Adaptation/genetics , Nucleic Acid Databases , Molecular Evolution , Genetic Variation/genetics , Population Genetics , Humans , Machine Learning , Mutation , Genetic Selection/genetics
15.
Mol Biol Evol ; 33(5): 1308-16, 2016 05.
Article in English | MEDLINE | ID: mdl-26809315

ABSTRACT

Genetic differentiation across populations that is maintained in the presence of gene flow is a hallmark of spatially varying selection. In Drosophila melanogaster, the latitudinal clines across the eastern coasts of Australia and North America appear to be examples of this type of selection, with recent studies showing that a substantial portion of the D. melanogaster genome exhibits allele frequency differentiation with respect to latitude on both continents. As of yet there has been no genome-wide examination of differentiated copy-number variants (CNVs) in these geographic regions, despite their potential importance for phenotypic variation in Drosophila and other taxa. Here, we present an analysis of geographic variation in CNVs in D. melanogaster. We also present the first genomic analysis of geographic variation for copy-number variation in the sister species, D. simulans, in order to investigate patterns of parallel evolution in these close relatives. In D. melanogaster we find hundreds of CNVs, many of which show parallel patterns of geographic variation on both continents, lending support to the idea that they are influenced by spatially varying selection. These findings support the idea that polymorphic CNVs contribute to local adaptation in D. melanogaster. In contrast, we find very few CNVs in D. simulans that are geographically differentiated in parallel on both continents, consistent with earlier work suggesting that clinal patterns are weaker in this species.


Subjects
Physiological Adaptation/genetics , DNA Copy Number Variations , Drosophila melanogaster/genetics , Animals , Biological Evolution , Molecular Evolution , Female , Gene Frequency , Genetic Variation , Population Genetics/methods , Phylogeny , Phylogeography/methods , Single Nucleotide Polymorphism , Genetic Selection
16.
Bioinformatics ; 32(24): 3839-3841, 2016 12 15.
Article in English | MEDLINE | ID: mdl-27559153

ABSTRACT

Here we describe discoal, a coalescent simulator able to generate population samples that include selective sweeps in a feature-rich, flexible manner. discoal can perform simulations conditioning on the fixation of an allele due to drift or either hard or soft sweeps, even those occurring a large genetic distance away from the simulated locus. discoal can simulate sweeps with recurrent mutation to the adaptive allele, recombination, and gene conversion, under non-equilibrium demographic histories and without specifying an allele frequency trajectory in advance. AVAILABILITY AND IMPLEMENTATION: discoal is implemented in the C programming language. Source code is freely available on GitHub (https://github.com/kern-lab/discoal) under a GNU General Public License. CONTACT: kern@dls.rutgers.edu or dan.schrider@rutgers.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Alleles , Computational Biology/methods , Population Genetics/methods , Software , Computer Simulation , Gene Frequency , Genetic Models , Mutation , Programming Languages , Stochastic Processes
17.
PLoS Genet ; 9(1): e1003242, 2013.
Article in English | MEDLINE | ID: mdl-23359205

ABSTRACT

The era of whole-genome sequencing has revealed that gene copy-number changes caused by duplication and deletion events have important evolutionary, functional, and phenotypic consequences. Recent studies have therefore focused on revealing the extent of variation in copy-number within natural populations of humans and other species. These studies have found a large number of copy-number variants (CNVs) in humans, many of which have been shown to have clinical or evolutionary importance. For the most part, these studies have failed to detect an important class of gene copy-number polymorphism: gene duplications caused by retrotransposition, which result in a new intron-less copy of the parental gene being inserted into a random location in the genome. Here we describe a computational approach leveraging next-generation sequence data to detect gene copy-number variants caused by retrotransposition (retroCNVs), and we report the first genome-wide analysis of these variants in humans. We find that retroCNVs account for a substantial fraction of gene copy-number differences between any two individuals. Moreover, we show that these variants may often result in expressed chimeric transcripts, underscoring their potential for the evolution of novel gene functions. By locating the insertion sites of these duplicates, we are able to show that retroCNVs have had an important role in recent human adaptation, and we also uncover evidence that positive selection may currently be driving multiple retroCNVs toward fixation. Together these findings imply that retroCNVs are an especially important class of polymorphism, and that future studies of copy-number variation should search for these variants in order to illuminate their potential evolutionary and functional relevance.


Subjects
Computational Biology/methods , DNA Copy Number Variations/genetics , Gene Duplication , Retroelements/genetics , Base Sequence , Biological Evolution , Chromosome Mapping , Humans , Introns , Phenotype , DNA Sequence Analysis , Sequence Deletion
18.
PLoS Comput Biol ; 10(12): e1003998, 2014 Dec.
Article in English | MEDLINE | ID: mdl-25474019

ABSTRACT

Current sequencing methods produce large amounts of data, but genome assemblies based on these data are often woefully incomplete. These incomplete and error-filled assemblies result in many annotation errors, especially in the number of genes present in a genome. In this paper we investigate the magnitude of the problem, both in terms of total gene number and the number of copies of genes in specific families. To do this, we compare multiple draft assemblies against higher-quality versions of the same genomes, using several new assemblies of the chicken genome based on both traditional and next-generation sequencing technologies, as well as published draft assemblies of chimpanzee. We find that upwards of 40% of all gene families are inferred to have the wrong number of genes in draft assemblies, and that these incorrect assemblies both add and subtract genes. Using simulated genome assemblies of Drosophila melanogaster, we find that the major cause of increased gene numbers in draft genomes is the fragmentation of genes onto multiple individual contigs. Finally, we demonstrate the usefulness of RNA-Seq in improving the gene annotation of draft assemblies, largely by connecting genes that have been fragmented in the assembly process.


Subjects
Genome/genetics , Genomics/methods , Sequence Analysis, DNA/methods , Sequence Analysis, RNA/methods , Animals , Chickens/genetics , Chromosome Mapping , Drosophila melanogaster/genetics , Pan troglodytes/genetics
19.
Genome Res ; 21(12): 2087-95, 2011 Dec.
Article in English | MEDLINE | ID: mdl-22135405

ABSTRACT

Gene duplication via retrotransposition has been shown to be an important mechanism in evolution, affecting gene dosage and allowing for the acquisition of new gene functions. Although fixed retrotransposed genes have been found in a variety of species, very little effort has been made to identify retrogene polymorphisms. Here, we examine 37 Illumina-sequenced North American Drosophila melanogaster inbred lines and present the first ever data set and analysis of polymorphic retrogenes in Drosophila. We show that this type of polymorphism is quite common, with any two gametes in the North American population differing in the presence or absence of six retrogenes, accounting for ~13% of gene copy-number heterozygosity. These retrogenes were identified by a straightforward method that can be applied using any type of DNA sequencing data. We also use a variant of this method to conduct a genome-wide scan for intron presence/absence polymorphisms, and show that any two chromosomes in the population likely differ in the presence of multiple introns. We show that these polymorphisms are all in fact deletions rather than intron gain events present in the reference genome. Finally, by leveraging the known location of the parental genes that give rise to the retrogene polymorphisms, we provide direct evidence that natural selection is responsible for the excess of fixations of retrogenes moving off of the X chromosome in Drosophila. Further efforts to identify retrogene and intron presence/absence polymorphisms will undoubtedly improve our understanding of the evolution of gene copy number and gene structure.


Subjects
Chromosomes, Insect/genetics , Gene Dosage/physiology , Genes, Insect/physiology , Introns/physiology , Polymorphism, Genetic/physiology , X Chromosome/genetics , Animals , Drosophila melanogaster , Female , Genome-Wide Association Study , Male
20.
bioRxiv ; 2024 Apr 18.
Article in English | MEDLINE | ID: mdl-38645049

ABSTRACT

Simulations are an essential tool in all areas of population genetic research, used in tasks such as the validation of theoretical analysis and the study of complex evolutionary models. Forward-in-time simulations are especially flexible, allowing for various types of natural selection, complex genetic architectures, and non-Wright-Fisher dynamics. However, their intense computational requirements can be prohibitive to simulating large populations and genomes. A popular method to alleviate this burden is to scale down the population size by some scaling factor while scaling up the mutation rate, selection coefficients, and recombination rate by the same factor. However, this rescaling approach may in some cases bias simulation results. To investigate the manner and degree to which rescaling impacts simulation outcomes, we carried out simulations with different demographic histories and distributions of fitness effects using several values of the rescaling factor, Q, and compared the deviation of key outcomes (fixation times, fixation probabilities, allele frequencies, and linkage disequilibrium) between the scaled and unscaled simulations. Our results indicate that scaling introduces substantial biases to each of these measured outcomes, even at small values of Q. Moreover, the nature of these effects depends on the evolutionary model and scaling factor being examined. While increasing the scaling factor tends to increase the observed biases, this relationship is not always straightforward, so it may be difficult to know the impact of scaling on simulation outcomes a priori. However, it appears that for most models, only a small number of replicates was needed to accurately quantify the bias produced by rescaling for a given Q.
In summary, while rescaling forward-in-time simulations may be necessary in many cases, researchers should be aware of the rescaling effect's impact on simulation outcomes and consider investigating its magnitude in smaller scale simulations of the desired model(s) before selecting an appropriate value of Q.
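
The rescaling described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the `SimParams` container, the `rescale` and `compound` helpers, and the example parameter values are all hypothetical. It only shows the bookkeeping of dividing the population size by Q while multiplying the mutation rate, recombination rate, and selection coefficient by Q, which leaves the products N·mu, N·r, and N·s unchanged (the usual rationale for rescaling, supplied here as background rather than taken from the abstract).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimParams:
    """Hypothetical container for forward-simulation parameters."""
    pop_size: int              # N, number of individuals
    mutation_rate: float       # mu, per site per generation
    recombination_rate: float  # r, per site per generation
    selection_coeff: float     # s, for one class of selected mutation


def rescale(params: SimParams, q: float) -> SimParams:
    """Divide N by q and multiply mu, r, and s by q."""
    if q < 1:
        raise ValueError("q must be >= 1")
    return SimParams(
        pop_size=max(1, round(params.pop_size / q)),
        mutation_rate=params.mutation_rate * q,
        recombination_rate=params.recombination_rate * q,
        selection_coeff=params.selection_coeff * q,
    )


def compound(params: SimParams) -> tuple[float, float, float]:
    """Return the population-scaled products (N*mu, N*r, N*s)."""
    n = params.pop_size
    return (
        n * params.mutation_rate,
        n * params.recombination_rate,
        n * params.selection_coeff,
    )


# Illustrative values only; they are not taken from the study.
unscaled = SimParams(
    pop_size=10_000,
    mutation_rate=1e-8,
    recombination_rate=1e-8,
    selection_coeff=0.01,
)
scaled = rescale(unscaled, q=10)

print(scaled)
print(compound(unscaled))
print(compound(scaled))
```

The sketch does not check that the rescaled rates stay within sensible ranges, and matching these products does not by itself guarantee matching simulation outcomes, which is the concern the abstract raises.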
