Results 1 - 20 of 76
1.
Mamm Genome ; 2024 Aug 01.
Article in English | MEDLINE | ID: mdl-39088040

ABSTRACT

For over 15 years, canine genetics research relied on a reference assembly from a Boxer breed dog named Tasha (i.e., canFam3.1). Recent advances in long-read sequencing and genome assembly have led to the development of numerous high-quality assemblies from diverse canines. These assemblies represent notable improvements in completeness, contiguity, and the representation of gene promoters and gene models. Although genome graph and pan-genome approaches have promise, most genetic analyses in canines rely upon the mapping of Illumina sequencing reads to a single reference. The Dog10K consortium, and others, have generated deep catalogs of genetic variation through an alignment of Illumina sequencing reads to a reference genome obtained from a German Shepherd Dog named Mischka (i.e., canFam4, UU_Cfam_GSD_1.0). However, alignment to a breed-derived genome may introduce bias in genotype calling across samples. Since the use of an outgroup reference genome may remove this effect, we have reprocessed 1929 samples analyzed by the Dog10K consortium using a Greenland wolf (mCanLor1.2) as the reference. We efficiently performed remapping and variant calling using a GPU-implementation of common analysis tools. The resulting call set removes the variability in genetic differences seen across samples, and breed relationships revealed by principal component analysis are not affected by the choice of reference genome. Using this sequence data, we inferred the history of population sizes and found that village dog populations experienced a 9-13 fold reduction in historic effective population size relative to wolves.

2.
Genome Biol Evol ; 16(7)2024 Jul 03.
Article in English | MEDLINE | ID: mdl-38946312

ABSTRACT

Recent years have seen a dramatic increase in the number of canine genome assemblies available. Duplications are an important source of evolutionary novelty and are also prone to misassembly. We explored the duplication content of nine canine genome assemblies using both genome self-alignment and read-depth approaches. We find that 8.58% of the genome is duplicated in the canFam4 assembly, derived from the German Shepherd Dog Mischka, including 90.15% of unplaced contigs. Highlighting the continued difficulty in properly assembling duplications, less than half of read-depth and assembly alignment duplications overlap, but the mCanLor1.2 Greenland wolf assembly shows greater concordance. Further study shows the presence of multiple segments that have alignments to four or more duplicate copies. These high-recurrence duplications correspond to gene retrocopies. We identified 3,892 candidate retrocopies from 1,316 parental genes in the canFam4 assembly and find that ∼8.82% of duplicated base pairs involve a retrocopy, confirming this mechanism as a major driver of gene duplication in canines. Similar patterns are found across eight other recent canine genome assemblies, with metrics supporting a greater quality of the PacBio HiFi mCanLor1.2 assembly. Comparison between the wolf and other canine assemblies found that 92% of retrocopy insertions are shared between assemblies. By calculating the number of generations since genome divergence, we estimate that new retrocopy insertions appear, on average, in 1 out of 3,514 births. Our analyses illustrate the impact of retrogene formation on canine genomes and highlight the variable representation of duplicated sequences among recently completed canine assemblies.


Subject(s)
Gene Duplication , Genome , Dogs/genetics , Animals , Genomics , Evolution, Molecular , Retroelements
4.
Nucleic Acids Res ; 52(13): 7761-7779, 2024 Jul 22.
Article in English | MEDLINE | ID: mdl-38850156

ABSTRACT

Alu elements are non-autonomous Short INterspersed Elements (SINEs) derived from the 7SL RNA gene that are present at over one million copies in human genomic DNA. Alu mobilizes by a mechanism known as retrotransposition, which requires the Long INterspersed Element-1 (LINE-1) ORF2-encoded protein (ORF2p). Here, we demonstrate that HeLa strains differ in their capacity to support Alu retrotransposition. Human Alu elements retrotranspose efficiently in HeLa-HA and HeLa-CCL2 (Alu-permissive) strains, but not in HeLa-JVM or HeLa-H1 (Alu-nonpermissive) strains. A similar pattern of retrotransposition was observed for other 7SL RNA-derived SINEs and tRNA-derived SINEs. In contrast, mammalian LINE-1s, a zebrafish LINE, a human SINE-VNTR-Alu (SVA) element, and an L1 ORF1-containing mRNA can retrotranspose in all four HeLa strains. Using an in vitro reverse transcriptase-based assay, we show that Alu RNAs associate with ORF2p and are converted into cDNAs in both Alu-permissive and Alu-nonpermissive HeLa strains, suggesting that 7SL- and tRNA-derived SINEs use strategies to 'hijack' L1 ORF2p that are distinct from those used by SVA elements and ORF1-containing mRNAs. These data further suggest ORF2p associates with the Alu RNA poly(A) tract in both Alu-permissive and Alu-nonpermissive HeLa strains, but that Alu retrotransposition is blocked after this critical step in Alu-nonpermissive HeLa strains.


Subject(s)
Alu Elements , Long Interspersed Nucleotide Elements , Humans , HeLa Cells , Alu Elements/genetics , Long Interspersed Nucleotide Elements/genetics , Short Interspersed Nucleotide Elements/genetics , Animals , Retroelements/genetics , RNA/genetics , RNA/metabolism , RNA, Transfer/genetics , RNA, Transfer/metabolism , Zebrafish/genetics
5.
bioRxiv ; 2024 May 03.
Article in English | MEDLINE | ID: mdl-38746229

ABSTRACT

Alu elements are non-autonomous Short INterspersed Elements (SINEs) derived from the 7SL RNA gene that are present at over one million copies in human genomic DNA. Alu mobilizes by a mechanism known as retrotransposition, which requires the Long INterspersed Element-1 (LINE-1 or L1) ORF2-encoded protein (ORF2p). Here, we demonstrate that HeLa strains differ in their capacity to support Alu retrotransposition. Human Alu elements retrotranspose efficiently in HeLa-HA and HeLa-CCL2 (Alu-permissive) strains, but not in HeLa-JVM or HeLa-H1 (Alu-nonpermissive) strains. A similar pattern of retrotransposition was observed for other 7SL RNA-derived SINEs and tRNA-derived SINEs. In contrast, mammalian LINE-1s, a zebrafish LINE, a human SINE-VNTR-Alu (SVA) element, and an L1 ORF1-containing messenger RNA can retrotranspose in all four HeLa strains. Using an in vitro reverse transcriptase-based assay, we show that Alu RNAs associate with ORF2p and are converted into cDNAs in both Alu-permissive and Alu-nonpermissive HeLa strains, suggesting that 7SL- and tRNA-derived SINE RNAs use strategies to 'hijack' L1 ORF2p that are distinct from those used by SVA elements and ORF1-containing mRNAs. These data further suggest ORF2p associates with the Alu RNA poly(A) tract in both Alu-permissive and Alu-nonpermissive HeLa strains, but that Alu retrotransposition is blocked after this critical step in Alu-nonpermissive HeLa strains.

6.
Nat Commun ; 15(1): 4220, 2024 May 17.
Article in English | MEDLINE | ID: mdl-38760338

ABSTRACT

When somatic cells acquire complex karyotypes, they often are removed by the immune system. Mutant somatic cells that evade immune surveillance can lead to cancer. Neurons with complex karyotypes arise during neurotypical brain development, but neurons are almost never the origin of brain cancers. Instead, somatic mutations in neurons can bring about neurodevelopmental disorders, and contribute to the polygenic landscape of neuropsychiatric and neurodegenerative disease. A subset of human neurons harbors idiosyncratic copy number variants (CNVs, "CNV neurons"), but previous analyses of CNV neurons are limited by relatively small sample sizes. Here, we develop an allele-based validation approach, SCOVAL, to corroborate or reject read-depth based CNV calls in single human neurons. We apply this approach to 2,125 frontal cortical neurons from a neurotypical human brain. SCOVAL identifies 226 CNV neurons, which include a subclass of 65 CNV neurons with highly aberrant karyotypes containing whole or substantial losses on multiple chromosomes. Moreover, we find that CNV location appears to be nonrandom. Recurrent regions of neuronal genome rearrangement contain fewer, but longer, genes.


Subject(s)
DNA Copy Number Variations , Mosaicism , Neurons , Humans , Neurons/metabolism , Alleles
7.
Sci Data ; 10(1): 813, 2023 11 20.
Article in English | MEDLINE | ID: mdl-37985666

ABSTRACT

Somatic mosaicism is defined as an occurrence of two or more populations of cells having genomic sequences differing at given loci in an individual who is derived from a single zygote. It is a characteristic of multicellular organisms that plays a crucial role in normal development and disease. To study the nature and extent of somatic mosaicism in autism spectrum disorder, bipolar disorder, focal cortical dysplasia, schizophrenia, and Tourette syndrome, a multi-institutional consortium called the Brain Somatic Mosaicism Network (BSMN) was formed through the National Institute of Mental Health (NIMH). In addition to genomic data of affected and neurotypical brains, the BSMN also developed and validated a best practices somatic single nucleotide variant calling workflow through the analysis of reference brain tissue. These resources, which include >400 terabytes of data from 1087 subjects, are now available to the research community via the NIMH Data Archive (NDA) and are described here.


Subject(s)
Mental Disorders , Humans , Autism Spectrum Disorder/genetics , Brain , Genomics , Mosaicism , Genome, Human , Mental Disorders/genetics
8.
G3 (Bethesda) ; 13(11)2023 11 01.
Article in English | MEDLINE | ID: mdl-37681359

ABSTRACT

Recent advances in long-read sequencing have enabled the creation of reference-quality genome assemblies for multiple individuals within a species. In particular, 8 long-read genome assemblies have recently been published for the canine model (dogs and wolves). These assemblies were created using a range of sequencing and computational approaches, with only limited comparisons described among subsets of the assemblies. Here we present 3 high-quality de novo reference assemblies based upon Oxford Nanopore long-read sequencing: 2 Bernese Mountain Dogs (BD & OD) and a Cairn terrier (CA611). These breeds are of particular interest due to the enrichment of unresolved genetic disorders. Leveraging advancement in software technologies, we utilized published data of Labrador Retriever (Yella) to generate a new assembly, resulting in a ∼280-fold increase in continuity (N50 size of 91 kbp vs 25.75 Mbp). In conjunction with these 4 new assemblies, we uniformly assessed 8 existing assemblies for generalized quality metrics, sequence divergence, and a detailed BUSCO assessment. We identified a set of ∼400 conserved genes during the BUSCO analysis missing in all assemblies. Genome-wide methylation profiles were generated from the nanopore sequencing, resulting in broad concordance with existing whole-genome and reduced-representation bisulfite sequencing, while highlighting superior coverage of mobile elements. These analyses demonstrate the ability of Nanopore sequencing to resolve the sequence and epigenetic profile of canine genomes.


Subject(s)
Nanopores , Dogs , Animals , Methylation , Genome , Sequence Analysis, DNA , Software , High-Throughput Nucleotide Sequencing
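
The continuity figure quoted in the abstract above (N50 of 91 kbp vs 25.75 Mbp) can be illustrated with a short calculation. The following is a minimal sketch, not the authors' code: the function names and the toy contig lengths are hypothetical, and only the two N50 values come from the abstract.

```python
def n50(lengths):
    """Return the N50: the contig length L such that contigs of
    length >= L together cover at least half of the assembly."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def fold_increase(old_n50_bp, new_n50_bp):
    """Ratio of two N50 values given in base pairs."""
    return new_n50_bp / old_n50_bp


# Hypothetical toy assembly, for illustration only.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))

# N50 values quoted in the abstract: 91 kbp and 25.75 Mbp.
print(round(fold_increase(91_000, 25_750_000)))
```

With the quoted values the ratio comes out near 283, consistent with the "∼280-fold" stated in the abstract.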
9.
Genome Biol ; 24(1): 187, 2023 08 15.
Article in English | MEDLINE | ID: mdl-37582787

ABSTRACT

BACKGROUND: The international Dog10K project aims to sequence and analyze several thousand canine genomes. Incorporating 20× data from 1987 individuals, including 1611 dogs (321 breeds), 309 village dogs, 63 wolves, and four coyotes, we identify genomic variation across the canid family, setting the stage for detailed studies of domestication, behavior, morphology, disease susceptibility, and genome architecture and function. RESULTS: We report the analysis of >48 M single-nucleotide, indel, and structural variants spanning the autosomes, X chromosome, and mitochondria. We discover more than 75% of variation for 239 sampled breeds. Allele sharing analysis indicates that 94.9% of breeds form monophyletic clusters and 25 major clades. German Shepherd Dogs and related breeds show the highest allele sharing with independent breeds from multiple clades. On average, each breed dog differs from the UU_Cfam_GSD_1.0 reference at 26,960 deletions and 14,034 insertions greater than 50 bp, with wolves having 14% more variants. Discovered variants include retrogene insertions from 926 parent genes. To aid functional prioritization, single-nucleotide variants were annotated with SnpEff and Zoonomia phyloP constraint scores. Constrained positions were negatively correlated with allele frequency. Finally, the utility of the Dog10K data as an imputation reference panel is assessed, generating high-confidence calls across varied genotyping platform densities including for breeds not included in the Dog10K collection. CONCLUSIONS: We have developed a dense dataset of 1987 sequenced canids that reveals patterns of allele sharing, identifies likely functional variants, informs breed structure, and enables accurate imputation. Dog10K data are publicly available.


Subject(s)
Wolves , Dogs , Animals , Wolves/genetics , Chromosome Mapping , Alleles , Polymorphism, Single Nucleotide , Nucleotides , Demography
10.
bioRxiv ; 2023 Mar 07.
Article in English | MEDLINE | ID: mdl-36945473

ABSTRACT

When somatic cells acquire complex karyotypes, they are removed by the immune system. Mutant somatic cells that evade immune surveillance can lead to cancer. Neurons with complex karyotypes arise during neurotypical brain development, but neurons are almost never the origin of brain cancers. Instead, somatic mutations in neurons can bring about neurodevelopmental disorders, and contribute to the polygenic landscape of neuropsychiatric and neurodegenerative disease. A subset of human neurons harbors idiosyncratic copy number variants (CNVs, "CNV neurons"), but previous analyses of CNV neurons have been limited by relatively small sample sizes. Here, we developed an allele-based validation approach, SCOVAL, to corroborate or reject read-depth based CNV calls in single human neurons. We applied this approach to 2,125 frontal cortical neurons from a neurotypical human brain. This approach identified 226 CNV neurons, as well as a class of CNV neurons with complex karyotypes containing whole or substantial losses on multiple chromosomes. Moreover, we found that CNV location appears to be nonrandom. Recurrent regions of neuronal genome rearrangement contained fewer, but longer, genes.

11.
Genome Res ; 2022 Aug 12.
Article in English | MEDLINE | ID: mdl-35961775

ABSTRACT

Gene retrocopies arise from the reverse transcription and insertion into the genome of processed mRNA transcripts. Although many retrocopies have acquired mutations that render them functionally inactive, most mammals retain active LINE-1 sequences capable of producing new retrocopies. New retrocopies, referred to as retro copy number variants (retroCNVs), may not be identified by standard variant calling techniques in high-throughput sequencing data. Although multiple functional FGF4 retroCNVs have been associated with skeletal dysplasias in dogs, the full landscape of canid retroCNVs has not been characterized. Here, retroCNV discovery was performed on a whole-genome sequencing data set of 293 canids from 76 breeds. We identified retroCNV parent genes via the presence of mRNA-specific 30-mers, and then identified retroCNV insertion sites through discordant read analysis. In total, we resolved insertion sites for 1911 retroCNVs from 1179 parent genes, 1236 of which appeared identical to their parent genes. Dogs had on average 54.1 total retroCNVs and 1.4 private retroCNVs. We found evidence of expression in testes for 12% (14/113) of the retroCNVs identified in six Golden Retrievers, including four chimeric transcripts, and 97 retroCNVs also had significantly elevated FST across dog breeds, possibly indicating selection. We applied our approach to a subset of human genomes and detected an average of 4.2 retroCNVs per sample, highlighting a 13-fold relative increase of retroCNV frequency in dogs. Particularly in canids, retroCNVs are a largely unexplored source of genetic variation that can contribute to genome plasticity and that should be considered when investigating traits and diseases.
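
The idea of "mRNA-specific" k-mers mentioned in the abstract above can be sketched as follows. This is an illustrative assumption about what such k-mers could look like (k-mers that span an exon-exon junction in the spliced transcript and do not occur in the genomic sequence), not the pipeline used in the study. The function name, the toy sequences, and the choice of k = 6 are hypothetical; the abstract uses 30-mers.

```python
def junction_kmers(exons, genome, k):
    """Return k-mers of the spliced transcript that span an exon-exon
    junction and are absent from the genomic sequence."""
    mrna = "".join(exons)
    # Every k-mer present in the (unspliced) genomic sequence.
    genome_kmers = {genome[i:i + k] for i in range(len(genome) - k + 1)}

    # Positions in the mRNA where one exon ends and the next begins.
    boundaries = []
    pos = 0
    for exon in exons[:-1]:
        pos += len(exon)
        boundaries.append(pos)

    specific = set()
    for i in range(len(mrna) - k + 1):
        # The k-mer covers mRNA positions i .. i+k-1; it spans a junction
        # if it contains bases on both sides of a boundary.
        spans_junction = any(i < b < i + k for b in boundaries)
        kmer = mrna[i:i + k]
        if spans_junction and kmer not in genome_kmers:
            specific.add(kmer)
    return specific


# Hypothetical two-exon gene with one intron, for illustration only.
exon1, intron, exon2 = "ATGGCC", "GTAAGTTTTCAG", "TTCGAA"
toy_genome = exon1 + intron + exon2
toy_kmers = junction_kmers([exon1, exon2], toy_genome, k=6)
print(sorted(toy_kmers))
```

Reads containing such k-mers cannot come from the intron-containing parent locus, which is why they could serve as evidence for a processed copy elsewhere in the genome. This sketch ignores reverse complements and sequencing errors.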

13.
mBio ; 13(2): e0374821, 2022 04 26.
Article in English | MEDLINE | ID: mdl-35384697

ABSTRACT

Integration site landscapes, clonal dynamics, and latency reversal with or without vpr were compared in HIV-1-infected Jurkat cell populations, and the properties of individual clones were defined. Clones differed in fractions of long terminal repeat (LTR)-active daughter cells, with some clones containing few to no LTR-active cells, while almost all cells were LTR active for others. Clones varied over 4 orders of magnitude in virus release per active cell. Proviruses in largely LTR-active clones were closer to preexisting enhancers and promoters than low-LTR-active clones. Unsurprisingly, major vpr+ clones contained fewer LTR-active cells than vpr- clones, and predominant vpr+ proviruses were farther from enhancers and promoters than those in vpr- pools. Distances to these marks among intact proviruses previously reported for antiretroviral therapy (ART)-suppressed patients revealed that patient integration sites were more similar to those in the vpr+ pool than to vpr- integrants. Complementing vpr-defective proviruses with vpr led to the rapid loss of highly LTR-active clones, indicating that the effect of Vpr on proviral populations occurred after integration. However, major clones in the complemented pool and its vpr- parent population did not differ in burst sizes. When the latency reactivation agents prostratin and JQ1 were applied separately or in combination, vpr+ and vpr- population-wide trends were similar, with dual-treatment enhancement being due in part to reactivated clones that did not respond to either drug applied separately. However, the expression signatures of individual clones differed between populations. These observations highlight how Vpr, exerting selective pressure on proviral epigenetic variation, can shape integration site landscapes, proviral expression patterns, and reactivation properties. IMPORTANCE A bedrock assumption in HIV-1 population modeling is that all active cells release the same amount of virus. However, the findings here revealed that when HIV-infected cells expand into clones, each clone differs in virus production. Reasoning that this variation in expression patterns constituted a population of clones from which differing subsets would prevail under differing environmental conditions, the cytotoxic HIV-1 protein Vpr was introduced, and population dynamics and expression properties were compared in the presence and absence of Vpr. The results showed that whereas most clones produced fairly continuous levels of virus in the absence of Vpr, its presence selected for a distinct subset of clones with properties reminiscent of persistent populations in patients, suggesting the possibility that the interclonal variation in expression patterns observed in culture may contribute to proviral persistence in vivo.


Subject(s)
HIV Seropositivity , HIV-1 , HIV-1/physiology , Humans , Jurkat Cells , Proviruses/genetics , vpr Gene Products, Human Immunodeficiency Virus/genetics , vpr Gene Products, Human Immunodeficiency Virus/metabolism
14.
Trends Genet ; 38(5): 514-515, 2022 05.
Article in English | MEDLINE | ID: mdl-35232612

Subject(s)
Dog Diseases , Animals , Dogs
15.
Genes (Basel) ; 12(6)2021 05 30.
Article in English | MEDLINE | ID: mdl-34070911

ABSTRACT

The domestic dog has evolved to be an important biomedical model for studies regarding the genetic basis of disease, morphology and behavior. Genetic studies in the dog have relied on a draft reference genome of a purebred female boxer dog named "Tasha" initially published in 2005. Derived from a Sanger whole genome shotgun sequencing approach coupled with limited clone-based sequencing, the initial assembly and subsequent updates have served as the predominant resource for canine genetics for 15 years. While the initial assembly produced a good-quality draft, as with all assemblies produced at the time, it contained gaps, assembly errors and missing sequences, particularly in GC-rich regions, which are found at many promoters and in the first exons of protein-coding genes. Here, we present Dog10K_Boxer_Tasha_1.0, an improved chromosome-level highly contiguous genome assembly of Tasha created with long-read technologies that increases sequence contiguity >100-fold, closes >23,000 gaps of the CanFam3.1 reference assembly and improves gene annotation by identifying >1200 new protein-coding transcripts. The assembly and annotation are available at NCBI under the accession GCF_000002285.5.


Subject(s)
Dogs/genetics , Genome , Animals , Contig Mapping , Molecular Sequence Annotation
16.
Proc Natl Acad Sci U S A ; 118(11)2021 03 16.
Article in English | MEDLINE | ID: mdl-33836575

ABSTRACT

Technological advances have allowed improvements in genome reference sequence assemblies. Here, we combined long- and short-read sequence resources to assemble the genome of a female Great Dane dog. This assembly has improved continuity compared to the existing Boxer-derived (CanFam3.1) reference genome. Annotation of the Great Dane assembly identified 22,182 protein-coding gene models and 7,049 long noncoding RNAs, including 49 protein-coding genes not present in the CanFam3.1 reference. The Great Dane assembly spans the majority of sequence gaps in the CanFam3.1 reference and illustrates that 2,151 gaps overlap the transcription start site of a predicted protein-coding gene. Moreover, a subset of the resolved gaps, which have an 80.95% median GC content, localize to transcription start sites and recombination hotspots more often than expected by chance, suggesting the stable canine recombinational landscape has shaped genome architecture. Alignment of the Great Dane and CanFam3.1 assemblies identified 16,834 deletions and 15,621 insertions, as well as 2,665 deletions and 3,493 insertions located on secondary contigs. These structural variants are dominated by retrotransposon insertion/deletion polymorphisms and include 16,221 dimorphic canine short interspersed elements (SINECs) and 1,121 dimorphic long interspersed element-1 sequences (LINE-1_Cfs). Analysis of sequences flanking the 3' end of LINE-1_Cfs (i.e., LINE-1_Cf 3'-transductions) suggests multiple retrotransposition-competent LINE-1_Cfs segregate among dog populations. Consistent with this conclusion, we demonstrate that a canine LINE-1_Cf element with intact open reading frames can retrotranspose its own RNA and that of a SINEC_Cf consensus sequence in cultured human cells, implicating ongoing retrotransposon activity as a driver of canine genetic variation.


Subject(s)
Dogs/genetics , GC Rich Sequence , Genome , Interspersed Repetitive Sequences , Animals , Dogs/classification , Long Interspersed Nucleotide Elements , Short Interspersed Nucleotide Elements , Species Specificity
17.
Genome Biol ; 22(1): 92, 2021 03 29.
Article in English | MEDLINE | ID: mdl-33781308

ABSTRACT

BACKGROUND: Post-zygotic mutations incurred during DNA replication, DNA repair, and other cellular processes lead to somatic mosaicism. Somatic mosaicism is an established cause of various diseases, including cancers. However, detecting mosaic variants in DNA from non-cancerous somatic tissues poses significant challenges, particularly if the variants only are present in a small fraction of cells. RESULTS: Here, the Brain Somatic Mosaicism Network conducts a coordinated, multi-institutional study to examine the ability of existing methods to detect simulated somatic single-nucleotide variants (SNVs) in DNA mixing experiments, generate multiple replicates of whole-genome sequencing data from the dorsolateral prefrontal cortex, other brain regions, dura mater, and dural fibroblasts of a single neurotypical individual, devise strategies to discover somatic SNVs, and apply various approaches to validate somatic SNVs. These efforts lead to the identification of 43 bona fide somatic SNVs that range in variant allele fractions from ~ 0.005 to ~ 0.28. Guided by these results, we devise best practices for calling mosaic SNVs from 250× whole-genome sequencing data in the accessible portion of the human genome that achieve 90% specificity and sensitivity. Finally, we demonstrate that analysis of multiple bulk DNA samples from a single individual allows the reconstruction of early developmental cell lineage trees. CONCLUSIONS: This study provides a unified set of best practices to detect somatic SNVs in non-cancerous tissues. The data and methods are freely available to the scientific community and should serve as a guide to assess the contributions of somatic SNVs to neuropsychiatric diseases.


Subject(s)
Brain/metabolism , Genetic Association Studies , Genetic Variation , Alleles , Chromosome Mapping , Computational Biology/methods , Genetic Association Studies/methods , Genomics/methods , Germ Cells/metabolism , High-Throughput Nucleotide Sequencing , Humans , Organ Specificity/genetics , Polymorphism, Single Nucleotide
18.
Genome Biol Evol ; 12(12): 2211-2230, 2020 12 06.
Article in English | MEDLINE | ID: mdl-32970804

ABSTRACT

Copy number variation (CNV) can promote phenotypic diversification and adaptive evolution. However, the genomic architecture of CNVs among Macaca species remains scarcely reported, and the roles of CNVs in adaptation and evolution of macaques have not been well addressed. Here, we identified and characterized 1,479 genome-wide hetero-specific CNVs across nine Macaca species with bioinformatic methods, along with 26 CNV-dense regions and dozens of lineage-specific CNVs. The genes intersecting CNVs were overrepresented in nutritional metabolism, xenobiotics/drug metabolism, and immune-related pathways. Population-level transcriptome data showed that nearly 46% of CNV genes were differentially expressed across populations and also mainly consisted of metabolic and immune-related genes, which implied the role of CNVs in environmental adaptation of Macaca. Several CNVs overlapping drug metabolism genes were verified with genomic quantitative polymerase chain reaction, suggesting that these macaques may have different drug metabolism features. The CNV-dense regions, including 15 first reported here, represent unstable genomic segments in macaques where biological innovation may evolve. Twelve gains and 40 losses specific to the Barbary macaque contain genes with essential roles in energy homeostasis and immunity defense, inferring the genetic basis of its unique distribution in North Africa. Our study not only elucidated the genetic diversity across Macaca species from the perspective of structural variation but also provided suggestive evidence for the role of CNVs in adaptation and genome evolution. Additionally, our findings provide new insights into the application of diverse macaques to drug study.


Subject(s)
Adaptation, Biological , Biological Evolution , DNA Copy Number Variations , Gene Duplication , Macaca/genetics , Animals
19.
Genes (Basel) ; 11(2)2020 01 29.
Article in English | MEDLINE | ID: mdl-32013076

ABSTRACT

Gene duplication is a major mechanism for the evolution of gene novelty, and copy-number variation makes a major contribution to inter-individual genetic diversity. However, most approaches for studying copy-number variation rely upon uniquely mapping reads to a genome reference and are unable to distinguish among duplicated sequences. Specialized approaches to interrogate specific paralogs are comparatively slow and have a high degree of computational complexity, limiting their effective application to emerging population-scale data sets. We present QuicK-mer2, a self-contained, mapping-free approach that enables the rapid construction of paralog-specific copy-number maps from short-read sequence data. This approach is based on the tabulation of unique k-mer sequences from short-read data sets, and is able to analyze a 20X coverage human genome in approximately 20 min. We applied our approach to newly released sequence data from the 1000 Genomes Project, constructed paralog-specific copy-number maps from 2457 unrelated individuals, and uncovered copy-number variation of paralogous genes. We identify nine genes where none of the analyzed samples have a copy number of two, 92 genes where the majority of samples have a copy number other than two, and describe rare copy number variation affecting multiple genes at the APOBEC3 locus.


Subject(s)
Computational Biology/methods , DNA Copy Number Variations , Sequence Analysis, DNA/methods , Algorithms , Evolution, Molecular , Gene Duplication , Genome, Human , Humans
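
The mapping-free idea described in the abstract above (tabulating unique k-mers in reads and turning their depth into copy number) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not QuicK-mer2 itself: all function names and toy sequences are hypothetical, the control region is assumed to be diploid, and reverse complements, GC correction, and sequencing errors are ignored.

```python
from collections import Counter


def count_kmers(reads, k, wanted):
    """Count occurrences in the reads of each k-mer from a predefined set."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in wanted:
                counts[kmer] += 1
    return counts


def mean_depth(counts, kmers):
    """Average count over a list of k-mers (missing k-mers count as zero)."""
    return sum(counts[kmer] for kmer in kmers) / len(kmers)


def copy_number(counts, region_kmers, control_kmers, control_copies=2):
    """Scale region depth by the depth of a control region of known copy number."""
    return control_copies * mean_depth(counts, region_kmers) / mean_depth(counts, control_kmers)


# Hypothetical toy data: k-mers unique to a control region and to a paralog.
control_kmers = ["ACGT", "CGTA"]
region_kmers = ["GGCC", "GCCT"]
toy_reads = ["ACGTA"] * 2 + ["GGCCT"] * 3

toy_counts = count_kmers(toy_reads, 4, set(control_kmers) | set(region_kmers))
print(copy_number(toy_counts, region_kmers, control_kmers))
```

Because only k-mers unique to one paralog are counted, the depth estimate is specific to that paralog even when the surrounding sequence is duplicated.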
20.
Nucleic Acids Res ; 48(6): e36, 2020 04 06.
Article in English | MEDLINE | ID: mdl-32067044

ABSTRACT

Alu retrotransposons account for more than 10% of the human genome, and insertions of these elements create structural variants segregating in human populations. Such polymorphic Alus are powerful markers to understand population structure, and they represent variants that can greatly impact genome function, including gene expression. Accurate genotyping of Alus and other mobile elements has been challenging. Indeed, we found that Alu genotypes previously called for the 1000 Genomes Project are sometimes erroneous, which poses significant problems for phasing these insertions with other variants that comprise the haplotype. To ameliorate this issue, we introduce a new pipeline - TypeTE - which genotypes Alu insertions from whole-genome sequencing data. Starting from a list of polymorphic Alus, TypeTE identifies the hallmarks (poly-A tail and target site duplication) and orientation of Alu insertions using local re-assembly to reconstruct presence and absence alleles. Genotype likelihoods are then computed after re-mapping sequencing reads to the reconstructed alleles. Using a high-quality set of PCR-based genotyping of >200 loci, we show that TypeTE improves genotype accuracy from 83% to 92% in the 1000 Genomes dataset. TypeTE can be readily adapted to other retrotransposon families and brings a valuable toolbox addition for population genomics.


Subject(s)
Interspersed Repetitive Sequences/genetics , Mutagenesis, Insertional/genetics , Software , Whole Genome Sequencing/methods , Databases, Genetic , Gene Frequency/genetics , Genetic Loci , Genetics, Population , Genome, Human , Genotype , Humans
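
The abstract above states that genotype likelihoods are computed after re-mapping reads to the reconstructed presence and absence alleles. A minimal sketch of one common way to do this, a binomial model with a fixed per-read error rate, is shown below. It is an assumption for illustration and not TypeTE's implementation; the function names and the error rate of 0.01 are hypothetical.

```python
from math import comb


def genotype_likelihoods(n_insertion, n_empty, error=0.01):
    """Likelihood of the read counts under 0, 1, or 2 copies of the insertion.

    n_insertion: reads supporting the presence (insertion) allele
    n_empty:     reads supporting the absence (empty site) allele
    """
    n = n_insertion + n_empty
    likelihoods = {}
    for copies in (0, 1, 2):
        frac = copies / 2
        # Probability that a single read supports the insertion allele,
        # allowing for reads assigned to the wrong allele.
        p = frac * (1 - error) + (1 - frac) * error
        likelihoods[copies] = comb(n, n_insertion) * p ** n_insertion * (1 - p) ** n_empty
    return likelihoods


def call_genotype(n_insertion, n_empty, error=0.01):
    """Return the insertion copy number (0, 1, or 2) with the highest likelihood."""
    likelihoods = genotype_likelihoods(n_insertion, n_empty, error)
    return max(likelihoods, key=likelihoods.get)


print(call_genotype(5, 5))   # balanced support for both alleles
print(call_genotype(10, 0))  # only insertion-supporting reads
print(call_genotype(0, 10))  # only empty-site reads
```

Reconstructing both alleles before re-mapping matters for a model like this, since reads can only be counted fairly if the presence and absence alleles are equally well represented in the mapping target.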