ABSTRACT
Tumor-infiltrating lymphocyte (TIL) therapy represents a groundbreaking advancement in the treatment of solid cancers, offering new hope to patients and their families with high response rates and long overall survival. TIL therapy involves extracting immune cells from a patient's tumor tissue, expanding them ex vivo, and infusing them back into the patient to target and eliminate cancer cells. This revolutionary approach harnesses the power of the immune system to combat cancers, ushering in a new era of T cell-based therapies along with CAR-T and TCR therapies. In this comprehensive review, we aim to elucidate the remarkable potential of TIL therapy by delving into recent advancements in basic and clinical research. We highlight the evolving landscape of TIL therapy as a prominent immunotherapeutic strategy, its multifaceted applications, and its promising outcomes. Additionally, we explore the future horizons of TIL therapy, including next-generation TILs and combination therapy, to overcome the limitations and improve the clinical efficacy of TIL therapy.
Subject(s)
Immunotherapy, Adoptive , Lymphocytes, Tumor-Infiltrating , Neoplasms , Humans , Lymphocytes, Tumor-Infiltrating/immunology , Neoplasms/therapy , Neoplasms/immunology , Immunotherapy, Adoptive/methods , Animals , Combined Modality Therapy/methods
ABSTRACT
Large-scale, population-based genomic studies have provided a context for modern medical genetics. Among such studies, however, African populations have remained relatively underrepresented. The breadth of genetic diversity across the African continent argues for an exploration of local genomic context to facilitate burgeoning disease mapping studies in Africa. We sought to characterize genetic variation and to assess population substructure within a cohort of HIV-positive children from Botswana, a Southern African country that is regionally underrepresented in genomic databases. Using whole-exome sequencing data from 164 Batswana and comparisons with 150 similarly sequenced HIV-positive Ugandan children, we found that 13%-25% of variation observed among Batswana was not captured by public databases. Uncaptured variants were significantly enriched (p = 2.2 × 10^-16) for coding variants with minor allele frequencies between 1% and 5% and included predicted-damaging non-synonymous variants. Among variants found in public databases, corresponding allele frequencies varied widely, with Botswana having significantly higher allele frequencies among rare (<1%) pathogenic and damaging variants. Batswana clustered with other Southern African populations, but distinctly from 1000 Genomes African populations, and had limited evidence for admixture with extra-continental ancestries. We also observed a surprising lack of genetic substructure in Botswana, despite multiple tribal ethnicities and language groups, alongside a higher degree of relatedness than purported founder populations from the 1000 Genomes project. Our observations reveal a complex, but distinct, ancestral history and genomic architecture among Batswana and suggest that disease mapping within similar Southern African populations will require a deeper repository of genetic variation and allelic dependencies than presently exists.
Subject(s)
Black People/genetics , Exome Sequencing , Genetic Variation , Botswana , Cohort Studies , Gene Pool , Genetics, Population , Genome, Human , Geography , Humans , Phylogeny , Principal Component Analysis
ABSTRACT
Tourette syndrome (TS) is a childhood-onset neuropsychiatric disorder characterized by repetitive motor movements and vocal tics. The clinical manifestations of TS are complex and often overlap with other neuropsychiatric disorders. TS is highly heritable; however, the underlying genetic basis and molecular and neuronal mechanisms of TS remain largely unknown. We performed whole-exome sequencing of a hundred trios (probands and their parents) with detailed records of their clinical presentations and identified a risk gene, ASH1L, that was both de novo mutated and associated with TS based on a transmission disequilibrium test. As a replication, we performed follow-up targeted sequencing of ASH1L in an additional 524 unrelated TS samples and replicated the association (P value = 0.001). The point mutations in ASH1L cause defects in its enzymatic activity. Therefore, we established a transgenic mouse line and performed an array of anatomical, behavioral, and functional assays to investigate ASH1L function. The Ash1l+/- mice manifested tic-like behaviors and compulsive behaviors that could be rescued by the tic-relieving drug haloperidol. We also found that Ash1l disruption leads to hyper-activation and elevated dopamine-releasing events in the dorsal striatum, all of which could explain the neural mechanisms for the behavioral abnormalities in mice. Taken together, our results provide compelling evidence that ASH1L is a TS risk gene.
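The transmission disequilibrium test named in this abstract can be sketched in its textbook McNemar-type form, which compares how often heterozygous parents transmit versus do not transmit a candidate allele to affected offspring. The function name and the counts below are hypothetical and are not taken from the study; the abstract does not say which TDT variant was used.

```python
from math import erfc, sqrt


def tdt(transmitted: int, untransmitted: int) -> tuple[float, float]:
    """Return the TDT chi-square statistic (1 df) and its p-value.

    transmitted / untransmitted: number of heterozygous parents who did /
    did not pass the candidate allele to an affected child.
    """
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("no informative transmissions")
    chi2 = (transmitted - untransmitted) ** 2 / total
    # For a chi-square variable with 1 degree of freedom,
    # P(X > chi2) equals erfc(sqrt(chi2 / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts, for illustration only.
chi2, p = tdt(transmitted=30, untransmitted=10)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```

Under the null hypothesis of no linkage or association, the allele is transmitted half of the time, so a large imbalance between the two counts yields a large statistic.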
Subject(s)
DNA-Binding Proteins/genetics , Histone-Lysine N-Methyltransferase/genetics , Tourette Syndrome/genetics , Adolescent , Adult , Animals , Child , Child, Preschool , China , DNA-Binding Proteins/metabolism , Family , Female , Genetic Predisposition to Disease/genetics , Histone-Lysine N-Methyltransferase/metabolism , Humans , Male , Mice , Mice, Transgenic , Middle Aged , Mutation/genetics , Parents , Tic Disorders/genetics , Tourette Syndrome/complications , Transcription Factors/genetics , Exome Sequencing/methods
ABSTRACT
Since its discovery in Hubei, China, in December 2019, coronavirus disease 2019 (COVID-19) has persisted for more than one year, and the numbers of new confirmed cases and confirmed deaths remain high. COVID-19 is an infectious disease caused by SARS-CoV-2. Although RT-PCR is considered the gold standard for detection of COVID-19, CT plays an important role in diagnosis and in evaluating the therapeutic effect of treatment. Diagnosis and localization of COVID-19 on CT images using deep learning can provide quantitative auxiliary information for doctors. This article proposes a novel network with a multi-receptive field attention module to diagnose COVID-19 on CT images. This attention module includes three parts: a pyramid convolution module (PCM), a multi-receptive field spatial attention block (SAB), and a multi-receptive field channel attention block (CAB). The PCM can improve the diagnostic ability of the network for lesions of different sizes and shapes. The role of the SAB and CAB is to focus the features extracted by the network on the lesion area to improve COVID-19 discrimination and localization. We verify the effectiveness of the proposed method on two datasets. The proposed network achieves an accuracy of 97.12%, a specificity of 96.89%, and a sensitivity of 97.21% on the DTDB dataset provided by the Beijing Ditan Hospital Capital Medical University. Compared with other state-of-the-art attention modules, the proposed method achieves better results. On the public COVID-19 SARS-CoV-2 dataset, it obtains 95.16% accuracy, a 95.6% F1-score, and 99.01% AUC. The proposed network can effectively assist doctors in the diagnosis of COVID-19 on CT images.
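The accuracy, sensitivity, specificity, and F1-score reported in this abstract all derive from a binary confusion matrix. A minimal sketch of those definitions follows; the function name and the counts are hypothetical and are not the study's data.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Compute standard binary-classification metrics from confusion counts."""
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)     # positive predictive value
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=90, fp=20, tn=80, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

AUC is not included because it requires the ranked prediction scores rather than a single confusion matrix.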
ABSTRACT
Whole-genome sequencing (WGS) allows for a comprehensive view of the sequence of the human genome. We present and apply integrated methodologic steps for interrogating WGS data to characterize the genetic architecture of 10 heart- and blood-related traits in a sample of 1,860 African Americans. In order to evaluate the contribution of regulatory and non-protein coding regions of the genome, we conducted aggregate tests of rare variation across the entire genomic landscape using a sliding window, complemented by an annotation-based assessment of the genome using predefined regulatory elements and within the first intron of all genes. These tests were performed treating all variants equally as well as with individual variants weighted by a measure of predicted functional consequence. Significant findings were assessed in 1,705 individuals of European ancestry. After these steps, we identified and replicated components of the genomic landscape significantly associated with heart- and blood-related traits. For two traits, lipoprotein(a) levels and neutrophil count, aggregate tests of low-frequency and rare variation were significantly associated across multiple motifs. For a third trait, cardiac troponin T, investigation of regulatory domains identified a locus on chromosome 9. These practical approaches for WGS analysis led to the identification of informative genomic regions and also showed that defined non-coding regions, such as first introns of genes and regulatory domains, are associated with important risk factor phenotypes. This study illustrates the tractable nature of WGS data and outlines an approach for characterizing the genetic architecture of complex traits.
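The sliding-window step described in this abstract can be sketched as follows: variant positions on one chromosome are grouped into overlapping fixed-width windows, and each non-empty window is then handed to an aggregate rare-variant test. The window width, step size, and function name are assumptions chosen for illustration; the abstract does not state the values used, and the statistical test itself is not implemented here.

```python
from bisect import bisect_left


def sliding_windows(positions: list[int], window_size: int, step: int):
    """Group variant positions into overlapping windows [start, start + window_size).

    Returns a list of (start, end, positions_in_window) for non-empty windows.
    """
    ordered = sorted(positions)
    windows = []
    if not ordered:
        return windows
    start, last = ordered[0], ordered[-1]
    while start <= last:
        end = start + window_size
        lo = bisect_left(ordered, start)   # first position >= start
        hi = bisect_left(ordered, end)     # first position >= end (excluded)
        if hi > lo:
            windows.append((start, end, ordered[lo:hi]))
        start += step
    return windows


# Hypothetical variant coordinates on a single chromosome.
windows = sliding_windows([100, 150, 400, 420, 900], window_size=200, step=100)
for start, end, members in windows:
    print(start, end, members)
```

Because the step is smaller than the window width, neighbouring windows share variants, so any per-window p-values are correlated and need a multiple-testing correction that accounts for this.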
Subject(s)
Black or African American/genetics , Genome-Wide Association Study , Lipoprotein(a)/genetics , Troponin T/genetics , C-Reactive Protein/metabolism , Cholesterol, HDL/blood , Cholesterol, LDL/blood , Chromosomes, Human, Pair 9/genetics , Gene Frequency , Genome, Human , Genomics , Hemoglobins/metabolism , Humans , Introns , Leukocyte Count , Lipoprotein(a)/blood , Magnesium/blood , Natriuretic Peptide, Brain/blood , Natriuretic Peptide, Brain/genetics , Neutrophils/cytology , Peptide Fragments/blood , Peptide Fragments/genetics , Phosphorus/blood , Platelet Count , Polymorphism, Single Nucleotide , Quantitative Trait Loci , Troponin T/blood , White People/genetics
ABSTRACT
Rhesus macaques (Macaca mulatta) are the most widely used nonhuman primate in biomedical research, have the largest natural geographic distribution of any nonhuman primate, and have been the focus of much evolutionary and behavioral investigation. Consequently, rhesus macaques are one of the most thoroughly studied nonhuman primate species. However, little is known about genome-wide genetic variation in this species. A detailed understanding of extant genomic variation among rhesus macaques has implications for the use of this species as a model for studies of human health and disease, as well as for evolutionary population genomics. Whole-genome sequencing analysis of 133 rhesus macaques revealed more than 43.7 million single-nucleotide variants, including thousands predicted to alter protein sequences, transcript splicing, and transcription factor binding sites. Rhesus macaques exhibit 2.5-fold higher overall nucleotide diversity and slightly elevated putative functional variation compared with humans. This functional variation in macaques provides opportunities for analyses of coding and noncoding variation, and its cellular consequences. Despite modestly higher levels of nonsynonymous variation in the macaques, the estimated distribution of fitness effects and the ratio of nonsynonymous to synonymous variants suggest that purifying selection has had stronger effects in rhesus macaques than in humans. Demographic reconstructions indicate this species has experienced a consistently large but fluctuating population size. Overall, the results presented here provide new insights into the population genomics of nonhuman primates and expand genomic information directly relevant to primate models of human disease.
Subject(s)
High-Throughput Nucleotide Sequencing/methods , Macaca mulatta/genetics , Whole Genome Sequencing/methods , Animals , Evolution, Molecular , Female , Genetic Fitness , Macaca mulatta/classification , Models, Animal , Polymorphism, Single Nucleotide , Population Density
ABSTRACT
PURPOSE: To assess the clinical performance of an expanded noninvasive prenatal screening (NIPS) test ("NIPS-Plus") for detection of both aneuploidy and genome-wide microdeletion/microduplication syndromes (MMS). METHODS: A total of 94,085 women with a singleton pregnancy were prospectively enrolled in the study. The cell-free plasma DNA was directly sequenced without intermediate amplification and fetal abnormalities identified using an improved copy-number variation (CNV) calling algorithm. RESULTS: A total of 1128 pregnancies (1.2%) were scored positive for clinically significant fetal chromosome abnormalities. This comprised 965 aneuploidies (1.026%) and 163 (0.174%) MMS. From follow-up tests, the positive predictive values (PPVs) for T21, T18, T13, rare trisomies, and sex chromosome aneuploidies were calculated as 95%, 82%, 46%, 29%, and 47%, respectively. For known MMS (n = 32), PPVs were 93% (DiGeorge), 68% (22q11.22 microduplication), 75% (Prader-Willi/Angelman), and 50% (Cri du Chat). For the remaining genome-wide MMS (n = 88), combined PPVs were 32% (CNVs ≥10 Mb) and 19% (CNVs <10 Mb). CONCLUSION: NIPS-Plus yielded high PPVs for common aneuploidies and DiGeorge syndrome, and moderate PPVs for other MMS. Our results present compelling evidence that NIPS-Plus can be used as a first-tier pregnancy screening method to improve detection rates of clinically significant fetal chromosome abnormalities.
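A positive predictive value of the kind reported here is the fraction of screen-positive cases that follow-up testing confirms. The sketch below computes that fraction together with a Wilson score interval, which is one common choice for proportions from small samples; the abstract does not report intervals or say how they would be computed, and the counts and function name are hypothetical.

```python
from math import sqrt


def ppv_with_wilson_ci(true_pos: int, false_pos: int, z: float = 1.96):
    """Return (ppv, lower, upper) with a Wilson score interval.

    true_pos / false_pos: screen-positive cases confirmed / not confirmed
    by follow-up diagnostic testing. z = 1.96 gives roughly 95% coverage.
    """
    n = true_pos + false_pos
    if n == 0:
        raise ValueError("no screen-positive cases with follow-up")
    p = true_pos / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, center - half_width, center + half_width


# Hypothetical counts, for illustration only.
ppv, low, high = ppv_with_wilson_ci(true_pos=19, false_pos=1)
print(f"PPV = {ppv:.2%} (approx. 95% CI {low:.2%} to {high:.2%})")
```

The interval is wide for small categories, which is worth keeping in mind when comparing PPVs across conditions with very different case counts.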
Subject(s)
Cell-Free Nucleic Acids/genetics , Chromosome Aberrations , Chromosome Disorders/diagnosis , Noninvasive Prenatal Testing/methods , Adolescent , Adult , Aneuploidy , Chromosome Disorders/genetics , Chromosome Disorders/pathology , DNA Copy Number Variations/genetics , Female , Humans , Karyotyping , Middle Aged , Pregnancy , Prenatal Diagnosis , Risk Factors , Sex Chromosome Aberrations , Trisomy/genetics , Young Adult
ABSTRACT
BACKGROUND: Next-generation sequencing is emerging as a viable alternative to chromosome microarray analysis for the diagnosis of chromosome disease syndromes. One next-generation sequencing methodology, copy number variation sequencing, has been shown to deliver high reliability, accuracy, and reproducibility for detection of fetal copy number variations in prenatal samples. However, its clinical utility as a first-tier diagnostic method has yet to be demonstrated in a large cohort of pregnant women referred for fetal chromosome testing. OBJECTIVE: We sought to evaluate copy number variation sequencing as a first-tier diagnostic method for detection of fetal chromosome anomalies in a general population of pregnant women with high-risk prenatal indications. STUDY DESIGN: This was a prospective analysis of 3429 pregnant women referred for amniocentesis and fetal chromosome testing for different risk indications, including advanced maternal age, high-risk maternal serum screening, and positivity for an ultrasound soft marker. Amniocentesis was performed by standard procedures. Amniocyte DNA was analyzed by copy number variation sequencing with a chromosome resolution of 0.1 Mb. Fetal chromosome anomalies including whole chromosome aneuploidy and segmental imbalances were independently confirmed by gold standard cytogenetic and molecular methods and their pathogenicity determined following guidelines of the American College of Medical Genetics for sequence variants. RESULTS: Clear interpretable copy number variation sequencing results were obtained for all 3429 amniocentesis samples. Copy number variation sequencing identified 3293 samples (96%) with a normal molecular karyotype and 136 samples (4%) with an altered molecular karyotype. 
A total of 146 fetal chromosome anomalies were detected, comprising 46 whole chromosome aneuploidies (pathogenic), 29 submicroscopic microdeletions/microduplications with known or suspected associations with chromosome disease syndromes (pathogenic), 22 other microdeletions/microduplications (likely pathogenic), and 49 variants of uncertain significance. Overall, the cumulative frequency of pathogenic/likely pathogenic and variants of uncertain significance chromosome anomalies in the patient cohort was 2.83% and 1.43%, respectively. In the 3 high-risk advanced maternal age, high-risk maternal serum screening, and ultrasound soft marker groups, the most common whole chromosome aneuploidy detected was trisomy 21, followed by sex chromosome aneuploidies, trisomy 18, and trisomy 13. Across all clinical indications, there was a similar incidence of submicroscopic copy number variations, with approximately equal proportions of pathogenic/likely pathogenic and variants of uncertain significance copy number variations. If karyotyping had been used as an alternate cytogenetics detection method, copy number variation sequencing would have returned a 1% higher yield of pathogenic or likely pathogenic copy number variations. CONCLUSION: In a large prospective clinical study, copy number variation sequencing delivered high reliability and accuracy for identifying clinically significant fetal anomalies in prenatal samples. Based on key performance criteria, copy number variation sequencing appears to be a well-suited methodology for first-tier diagnosis of pregnant women in the general population at risk of having a suspected fetal chromosome abnormality.
Subject(s)
Chromosome Disorders/diagnosis , DNA Copy Number Variations/genetics , Adult , Amniocentesis , Aneuploidy , China , Chromosome Aberrations , Chromosome Disorders/genetics , Down Syndrome/diagnosis , Female , High-Throughput Nucleotide Sequencing , Humans , In Situ Hybridization, Fluorescence , Karyotyping , Microarray Analysis , Pregnancy , Prenatal Diagnosis , Prospective Studies , Sequence Analysis, DNA , Sex Chromosome Aberrations , Trisomy 13 Syndrome/diagnosis , Trisomy 18 Syndrome/diagnosis
ABSTRACT
Detailed characterization of chromosomal abnormalities, a common cause of congenital abnormalities and pregnancy loss, is critical for identifying genes involved in human fetal development. Here, 2,186 product-of-conception samples were tested for copy-number variations (CNVs) at two clinical diagnostic centers using whole-genome sequencing and high-resolution chromosomal microarray analysis. We developed a new gene discovery approach to predict potential developmental genes and identified 275 candidate genes from CNVs detected in both datasets. Based on Mouse Genome Informatics (MGI) and the Zebrafish model organism database (ZFIN), 75% of identified genes could lead to developmental defects when mutated. Genes involved in embryonic development, gene transcription, and regulation of biological processes were significantly enriched. In particular, transcription factors and gene families sharing specific protein domains predominated, which included known developmental genes such as HOX, NKX homeodomain genes, and helix-loop-helix containing HAND2, NEUROG2, and NEUROD1 as well as potential novel developmental genes. We observed that developmental genes were denser in certain chromosomal regions, enabling identification of 31 potential genomic loci with clustered genes associated with development.
Subject(s)
Chromosome Aberrations , Chromosome Disorders/genetics , Embryonic Development/genetics , Transcription Factors/genetics , Animals , Chromosome Disorders/pathology , DNA Copy Number Variations/genetics , Female , Genome, Human , Humans , Mice , Microarray Analysis , Pregnancy , Zebrafish/genetics
ABSTRACT
BACKGROUND: The cost of Whole Genome Sequencing (WGS) has decreased tremendously in recent years due to advances in next-generation sequencing technologies. Nevertheless, the cost of carrying out large-scale cohort studies using WGS is still daunting. Past simulation studies with coverage at ~2x have shown promise for using low coverage WGS in studies focused on variant discovery, association study replications, and population genomics characterization. However, the performance of low coverage WGS in populations with a complex history and no reference panel remains to be determined. RESULTS: South Indian populations are known to have a complex population structure and are an example of a major population group that lacks adequate reference panels. To test the performance of extremely low-coverage WGS (EXL-WGS) in populations with a complex history and to provide a reference resource for South Indian populations, we performed EXL-WGS on 185 South Indian individuals from eight populations to ~1.6x coverage. Using two variant discovery pipelines, SNPTools and GATK, we generated a consensus call set that has ~90% sensitivity for identifying common variants (minor allele frequency ≥ 10%). Imputation further improves the sensitivity of our call set. In addition, we obtained high-coverage for the whole mitochondrial genome to infer the maternal lineage evolutionary history of the Indian samples. CONCLUSIONS: Overall, we demonstrate that EXL-WGS with imputation can be a valuable study design for variant discovery with a dramatically lower cost than standard WGS, even in populations with a complex history and without available reference data. In addition, the South Indian EXL-WGS data generated in this study will provide a valuable resource for future Indian genomic studies.
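A consensus call set and its sensitivity can be illustrated with plain set operations. The sketch below assumes, purely for illustration, that the consensus is the intersection of the two pipelines' calls and that a truth set of known variants is available; the abstract does not state how the consensus was built, and every variant shown is made up.

```python
# A variant is identified by (chromosome, position, ref allele, alt allele).
Variant = tuple[str, int, str, str]


def consensus_calls(calls_a: set[Variant], calls_b: set[Variant]) -> set[Variant]:
    """Keep only variants reported by both pipelines (assumed consensus rule)."""
    return calls_a & calls_b


def sensitivity(calls: set[Variant], truth: set[Variant]) -> float:
    """Fraction of truth-set variants recovered by a call set."""
    if not truth:
        raise ValueError("truth set is empty")
    return len(calls & truth) / len(truth)


# Hypothetical variants, for illustration only.
truth = {("1", 100, "A", "G"), ("1", 200, "C", "T"),
         ("1", 300, "G", "A"), ("1", 400, "T", "C")}
pipeline_a = {("1", 100, "A", "G"), ("1", 200, "C", "T"),
              ("1", 300, "G", "A"), ("1", 555, "A", "C")}
pipeline_b = {("1", 100, "A", "G"), ("1", 200, "C", "T"),
              ("1", 400, "T", "C"), ("1", 777, "G", "T")}

consensus = consensus_calls(pipeline_a, pipeline_b)
print(sensitivity(pipeline_a, truth), sensitivity(consensus, truth))
```

In this toy case the intersection removes both pipeline-specific false calls at the cost of sensitivity, which is the trade-off that a later imputation step can help recover.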
Subject(s)
Asian People/genetics , Metagenomics , Whole Genome Sequencing , Genetic Variation , Genome, Mitochondrial/genetics , Humans
ABSTRACT
The radiation force of a high-energy laser caused by reflection at the input surface of a mounted KH2PO4 (KDP) crystal is studied, along with its effects on the second-harmonic generation (SHG) efficiency of the laser beam. A comprehensive model incorporating principles of momentum transfer, mechanics, and optics is proposed. Using this model, the mechanical stress within the KDP crystal caused by the radiation force and the resulting change in SHG efficiency are studied in turn. Moreover, the effects of the laser beam intensity on the radiation force, the stress, and the SHG efficiency are determined. The results demonstrate that a high-energy laser beam produces a macroscopic radiation force, which in turn degrades SHG efficiency.
ABSTRACT
BACKGROUND: The decreasing costs of sequencing are driving the need for cost-effective and real-time variant calling of whole genome sequencing data. The scale of these projects is far beyond the capacity of the computing resources typically available in most research labs. Other infrastructures, such as the AWS cloud environment and supercomputers, also have limitations that make large-scale joint variant calling infeasible, and infrastructure-specific variant calling strategies either fail to scale up to large datasets or abandon joint calling. RESULTS: We present a high-throughput framework including multiple variant callers for single nucleotide variant (SNV) calling, which leverages hybrid computing infrastructure consisting of the AWS cloud, supercomputers, and local high-performance computing infrastructures. We present a novel binning approach for large-scale joint variant calling and imputation which can scale up to over 10,000 samples while producing SNV callsets with high sensitivity and specificity. As a proof of principle, we present results of analysis on the Cohorts for Heart And Aging Research in Genomic Epidemiology (CHARGE) WGS freeze 3 dataset, in which joint calling, imputation, and phasing of over 5300 whole genome samples were completed in under 6 weeks using four state-of-the-art callers. The callers used were SNPTools, GATK-HaplotypeCaller, GATK-UnifiedGenotyper and GotCloud. We used Amazon AWS, a 4000-core in-house cluster at Baylor College of Medicine, IBM power PC Blue BioU at Rice and Rhea at Oak Ridge National Laboratory (ORNL) for the computation. AWS was used for joint calling of 180 TB of BAM files, and ORNL and Rice supercomputers were used for the imputation and phasing step. All other steps were carried out on the local compute cluster. The entire operation used 5.2 million core hours and only transferred a total of 6 TB of data across the platforms. 
CONCLUSIONS: Even with increasing sizes of whole genome datasets, ensemble joint calling of SNVs for low coverage data can be accomplished in a scalable, cost effective and fast manner by using heterogeneous computing platforms without compromising on the quality of variants.
Subject(s)
Genome, Human , Genomics/methods , High-Throughput Nucleotide Sequencing/methods , Databases, Genetic , Humans
ABSTRACT
Understanding the evolution of disease-associated mutations is fundamental to analyzing the pathogenesis of diseases. Mutation, recombination (by GC-biased gene conversion, gBGC), and selection are known to shape the evolution of disease-associated mutations, but how these evolutionary forces work together is still an open question. In this study, we analyzed several large-scale human datasets (1000 Genomes, ESP6500, ExAC and ClinVar), and found that base-biased mutagenesis generates more GC→AT than AT→GC mutations, while gBGC promotes the fixation of AT→GC mutations to balance the impact of base-biased mutation on the genome. Due to this effect of gBGC, purifying selection removes more deleterious AT→GC mutations than GC→AT mutations from the population, but many high-frequency (fixed and nearly fixed) deleterious AT→GC mutations remain, possibly due to high genetic load. As a special subset, disease-associated mutations follow this evolutionary rule: disease-associated GC→AT mutations are more enriched among rare mutations than AT→GC mutations, while disease-associated AT→GC mutations are more enriched among high-frequency mutations. Thus, we present a base-biased evolutionary framework that explains the base-biased generation and accumulation of disease-associated mutations in human populations.
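The two mutation classes contrasted in this abstract can be told apart from the reference and alternate bases alone: a change from G or C to A or T is a GC-to-AT mutation, and the reverse is an AT-to-GC mutation. A minimal sketch follows, with hypothetical function names and made-up input, and with ASCII labels in place of arrows.

```python
from collections import Counter

STRONG = {"G", "C"}  # bases forming three hydrogen bonds
WEAK = {"A", "T"}    # bases forming two hydrogen bonds


def mutation_direction(ref: str, alt: str) -> str:
    """Classify a single-nucleotide change by its effect on GC content."""
    ref, alt = ref.upper(), alt.upper()
    if ref in STRONG and alt in WEAK:
        return "GC>AT"
    if ref in WEAK and alt in STRONG:
        return "AT>GC"
    return "other"  # GC-neutral changes such as A>T or C>G


# Hypothetical (ref, alt) pairs, for illustration only.
snvs = [("G", "A"), ("C", "T"), ("A", "G"), ("A", "T"), ("C", "G")]
counts = Counter(mutation_direction(ref, alt) for ref, alt in snvs)
print(counts)
```

Tallying these classes separately within allele-frequency bins is one simple way to look for the enrichment pattern the abstract describes.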
Subject(s)
Genetic Predisposition to Disease , Mutation , Base Composition , Databases, Genetic , Evolution, Molecular , Gene Conversion , Genome, Human , Humans , Models, Genetic , Recombination, Genetic , Selection, Genetic
ABSTRACT
As the amount of human genomic sequence available from personal genomes and exomes has increased, so too has the observation of genomic positions having two or more alternative alleles, so-called multiallelic sites. For portions of the haploid genome that are present in more than one copy, including segmental duplications, variation at such multisite variant positions becomes even more complex. Despite the frequency of multiallelic variants, a number of commonly used resources and tools in genomic research and diagnostics either do not support multiallelic variants at all or require special modifications. Here, we explore the frequency of multiallelic sites in large samples with whole exome sequencing and discuss potential outcomes of failing to account for multiple variant alleles. We also briefly discuss some commonly utilized resources that fully support multiallelic sites.
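One common workaround for tools that expect a single alternate allele per site is to split each multiallelic record into one biallelic record per alternate allele. The sketch below shows that idea on a simplified, hypothetical record layout (a dictionary with `chrom`, `pos`, `ref`, and `alts`); it does not rewrite genotypes or left-align and trim alleles, which dedicated tools such as `bcftools norm` also handle.

```python
def split_multiallelic(record: dict) -> list[dict]:
    """Expand a site with several alternate alleles into biallelic records."""
    return [
        {"chrom": record["chrom"], "pos": record["pos"],
         "ref": record["ref"], "alt": alt}
        for alt in record["alts"]
    ]


# Hypothetical site with two alternate alleles.
site = {"chrom": "7", "pos": 117559590, "ref": "A", "alts": ["G", "T"]}
biallelic = split_multiallelic(site)
for rec in biallelic:
    print(rec)

# A tool that reads only the first alternate allele would silently drop "T".
first_alt_only = site["alts"][0]
```

Splitting keeps every allele visible to downstream tools, but per-allele frequencies then have to be interpreted with the sibling records in mind, since the alleles share one site.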
Subject(s)
Alleles , Exome/genetics , Genome, Human/genetics , Humans
ABSTRACT
Next-generation sequencing is a powerful approach for discovering genetic variation. Sensitive variant calling and haplotype inference from population sequencing data remain challenging. We describe methods for high-quality discovery, genotyping, and phasing of SNPs for low-coverage (approximately 5×) sequencing of populations, implemented in a pipeline called SNPTools. Our pipeline contains several innovations that specifically address challenges caused by low-coverage population sequencing: (1) effective base depth (EBD), a nonparametric statistic that enables more accurate statistical modeling of sequencing data; (2) variance ratio scoring, a variance-based statistic that discovers polymorphic loci with high sensitivity and specificity; and (3) BAM-specific binomial mixture modeling (BBMM), a clustering algorithm that generates robust genotype likelihoods from heterogeneous sequencing data. Last, we develop an imputation engine that refines raw genotype likelihoods to produce high-quality phased genotypes/haplotypes. Designed for large population studies, SNPTools' input/output (I/O) and storage aware design leads to improved computing performance on large sequencing data sets. We apply SNPTools to the International 1000 Genomes Project (1000G) Phase 1 low-coverage data set and obtain genotyping accuracy comparable to that of SNP microarray.
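To show why genotype likelihoods matter at low coverage, the sketch below uses a plain binomial read-count model with a fixed error rate. This is not SNPTools' BBMM, which the abstract describes as a BAM-specific mixture model; the function name, the error rate, and the read counts are assumptions for illustration.

```python
from math import comb


def genotype_likelihoods(alt_reads: int, total_reads: int,
                         error_rate: float = 0.01) -> dict[int, float]:
    """Likelihood of the read counts under 0, 1, or 2 copies of the alt allele.

    Assumes each read independently shows the alt allele with probability
    error_rate (hom-ref), 0.5 (het), or 1 - error_rate (hom-alt).
    """
    alt_prob = {0: error_rate, 1: 0.5, 2: 1.0 - error_rate}
    ref_reads = total_reads - alt_reads
    return {
        genotype: comb(total_reads, alt_reads) * p ** alt_reads * (1 - p) ** ref_reads
        for genotype, p in alt_prob.items()
    }


# Hypothetical site at about 5x coverage: one alt read out of five.
likelihoods = genotype_likelihoods(alt_reads=1, total_reads=5)
best = max(likelihoods, key=likelihoods.get)
print(likelihoods, "most likely genotype:", best)
```

With so few reads, the heterozygous call is only a few times more likely than homozygous reference, which is why pipelines for low-coverage data refine such raw likelihoods with information from other samples through imputation.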
Subject(s)
Genotype , Haplotypes , High-Throughput Nucleotide Sequencing/methods , Polymorphism, Single Nucleotide/genetics , Algorithms , Base Sequence , Human Genome Project , Humans
ABSTRACT
Despite great progress in identifying genetic variants that influence human disease, most inherited risk remains unexplained. A more complete understanding requires genome-wide studies that fully examine less common alleles in populations with a wide range of ancestry. To inform the design and interpretation of such studies, we genotyped 1.6 million common single nucleotide polymorphisms (SNPs) in 1,184 reference individuals from 11 global populations, and sequenced ten 100-kilobase regions in 692 of these individuals. This integrated data set of common and rare alleles, called 'HapMap 3', includes both SNPs and copy number polymorphisms (CNPs). We characterized population-specific differences among low-frequency variants, measured the improvement in imputation accuracy afforded by the larger reference panel, especially in imputing SNPs with a minor allele frequency of
Subject(s)
DNA Copy Number Variations , Genome, Human , Polymorphism, Single Nucleotide , Population Groups/genetics , Human Genome Project , Humans
ABSTRACT
BACKGROUND: Identifying insertion/deletion polymorphisms (INDELs) with high confidence has been intrinsically challenging in short-read sequencing data. Here we report our approach for improving INDEL calling accuracy by using a machine learning algorithm to combine call sets generated with three independent methods, and by leveraging the strengths of each individual pipeline. Utilizing this approach, we generated a consensus exome INDEL call set from a large dataset generated by the 1000 Genomes Project (1000G), maximizing both the sensitivity and the specificity of the calls. RESULTS: This consensus exome INDEL call set features 7,210 INDELs, from 1,128 individuals across 13 populations included in the 1000 Genomes Phase 1 dataset, with a false discovery rate (FDR) of about 7.0%. CONCLUSIONS: In our study we further characterize the patterns and distributions of these exonic INDELs with respect to density, allele length, and site frequency spectrum, as well as the potential mutagenic mechanisms of coding INDELs in humans.
Subject(s)
Exome/genetics , INDEL Mutation/genetics , Mutagenesis , Computational Biology , Genome, Human , High-Throughput Nucleotide Sequencing , Human Genome Project , Humans , Machine Learning
ABSTRACT
BACKGROUND: Generation of long (>5 Kb) DNA sequencing reads provides an approach for interrogation of complex regions in the human genome. Currently, large-insert whole genome sequencing (WGS) technologies from Pacific Biosciences (PacBio) enable analysis of chromosomal structural variations (SVs), but the cost to achieve the required sequence coverage across the entire human genome is high. RESULTS: We developed a method (termed PacBio-LITS) that combines oligonucleotide-based DNA target-capture enrichment technologies with PacBio large-insert library preparation to facilitate SV studies at specific chromosomal regions. PacBio-LITS provides deep sequence coverage at the specified sites at substantially reduced cost compared with PacBio WGS. The efficacy of PacBio-LITS is illustrated by delineating the breakpoint junctions of low copy repeat (LCR)-associated complex structural rearrangements on chr17p11.2 in patients diagnosed with Potocki-Lupski syndrome (PTLS; MIM#610883). We successfully identified previously determined breakpoint junctions in three PTLS cases, and also were able to discover novel junctions in repetitive sequences, including LCR-mediated breakpoints. The new information has enabled us to propose mechanisms for formation of these structural variants. CONCLUSIONS: The new method leverages the cost efficiency of targeted capture-sequencing as well as the mappability and scaffolding capabilities of long sequencing reads generated by the PacBio platform. It is therefore suitable for studying complex SVs, especially those involving LCRs, inversions, and the generation of chimeric Alu elements at the breakpoints. Other genomic research applications, such as haplotype phasing and small insertion and deletion validation could also benefit from this technology.
Subject(s)
Genomics/methods , Chromosome Aberrations , Gene Library , Gene Rearrangement , Genetic Association Studies/methods , High-Throughput Nucleotide Sequencing/methods , Humans , Workflow
ABSTRACT
BACKGROUND: Massively parallel DNA sequencing generates staggering amounts of data. Decreasing cost, increasing throughput, and improved annotation have expanded the diversity of genomics applications in research and clinical practice. This expanding scale creates analytical challenges: accommodating peak compute demand, coordinating secure access for multiple analysts, and sharing validated tools and results. RESULTS: To address these challenges, we have developed the Mercury analysis pipeline and deployed it in local hardware and the Amazon Web Services cloud via the DNAnexus platform. Mercury is an automated, flexible, and extensible analysis workflow that provides accurate and reproducible genomic results at scales ranging from individuals to large cohorts. CONCLUSIONS: By taking advantage of cloud computing and with Mercury implemented on the DNAnexus platform, we have demonstrated a powerful combination of a robust and fully validated software pipeline and a scalable computational resource that, to date, we have applied to more than 10,000 whole genome and whole exome samples.
Subject(s)
Genomics/methods , High-Throughput Nucleotide Sequencing/methods , Internet , Software , Genome/genetics , Humans
ABSTRACT
Factor VIII (FVIII) functions as a cofactor for factor IXa in the contact coagulation pathway and circulates in a protective complex with von Willebrand factor (VWF). Plasma FVIII activity is strongly influenced by environmental and genetic factors through VWF-dependent and -independent mechanisms. Single nucleotide polymorphisms (SNPs) of the coding and promoter sequence in the FVIII gene have been extensively studied for effects on FVIII synthesis, secretion, and activity, but impacts of non-disease-causing intronic SNPs remain largely unknown. We analyzed FVIII SNPs and FVIII activity in 10,434 healthy Americans of European (EA) or African (AA) descent in the Atherosclerosis Risk in Communities (ARIC) study. Among covariates, age, race, diabetes, and ABO contributed 2.2%, 3.5%, 4%, and 10.7% to FVIII intersubject variation, respectively. Four intronic FVIII SNPs associated with FVIII activity and 8 with FVIII-VWF ratio in a sex- and race-dependent manner. The FVIII haplotypes AT and GCTTTT also associated with FVIII activity. Seven VWF SNPs were associated with FVIII activity in EA subjects, but no FVIII SNPs were associated with VWF Ag. These data demonstrate that intronic SNPs could directly or indirectly influence intersubject variation of FVIII activity. Further investigation may reveal novel mechanisms of regulating FVIII expression and activity.