Search | VHL Search Portal

Genome-wide identification of dominant polyadenylation hexamers for use in variant classification.

Shiferaw, Henoke K; Hong, Celine S; Cooper, David N; Johnston, Jennifer J; Biesecker, Leslie G.

Hum Mol Genet ; 32(23): 3211-3224, 2023 Nov 17.

Article in English | MEDLINE | ID: mdl-37606238

ABSTRACT

Polyadenylation is an essential process for the stabilization and export of mRNAs to the cytoplasm, and the polyadenylation signal hexamer (herein referred to as hexamer) plays a key role in this process. Yet, only 14 Mendelian disorders have been associated with hexamer variants. This is likely an under-ascertainment as hexamers are not well defined and not routinely examined in molecular analysis. To facilitate the interrogation of putatively pathogenic hexamer variants, we set out to define functionally important hexamers genome-wide as a resource for research and clinical testing interrogation. We identified predominant polyA sites (herein referred to as pPAS) and putative predominant hexamers across protein coding genes (PAS usage >50% per gene). As a measure of the validity of these sites, the population constraint of 4532 predominant hexamers was measured. The predominant hexamers had fewer observed variants compared to non-predominant hexamers and trimer controls, and CADD scores for variants in these hexamers were significantly higher than controls. Exome data for 1477 individuals were interrogated for hexamer variants and transcriptome data were generated for 76 individuals with 65 variants in predominant hexamers. 3' RNA-seq data showed these variants resulted in alternate polyadenylation events (38%) and in elongated mRNA transcripts (12%). Our lists of pPAS and predominant hexamers are available in the UCSC genome browser and on GitHub. We suggest this list of predominant hexamers can be used to interrogate exome and genome data. Variants in these predominant hexamers should be considered candidates for pathogenic variation in human disease, and to that end we suggest pathogenicity criteria for classifying hexamer variants.
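
Illustrative note: the "PAS usage >50% per gene" criterion can be pictured with a minimal sketch. The input layout (read counts keyed by gene and site), the function name, and the example counts are all hypothetical; the abstract does not describe how usage was actually computed.

```python
from collections import defaultdict


def predominant_sites(site_counts, threshold=0.5):
    """Return {gene: (site, usage)} for genes where one site exceeds `threshold`.

    `site_counts` maps (gene, site) -> supporting read count (assumed layout).
    """
    # Total reads per gene across all of its polyA sites.
    totals = defaultdict(int)
    for (gene, _site), count in site_counts.items():
        totals[gene] += count

    result = {}
    for (gene, site), count in site_counts.items():
        total = totals[gene]
        # Strictly greater than the threshold, so at most one site per gene qualifies.
        if total > 0 and count / total > threshold:
            result[gene] = (site, count / total)
    return result


# Made-up counts: GENE_A has a dominant site, GENE_B is split evenly.
example_counts = {
    ("GENE_A", "site1"): 80,
    ("GENE_A", "site2"): 20,
    ("GENE_B", "site1"): 50,
    ("GENE_B", "site2"): 50,
}
predominant = predominant_sites(example_counts)
```

With these counts only GENE_A has a predominant site, because an even 50/50 split does not exceed the threshold.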

Subject(s)

Genome , Polyadenylation , Humans , Polyadenylation/genetics

SomatoSim: precision simulation of somatic single nucleotide variants.

Hawari, Marwan A; Hong, Celine S; Biesecker, Leslie G.

BMC Bioinformatics ; 22(1): 109, 2021 Mar 06.

Article in English | MEDLINE | ID: mdl-33676403

ABSTRACT

BACKGROUND: Somatic single nucleotide variants have gained increased attention because of their role in cancer development and the widespread use of high-throughput sequencing techniques. The necessity to accurately identify these variants in sequencing data has led to a proliferation of somatic variant calling tools. Additionally, the use of simulated data to assess the performance of these tools has become common practice, as there is no gold standard dataset for benchmarking performance. However, many existing somatic variant simulation tools are limited because they rely on generating entirely synthetic reads derived from a reference genome or because they do not allow for the precise customizability that would enable a more focused understanding of single nucleotide variant calling performance. RESULTS: SomatoSim is a tool that lets users simulate somatic single nucleotide variants in sequence alignment map (SAM/BAM) files with full control of the specific variant positions, number of variants, variant allele fractions, depth of coverage, read quality, and base quality, among other parameters. SomatoSim accomplishes this through a three-stage process: variant selection, where candidate positions are selected for simulation, variant simulation, where reads are selected and mutated, and variant evaluation, where SomatoSim summarizes the simulation results. CONCLUSIONS: SomatoSim is a user-friendly tool that offers a high level of customizability for simulating somatic single nucleotide variants. SomatoSim is available at https://github.com/BieseckerLab/SomatoSim .
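
Illustrative note: the variant simulation stage (selecting reads and mutating them to reach a target variant allele fraction) can be sketched on a toy pileup. This is not SomatoSim's code or interface; the function name, the list-of-bases input, and the rounding rule are assumptions made for illustration, and the real tool operates on SAM/BAM files.

```python
import random


def simulate_snv(bases, alt, target_vaf, seed=0):
    """Mutate enough non-alt bases at one position to approximate `target_vaf`.

    `bases` is a list of base calls covering a single position (toy pileup).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n_to_mutate = round(target_vaf * len(bases))
    # Only reads that do not already carry the alternate base are candidates.
    candidates = [i for i, base in enumerate(bases) if base != alt]
    chosen = rng.sample(candidates, min(n_to_mutate, len(candidates)))
    mutated = list(bases)
    for i in chosen:
        mutated[i] = alt
    return mutated


# Toy example: 200x depth of reference 'A', aiming for a 5% 'T' variant.
pileup = ["A"] * 200
mutated_pileup = simulate_snv(pileup, alt="T", target_vaf=0.05)
observed_vaf = mutated_pileup.count("T") / len(mutated_pileup)
```

Because the number of mutated reads is an integer, the achieved fraction can only approximate the target at low depth, which is one reason depth of coverage and allele fraction are controlled together.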

Subject(s)

Algorithms , Nucleotides , Software , Computer Simulation , High-Throughput Nucleotide Sequencing , Polymorphism, Single Nucleotide

Low-level variant calling for non-matched samples using a position-based and nucleotide-specific approach.

Dudley, Jeffrey N; Hong, Celine S; Hawari, Marwan A; Shwetar, Jasmine; Sapp, Julie C; Lack, Justin; Shiferaw, Henoke; Johnston, Jennifer J; Biesecker, Leslie G.

BMC Bioinformatics ; 22(1): 181, 2021 Apr 08.

Article in English | MEDLINE | ID: mdl-33832433

ABSTRACT

BACKGROUND: The widespread use of next-generation sequencing has identified an important role for somatic mosaicism in many diseases. However, detecting low-level mosaic variants from next-generation sequencing data remains challenging. RESULTS: Here, we present a method for Position-Based Variant Identification (PBVI) that uses empirically-derived distributions of alternate nucleotides from a control dataset. We modeled this approach on 11 segmental overgrowth genes. We show that this method improves detection of single nucleotide mosaic variants of 0.01-0.05 variant allele fraction compared to other low-level variant callers. At depths of 600 × and 1200 ×, we observed > 85% and > 95% sensitivity, respectively. In a cohort of 26 individuals with somatic overgrowth disorders PBVI showed improved signal to noise, identifying pathogenic variants in 17 individuals. CONCLUSION: PBVI can facilitate identification of low-level mosaic variants thus increasing the utility of next-generation sequencing data for research and diagnostic purposes.
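
Illustrative note: the idea of comparing a sample against an empirically derived, position- and nucleotide-specific control distribution can be sketched as below. The abstract does not state which statistic PBVI uses, so the empirical p-value here is an assumption, and the coordinates and control fractions are made up.

```python
def empirical_p_value(sample_fraction, control_fractions):
    """Share of controls with an alternate-allele fraction at least as high
    as the sample's, with add-one smoothing so the result is never zero."""
    n_at_least = sum(1 for f in control_fractions if f >= sample_fraction)
    return (1 + n_at_least) / (1 + len(control_fractions))


# Hypothetical control data keyed by (chromosome, position, alternate nucleotide).
controls = {
    ("chr1", 1000, "T"): [0.001, 0.002, 0.0, 0.003, 0.001, 0.002, 0.001, 0.0, 0.002],
}

background = controls[("chr1", 1000, "T")]
p_candidate = empirical_p_value(0.02, background)   # well above control noise
p_noise = empirical_p_value(0.001, background)      # within control noise
```

Keying the controls by both position and alternate nucleotide reflects the "position-based and nucleotide-specific" framing: the noise floor for one substitution at one site need not match another.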

Subject(s)

High-Throughput Nucleotide Sequencing , Nucleotides , Alleles , Cohort Studies , Humans , Nucleotides/genetics , Software

Characterizing reduced coverage regions through comparison of exome and genome sequencing data across 10 centers.

Sanghvi, Rashesh V; Buhay, Christian J; Powell, Bradford C; Tsai, Ellen A; Dorschner, Michael O; Hong, Celine S; Lebo, Matthew S; Sasson, Ariella; Hanna, David S; McGee, Sean; Bowling, Kevin M; Cooper, Gregory M; Gray, David E; Lonigro, Robert J; Dunford, Andrew; Brennan, Christine A; Cibulskis, Carrie; Walker, Kimberly; Carneiro, Mauricio O; Sailsbery, Joshua; Hindorff, Lucia A; Robinson, Dan R; Santani, Avni; Sarmady, Mahdi; Rehm, Heidi L; Biesecker, Leslie G; Nickerson, Deborah A; Hutter, Carolyn M; Garraway, Levi; Muzny, Donna M; Wagle, Nikhil.

Genet Med ; 20(8): 855-866, 2018 08.

Article in English | MEDLINE | ID: mdl-29144510

ABSTRACT

PURPOSE: As massively parallel sequencing is increasingly being used for clinical decision making, it has become critical to understand parameters that affect sequencing quality and to establish methods for measuring and reporting clinical sequencing standards. In this report, we propose a definition for reduced coverage regions and describe a set of standards for variant calling in clinical sequencing applications. METHODS: To enable sequencing centers to assess the regions of poor sequencing quality in their own data, we optimized and used a tool (ExCID) to identify reduced coverage loci within genes or regions of particular interest. We used this framework to examine sequencing data from 500 patients generated in 10 projects at sequencing centers in the National Human Genome Research Institute/National Cancer Institute Clinical Sequencing Exploratory Research Consortium. RESULTS: This approach identified reduced coverage regions in clinically relevant genes, including known clinically relevant loci that were uniquely missed at individual centers, in multiple centers, and in all centers. CONCLUSION: This report provides a process road map for clinical sequencing centers looking to perform similar analyses on their data.

Subject(s)

Exome Sequencing/methods , Sequence Analysis, DNA/methods , Whole Genome Sequencing/methods , Base Sequence , Chromosome Mapping , Exome , Genome, Human , High-Throughput Nucleotide Sequencing/methods , Humans , Sequence Analysis, DNA/standards , Software

Assessing the capability of massively parallel sequencing for opportunistic pharmacogenetic screening.

Ng, David; Hong, Celine S; Singh, Larry N; Johnston, Jennifer J; Mullikin, James C; Biesecker, Leslie G.

Genet Med ; 19(3): 357-361, 2017 03.

Article in English | MEDLINE | ID: mdl-27537706

ABSTRACT

PURPOSE: The aim of the study was to assess exome data for preemptive pharmacogenetic screening for 203 clinically relevant pharmacogenetic variant positions from the Pharmacogenomics Knowledgebase and Clinical Pharmacogenetics Implementation Consortium and identify copy-number variants (CNVs) in CYP2D6. METHODS: We examined the coverage and genotype quality of 203 pharmacogenetic variant positions in 973 exomes compared with five genomes and with five genotyping chip data sets. Then, we determined the agreement of exome and chip genotypes by evaluating concordance in a three-way comparison of exome, genome, and chip-based genotyping at 1,929 variant positions in five individuals. Finally, we evaluated the utility of exomes for detecting CYP2D6 CNVs. RESULTS: For 5 individuals examined for 203 pharmacogenetic variants (5 × 203 = 1,015), 998/1,015 were identified by genome, 849/1,015 were identified by exome, and 295/1,015 by genotyping chip. Thirty-six pharmacogenetic star allele variants with moderate to strong Clinical Pharmacogenetics Implementation Consortium (CPIC) therapeutic recommendations were identified in 973 exomes. Exomes had high (98%) genotype concordance with chip-based genotyping. CYP2D6 CNVs were identified in 57/973 exomes. CONCLUSIONS: Exomes outperformed the current chip-based assay in detecting more important pharmacogenetic variant positions and CYP2D6 CNVs for preemptive pharmacogenetic screening. Tools should be developed to derive pharmacogenetic variants from exomes.

Subject(s)

Oligonucleotide Array Sequence Analysis/methods , Pharmacogenomic Testing/methods , Alleles , Cytochrome P-450 CYP2D6/genetics , DNA Copy Number Variations , Exome , Genotype , High-Throughput Nucleotide Sequencing/methods , Humans , Pharmacogenetics

Comprehensive characterization of the genomic alterations in human gastric cancer.

Cui, Juan; Yin, Yanbin; Ma, Qin; Wang, Guoqing; Olman, Victor; Zhang, Yu; Chou, Wen-Chi; Hong, Celine S; Zhang, Chi; Cao, Sha; Mao, Xizeng; Li, Ying; Qin, Steve; Zhao, Shaying; Jiang, Jing; Hastings, Phil; Li, Fan; Xu, Ying.

Int J Cancer ; 137(1): 86-95, 2015 Jul 01.

Article in English | MEDLINE | ID: mdl-25422082

ABSTRACT

Gastric cancer is one of the most prevalent and aggressive cancers worldwide, and its molecular mechanism remains largely elusive. Here we report the genomic landscape of human primary gastric adenocarcinoma, based on the complete genome sequences of five pairs of cancer and matching normal samples. In total, 103,464 somatic point mutations, including 407 nonsynonymous ones, were identified and the most recurrent mutations were harbored by Mucins (MUC3A and MUC12) and transcription factors (ZNF717, ZNF595 and TP53). 679 genomic rearrangements were detected, which affect 355 protein-coding genes; and 76 genes show copy number changes. Through mapping the boundaries of the rearranged regions to the folded three-dimensional structure of human chromosomes, we determined that 79.6% of the chromosomal rearrangements happen among DNA fragments in close spatial proximity, especially when two endpoints stay in a similar replication phase. We present evidence that microhomology-mediated break-induced replication was utilized as a mechanism in inducing ~40.9% of the identified genomic changes in gastric tumor. Our data analyses revealed potential integrations of Helicobacter pylori DNA into the gastric cancer genomes. Overall, a large set of novel genomic variations was detected in these gastric cancer genomes, which may be essential to the study of the genetic basis and molecular mechanism of gastric tumorigenesis.

Subject(s)

Adenocarcinoma/genetics , Chromosome Aberrations , Genetic Variation , Helicobacter Infections/genetics , Helicobacter pylori/physiology , Stomach Neoplasms/genetics , Adenocarcinoma/pathology , Adenocarcinoma/virology , Aged , DNA Copy Number Variations , DNA, Viral/analysis , Genome, Human , Humans , Male , Middle Aged , Point Mutation , Polymorphism, Single Nucleotide , Stomach Neoplasms/pathology , Stomach Neoplasms/virology

Assessing the reproducibility of exome copy number variations predictions.

Hong, Celine S; Singh, Larry N; Mullikin, James C; Biesecker, Leslie G.

Genome Med ; 8(1): 82, 2016 08 08.

Article in English | MEDLINE | ID: mdl-27503473

ABSTRACT

BACKGROUND: Reproducibility is receiving increased attention across many domains of science and genomics is no exception. Efforts to identify copy number variations (CNVs) from exome sequence (ES) data have been increasing. Many algorithms have been published to discover CNVs from exomes and a major challenge is the reproducibility in other datasets. Here we test exome CNV calling reproducibility under three conditions: data generated by different sequencing centers; varying sample sizes; and varying capture methodology. METHODS: Four CNV tools were tested: eXome Hidden Markov Model (XHMM), Copy Number Inference From Exome Reads (CoNIFER), EXCAVATOR, and Copy Number Analysis for Targeted Resequencing (CONTRA). To examine the reproducibility, we ran the callers on four datasets, varying sample sizes of N = 10, 30, 75, 100, 300, and data with different capture methodology. We examined the false negative (FN) calls and false positive (FP) calls for potential limitations of the CNV callers. The positive predictive value (PPV) was measured by checking the CNV call concordance against single nucleotide polymorphism array. RESULTS: Using independently generated datasets, we examined the PPV for each dataset and observed a wide range of PPVs. The PPV values were highly data dependent (p <0.001). For the sample sizes and capture method analyses, we tested the callers in triplicate. Both analyses resulted in wide ranges of PPVs, even for the same test. Interestingly, negative correlations between the PPV and the sample sizes were observed for CoNIFER (ρ = -0.80). Further examination of FN calls showed that 44 % of these were missed by all callers and were attributed to the CNV size (46 % spanned ≤3 exons). Overlap of the FP calls showed that FPs were unique to each caller, indicative of algorithm dependency. CONCLUSIONS: Our results demonstrate that further improvements in CNV callers are necessary to improve reproducibility and to include a wider spectrum of CNVs (including the small CNVs). These CNV callers should be evaluated on multiple independent, heterogeneously generated datasets of varying size to increase robustness and utility. These approaches to the evaluation of exome CNV are essential to support wide utility and applicability of CNV discovery in exome studies.
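
Illustrative note: the positive predictive value used in this comparison is PPV = TP / (TP + FP), where a true positive is a CNV call confirmed by the SNP array. The sketch below assumes, for simplicity, that calls and array-confirmed events share identifiers; a real comparison would match calls by genomic overlap, and the identifiers here are made up.

```python
def positive_predictive_value(calls, array_confirmed):
    """PPV = TP / (TP + FP), where every call is either a TP or an FP.

    Returns None when the caller made no calls (PPV is undefined).
    """
    calls = set(calls)
    if not calls:
        return None
    true_positives = len(calls & set(array_confirmed))
    return true_positives / len(calls)


# Hypothetical identifiers: two of four exome calls are confirmed by the array.
exome_calls = {"cnv1", "cnv2", "cnv3", "cnv4"}
array_events = {"cnv1", "cnv3", "cnv9"}
ppv = positive_predictive_value(exome_calls, array_events)
```

Array events that the caller missed (such as "cnv9" here) are false negatives; they do not enter the PPV and are examined separately.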

Subject(s)

Algorithms , DNA Copy Number Variations , Exome , Sequence Analysis, DNA/statistics & numerical data , Datasets as Topic , High-Throughput Nucleotide Sequencing , Humans , Markov Chains , Polymorphism, Single Nucleotide , Reproducibility of Results , Sample Size

A computational method for prediction of excretory proteins and application to identification of gastric cancer markers in urine.

Hong, Celine S; Cui, Juan; Ni, Zhaohui; Su, Yingying; Puett, David; Li, Fan; Xu, Ying.

PLoS One ; 6(2): e16875, 2011 Feb 18.

Article in English | MEDLINE | ID: mdl-21365014

ABSTRACT

A novel computational method for prediction of proteins excreted into urine is presented. The method is based on the identification of a list of distinguishing features between proteins found in the urine of healthy people and proteins deemed not to be urine excretory. These features are used to train a classifier to distinguish the two classes of proteins. When used in conjunction with information of which proteins are differentially expressed in diseased tissues of a specific type versus control tissues, this method can be used to predict potential urine markers for the disease. Here we report the detailed algorithm of this method and an application to identification of urine markers for gastric cancer. The performance of the trained classifier on 163 proteins was experimentally validated using antibody arrays, achieving >80% true positive rate. By applying the classifier on differentially expressed genes in gastric cancer vs normal gastric tissues, it was found that endothelial lipase (EL) was substantially suppressed in the urine samples of 21 gastric cancer patients versus 21 healthy individuals. Overall, we have demonstrated that our predictor for urine excretory proteins is highly effective and could potentially serve as a powerful tool in searches for disease biomarkers in urine in general.

Subject(s)

Biomarkers, Tumor/urine , Carcinoma/diagnosis , Computational Biology/methods , Proteins/metabolism , Stomach Neoplasms/diagnosis , Algorithms , Biomarkers, Tumor/analysis , Carcinoma/metabolism , Carcinoma/urine , Forecasting/methods , Humans , Prognosis , Stomach Neoplasms/metabolism , Stomach Neoplasms/urine , Urinalysis/methods , Urinalysis/statistics & numerical data
