Search | VHL Regional Portal

1.

Genetic risk factors for COVID-19 and influenza are largely distinct.

Kosmicki, Jack A; Marcketta, Anthony; Sharma, Deepika; Di Gioia, Silvio Alessandro; Batista, Samantha; Yang, Xiao-Man; Tzoneva, Gannie; Martinez, Hector; Sidore, Carlo; Kessler, Michael D; Horowitz, Julie E; Roberts, Genevieve H L; Justice, Anne E; Banerjee, Nilanjana; Coignet, Marie V; Leader, Joseph B; Park, Danny S; Lanche, Rouel; Maxwell, Evan; Knight, Spencer C; Bai, Xiaodong; Guturu, Harendra; Baltzell, Asher; Girshick, Ahna R; McCurdy, Shannon R; Partha, Raghavendran; Mansfield, Adam J; Turissini, David A; Zhang, Miao; Mbatchou, Joelle; Watanabe, Kyoko; Verma, Anurag; Sirugo, Giorgio; Ritchie, Marylyn D; Salerno, William J; Shuldiner, Alan R; Rader, Daniel J; Mirshahi, Tooraj; Marchini, Jonathan; Overton, John D; Carey, David J; Habegger, Lukas; Reid, Jeffrey G; Economides, Aris; Kyratsous, Christos; Karalis, Katia; Baum, Alina; Cantor, Michael N; Rand, Kristin A; Hong, Eurie L.

Nat Genet ; 56(8): 1592-1596, 2024 Aug.

Article in English | MEDLINE | ID: mdl-39103650

ABSTRACT

Coronavirus disease 2019 (COVID-19) and influenza are respiratory illnesses caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and influenza viruses, respectively. Both diseases share symptoms and clinical risk factors (ref. 1), but the extent to which these conditions have a common genetic etiology is unknown. This is partly because host genetic risk factors are well characterized for COVID-19 but not for influenza, with the largest published genome-wide association studies for these conditions including >2 million individuals (ref. 2) and about 1,000 individuals (refs. 3-6), respectively. Shared genetic risk factors could point to targets to prevent or treat both infections. Through a genetic study of 18,334 cases with a positive test for influenza and 276,295 controls, we show that published COVID-19 risk variants are not associated with influenza. Furthermore, we discovered and replicated an association between influenza infection and noncoding variants in B3GALT5 and ST6GAL1, neither of which was associated with COVID-19. In vitro small interfering RNA knockdown of ST6GAL1 (an enzyme that adds sialic acid to the cell surface, which is used for viral entry) reduced influenza infectivity by 57%. These results mirror the observation that variants that downregulate ACE2, the SARS-CoV-2 receptor, protect against COVID-19 (ref. 7). Collectively, these findings highlight downregulation of key cell surface receptors used for viral entry as treatment opportunities to prevent COVID-19 and influenza.

Subject(s)

COVID-19 , Genetic Predisposition to Disease , Genome-Wide Association Study , Influenza, Human , SARS-CoV-2 , Humans , Influenza, Human/genetics , Influenza, Human/epidemiology , Influenza, Human/virology , COVID-19/genetics , COVID-19/virology , Risk Factors , SARS-CoV-2/genetics , Male , Female , Polymorphism, Single Nucleotide , Case-Control Studies , Middle Aged

2.

A deep catalogue of protein-coding variation in 983,578 individuals.

Sun, Kathie Y; Bai, Xiaodong; Chen, Siying; Bao, Suying; Zhang, Chuanyi; Kapoor, Manav; Backman, Joshua; Joseph, Tyler; Maxwell, Evan; Mitra, George; Gorovits, Alexander; Mansfield, Adam; Boutkov, Boris; Gokhale, Sujit; Habegger, Lukas; Marcketta, Anthony; Locke, Adam E; Ganel, Liron; Hawes, Alicia; Kessler, Michael D; Sharma, Deepika; Staples, Jeffrey; Bovijn, Jonas; Gelfman, Sahar; Di Gioia, Alessandro; Rajagopal, Veera M; Lopez, Alexander; Varela, Jennifer Rico; Alegre-Díaz, Jesús; Berumen, Jaime; Tapia-Conyer, Roberto; Kuri-Morales, Pablo; Torres, Jason; Emberson, Jonathan; Collins, Rory; Cantor, Michael; Thornton, Timothy; Kang, Hyun Min; Overton, John D; Shuldiner, Alan R; Cremona, M Laura; Nafde, Mona; Baras, Aris; Abecasis, Gonçalo; Marchini, Jonathan; Reid, Jeffrey G; Salerno, William; Balasubramanian, Suganthi.

Nature ; 631(8021): 583-592, 2024 Jul.

Article in English | MEDLINE | ID: mdl-38768635

ABSTRACT

Rare coding variants that substantially affect function provide insights into the biology of a gene (refs. 1-3). However, ascertaining the frequency of such variants requires large sample sizes (refs. 4-8). Here we present a catalogue of human protein-coding variation, derived from exome sequencing of 983,578 individuals across diverse populations. In total, 23% of the Regeneron Genetics Center Million Exome (RGC-ME) data come from individuals of African, East Asian, Indigenous American, Middle Eastern and South Asian ancestry. The catalogue includes more than 10.4 million missense and 1.1 million predicted loss-of-function (pLOF) variants. We identify individuals with rare biallelic pLOF variants in 4,848 genes, 1,751 of which have not been previously reported. From precise quantitative estimates of selection against heterozygous loss of function (LOF), we identify 3,988 LOF-intolerant genes, including 86 that were previously assessed as tolerant and 1,153 that lack established disease annotation. We also define regions of missense depletion at high resolution. Notably, 1,482 genes have regions that are depleted of missense variants despite being tolerant of pLOF variants. Finally, we estimate that 3% of individuals have a clinically actionable genetic variant, and that 11,773 variants reported in ClinVar with unknown significance are likely to be deleterious cryptic splice sites. To facilitate variant interpretation and genetics-informed precision medicine, we make this resource of coding variation from the RGC-ME dataset publicly accessible through a variant allele frequency browser.

Subject(s)

Exome , Genetic Variation , Proteins , Humans , Alleles , Exome/genetics , Exome Sequencing , Gene Frequency , Genetic Variation/genetics , Heterozygote , Loss of Function Mutation/genetics , Mutation, Missense/genetics , Open Reading Frames/genetics , Proteins/genetics , RNA Splice Sites/genetics , Precision Medicine

3.

Author Correction: Genotyping, sequencing and analysis of 140,000 adults from Mexico City.

Ziyatdinov, Andrey; Torres, Jason; Alegre-Díaz, Jesús; Backman, Joshua; Mbatchou, Joelle; Turner, Michael; Gaynor, Sheila M; Joseph, Tyler; Zou, Yuxin; Liu, Daren; Wade, Rachel; Staples, Jeffrey; Panea, Razvan; Popov, Alex; Bai, Xiaodong; Balasubramanian, Suganthi; Habegger, Lukas; Lanche, Rouel; Lopez, Alex; Maxwell, Evan; Jones, Marcus; García-Ortiz, Humberto; Ramirez-Reyes, Raul; Santacruz-Benítez, Rogelio; Nag, Abhishek; Smith, Katherine R; Damask, Amy; Lin, Nan; Paulding, Charles; Reppell, Mark; Zöllner, Sebastian; Jorgenson, Eric; Salerno, William; Petrovski, Slavé; Overton, John; Reid, Jeffrey; Thornton, Timothy A; Abecasis, Gonçalo; Berumen, Jaime; Orozco-Orozco, Lorena; Collins, Rory; Baras, Aris; Hill, Michael R; Emberson, Jonathan R; Marchini, Jonathan; Kuri-Morales, Pablo; Tapia-Conyer, Roberto.

Nature ; 626(8001): E18, 2024 Feb.

Article in English | MEDLINE | ID: mdl-38332034

4.

Genotyping, sequencing and analysis of 140,000 adults from Mexico City.

Ziyatdinov, Andrey; Torres, Jason; Alegre-Díaz, Jesús; Backman, Joshua; Mbatchou, Joelle; Turner, Michael; Gaynor, Sheila M; Joseph, Tyler; Zou, Yuxin; Liu, Daren; Wade, Rachel; Staples, Jeffrey; Panea, Razvan; Popov, Alex; Bai, Xiaodong; Balasubramanian, Suganthi; Habegger, Lukas; Lanche, Rouel; Lopez, Alex; Maxwell, Evan; Jones, Marcus; García-Ortiz, Humberto; Ramirez-Reyes, Raul; Santacruz-Benítez, Rogelio; Nag, Abhishek; Smith, Katherine R; Damask, Amy; Lin, Nan; Paulding, Charles; Reppell, Mark; Zöllner, Sebastian; Jorgenson, Eric; Salerno, William; Petrovski, Slavé; Overton, John; Reid, Jeffrey; Thornton, Timothy A; Abecasis, Gonçalo; Berumen, Jaime; Orozco-Orozco, Lorena; Collins, Rory; Baras, Aris; Hill, Michael R; Emberson, Jonathan R; Marchini, Jonathan; Kuri-Morales, Pablo; Tapia-Conyer, Roberto.

Nature ; 622(7984): 784-793, 2023 Oct.

Article in English | MEDLINE | ID: mdl-37821707

ABSTRACT

The Mexico City Prospective Study is a prospective cohort of more than 150,000 adults recruited two decades ago from the urban districts of Coyoacán and Iztapalapa in Mexico City (ref. 1). Here we generated genotype and exome-sequencing data for all individuals and whole-genome sequencing data for 9,950 selected individuals. We describe high levels of relatedness and substantial heterogeneity in ancestry composition across individuals. Most sequenced individuals had admixed Indigenous American, European and African ancestry, with extensive admixture from Indigenous populations in central, southern and southeastern Mexico. Indigenous Mexican segments of the genome had lower levels of coding variation but an excess of homozygous loss-of-function variants compared with segments of African and European origin. We estimated ancestry-specific allele frequencies at 142 million genomic variants, with an effective sample size of 91,856 for Indigenous Mexican ancestry at exome variants, all available through a public browser. Using whole-genome sequencing, we developed an imputation reference panel that outperforms existing panels at common variants in individuals with high proportions of central, southern and southeastern Indigenous Mexican ancestry. Our work illustrates the value of genetic studies in diverse populations and provides foundational imputation and allele frequency resources for future genetic studies in Mexico and in the United States, where the Hispanic/Latino population is predominantly of Mexican descent.

Subject(s)

Exome Sequencing , Genome, Human , Genotype , Hispanic or Latino , Adult , Humans , Africa/ethnology , Americas/ethnology , Europe/ethnology , Gene Frequency/genetics , Genetics, Population , Genome, Human/genetics , Genotyping Techniques , Hispanic or Latino/genetics , Homozygote , Loss of Function Mutation/genetics , Mexico , Prospective Studies

5.

A deep catalog of protein-coding variation in 985,830 individuals.

Sun, Kathie Y; Bai, Xiaodong; Chen, Siying; Bao, Suying; Kapoor, Manav; Zhang, Chuanyi; Backman, Joshua; Joseph, Tyler; Maxwell, Evan; Mitra, George; Gorovits, Alexander; Mansfield, Adam; Boutkov, Boris; Gokhale, Sujit; Habegger, Lukas; Marcketta, Anthony; Locke, Adam; Kessler, Michael D; Sharma, Deepika; Staples, Jeffrey; Bovijn, Jonas; Gelfman, Sahar; Gioia, Alessandro Di; Rajagopal, Veera; Lopez, Alexander; Varela, Jennifer Rico; Alegre, Jesus; Berumen, Jaime; Tapia-Conyer, Roberto; Kuri-Morales, Pablo; Torres, Jason; Emberson, Jonathan; Collins, Rory; Cantor, Michael; Thornton, Timothy; Kang, Hyun Min; Overton, John; Shuldiner, Alan R; Cremona, M Laura; Nafde, Mona; Baras, Aris; Abecasis, Goncalo; Marchini, Jonathan; Reid, Jeffrey G; Salerno, William; Balasubramanian, Suganthi.

bioRxiv ; 2023 Nov 02.

Article in English | MEDLINE | ID: mdl-37214792

ABSTRACT

Coding variants that have significant impact on function can provide insights into the biology of a gene but are typically rare in the population. Identifying and ascertaining the frequency of such rare variants requires very large sample sizes. Here, we present the largest catalog of human protein-coding variation to date, derived from exome sequencing of 985,830 individuals of diverse ancestry to serve as a rich resource for studying rare coding variants. Individuals of African, Admixed American, East Asian, Middle Eastern, and South Asian ancestry account for 20% of this Exome dataset. Our catalog of variants includes approximately 10.5 million missense (54% novel) and 1.1 million predicted loss-of-function (pLOF) variants (65% novel, 53% observed only once). We identified individuals with rare homozygous pLOF variants in 4,874 genes, and for 1,838 of these this work is the first to document at least one pLOF homozygote. Additional insights from the RGC-ME dataset include 1) improved estimates of selection against heterozygous loss-of-function and identification of 3,459 genes intolerant to loss-of-function, 83 of which were previously assessed as tolerant to loss-of-function and 1,241 that lack disease annotations; 2) identification of regions depleted of missense variation in 457 genes that are tolerant to loss-of-function; 3) functional interpretation for 10,708 variants of unknown or conflicting significance reported in ClinVar as cryptic splice sites using splicing score thresholds based on empirical variant deleteriousness scores derived from RGC-ME; and 4) an observation that approximately 3% of sequenced individuals carry a clinically actionable genetic variant in the ACMG SF 3.1 list of genes. We make this important resource of coding variation available to the public through a variant allele frequency browser. 
We anticipate that this report and the RGC-ME dataset will serve as a valuable reference for understanding rare coding variation and help advance precision medicine efforts.

6.

Medical manifestations and health care utilization among adult MyCode participants with neurodevelopmental psychiatric copy number variants.

Finucane, Brenda; Oetjens, Matthew T; Johns, Alicia; Myers, Scott M; Fisher, Ciaran; Habegger, Lukas; Maxwell, Evan K; Reid, Jeffrey G; Ledbetter, David H; Kirchner, H Lester; Martin, Christa Lese.

Genet Med ; 24(3): 703-711, 2022 03.

Article in English | MEDLINE | ID: mdl-34906480

ABSTRACT

PURPOSE: Recurrent pathogenic copy number variants (pCNVs) have large-effect impacts on brain function and represent important etiologies of neurodevelopmental psychiatric disorders (NPDs), including autism and schizophrenia. Patterns of health care utilization in adults with pCNVs have gone largely unstudied and are likely to differ in significant ways from those of children. METHODS: We compared the prevalence of NPDs and electronic health record-based medical conditions in 928 adults with 26 pCNVs to a demographically-matched cohort of pCNV-negative controls from >135,000 patient-participants in Geisinger's MyCode Community Health Initiative. We also evaluated 3 quantitative health care utilization measures (outpatient, inpatient, and emergency department visits) in both groups. RESULTS: Adults with pCNVs (24.9%) were more likely than controls (16.0%) to have a documented NPD. They had significantly higher rates of several chronic diseases, including diabetes (29.3% in participants with pCNVs vs 20.4% in participants without pCNVs) and dementia (2.2% in participants with pCNVs vs 1.0% in participants without pCNVs), and twice as many annual emergency department visits. CONCLUSION: These findings highlight the potential for genetic information (specifically, pCNVs) to inform the study of health care outcomes and utilization in adults. If, as our findings suggest, adults with pCNVs have poorer health and require disproportionate health care resources, early genetic diagnosis paired with patient-centered interventions may help to anticipate problems, improve outcomes, and reduce the associated economic burden.

Subject(s)

DNA Copy Number Variations , Delivery of Health Care , Adult , Child , Cohort Studies , DNA Copy Number Variations/genetics , Humans , Patient Acceptance of Health Care , Prevalence

7.

Exome sequencing and analysis of 454,787 UK Biobank participants.

Backman, Joshua D; Li, Alexander H; Marcketta, Anthony; Sun, Dylan; Mbatchou, Joelle; Kessler, Michael D; Benner, Christian; Liu, Daren; Locke, Adam E; Balasubramanian, Suganthi; Yadav, Ashish; Banerjee, Nilanjana; Gillies, Christopher E; Damask, Amy; Liu, Simon; Bai, Xiaodong; Hawes, Alicia; Maxwell, Evan; Gurski, Lauren; Watanabe, Kyoko; Kosmicki, Jack A; Rajagopal, Veera; Mighty, Jason; Jones, Marcus; Mitnaul, Lyndon; Stahl, Eli; Coppola, Giovanni; Jorgenson, Eric; Habegger, Lukas; Salerno, William J; Shuldiner, Alan R; Lotta, Luca A; Overton, John D; Cantor, Michael N; Reid, Jeffrey G; Yancopoulos, George; Kang, Hyun M; Marchini, Jonathan; Baras, Aris; Abecasis, Gonçalo R; Ferreira, Manuel A R.

Nature ; 599(7886): 628-634, 2021 11.

Article in English | MEDLINE | ID: mdl-34662886

ABSTRACT

A major goal in human genetics is to use natural variation to understand the phenotypic consequences of altering each protein-coding gene in the genome. Here we used exome sequencing (ref. 1) to explore protein-altering variants and their consequences in 454,787 participants in the UK Biobank study (ref. 2). We identified 12 million coding variants, including around 1 million loss-of-function and around 1.8 million deleterious missense variants. When these were tested for association with 3,994 health-related traits, we found 564 genes with trait associations at P ≤ 2.18 × 10⁻¹¹. Rare variant associations were enriched in loci from genome-wide association studies (GWAS), but most (91%) were independent of common variant signals. We discovered several risk-increasing associations with traits related to liver disease, eye disease and cancer, among others, as well as risk-lowering associations for hypertension (SLC9A3R2), diabetes (MAP3K15, FAM234A) and asthma (SLC27A3). Six genes were associated with brain imaging phenotypes, including two involved in neural development (GBE1, PLD1). Of the signals available and powered for replication in an independent cohort, 81% were confirmed; furthermore, association signals were generally consistent across individuals of European, Asian and African ancestry. We illustrate the ability of exome sequencing to identify gene-trait associations, elucidate gene function and pinpoint effector genes that underlie GWAS signals at scale.

Subject(s)

Biological Specimen Banks , Databases, Genetic , Exome Sequencing , Exome/genetics , Africa/ethnology , Asia/ethnology , Asthma/genetics , Diabetes Mellitus/genetics , Europe/ethnology , Eye Diseases/genetics , Female , Genetic Predisposition to Disease/genetics , Genetic Variation , Genome-Wide Association Study , Humans , Hypertension/genetics , Liver Diseases/genetics , Male , Mutation , Neoplasms/genetics , Quantitative Trait, Heritable , United Kingdom

8.

Enzyme inhibition as a potential therapeutic strategy to treat COVID-19 infection.

Paulsson-Habegger, Lukas; Snabaitis, Andrew K; Wren, Stephen P.

Bioorg Med Chem ; 48: 116389, 2021 10 15.

Article in English | MEDLINE | ID: mdl-34543844

ABSTRACT

With the emergence of the third infectious and virulent coronavirus within the past two decades, it has become increasingly important to understand how the virus causes infection. This will inform therapeutic strategies that target vulnerabilities in the vital processes through which the virus enters cells. This review identifies enzymes responsible for SARS-CoV-2 viral entry into cells (ACE2, Furin, TMPRSS2) and discusses compounds proposed to inhibit viral entry with the end goal of treating COVID-19 infection. We argue that TMPRSS2 inhibitors show the most promise in potentially treating COVID-19, in addition to being a pre-existing medication with fewer predicted side-effects.

Subject(s)

Angiotensin Receptor Antagonists/therapeutic use , Angiotensin-Converting Enzyme 2/antagonists & inhibitors , Antiviral Agents/therapeutic use , COVID-19 Drug Treatment , Janus Kinase Inhibitors/therapeutic use , SARS-CoV-2/drug effects , Animals , Drug Combinations , Humans , Methotrexate/therapeutic use , Receptors, Angiotensin/metabolism , Signal Transduction/drug effects

9.

Computationally efficient whole-genome regression for quantitative and binary traits.

Mbatchou, Joelle; Barnard, Leland; Backman, Joshua; Marcketta, Anthony; Kosmicki, Jack A; Ziyatdinov, Andrey; Benner, Christian; O'Dushlaine, Colm; Barber, Mathew; Boutkov, Boris; Habegger, Lukas; Ferreira, Manuel; Baras, Aris; Reid, Jeffrey; Abecasis, Goncalo; Maxwell, Evan; Marchini, Jonathan.

Nat Genet ; 53(7): 1097-1103, 2021 07.

Article in English | MEDLINE | ID: mdl-34017140

ABSTRACT

Genome-wide association analysis of cohorts with thousands of phenotypes is computationally expensive, particularly when accounting for sample relatedness or population structure. Here we present a novel machine-learning method called REGENIE for fitting a whole-genome regression model for quantitative and binary phenotypes that is substantially faster than alternatives in multi-trait analyses while maintaining statistical efficiency. The method naturally accommodates parallel analysis of multiple phenotypes and requires only local segments of the genotype matrix to be loaded in memory, in contrast to existing alternatives, which must load genome-wide matrices into memory. This results in substantial savings in compute time and memory usage. We introduce a fast, approximate Firth logistic regression test for unbalanced case-control phenotypes. The method is ideally suited to take advantage of distributed computing frameworks. We demonstrate the accuracy and computational benefits of this approach using the UK Biobank dataset with up to 407,746 individuals.

Subject(s)

Computational Biology , Genome-Wide Association Study , Genomics , Case-Control Studies , Computational Biology/methods , Genome-Wide Association Study/methods , Genomics/methods , Genotype , Humans , Logistic Models , Machine Learning , Phenotype , Reproducibility of Results

10.

Exome sequencing and characterization of 49,960 individuals in the UK Biobank.

Van Hout, Cristopher V; Tachmazidou, Ioanna; Backman, Joshua D; Hoffman, Joshua D; Liu, Daren; Pandey, Ashutosh K; Gonzaga-Jauregui, Claudia; Khalid, Shareef; Ye, Bin; Banerjee, Nilanjana; Li, Alexander H; O'Dushlaine, Colm; Marcketta, Anthony; Staples, Jeffrey; Schurmann, Claudia; Hawes, Alicia; Maxwell, Evan; Barnard, Leland; Lopez, Alexander; Penn, John; Habegger, Lukas; Blumenfeld, Andrew L; Bai, Xiaodong; O'Keeffe, Sean; Yadav, Ashish; Praveen, Kavita; Jones, Marcus; Salerno, William J; Chung, Wendy K; Surakka, Ida; Willer, Cristen J; Hveem, Kristian; Leader, Joseph B; Carey, David J; Ledbetter, David H; Cardon, Lon; Yancopoulos, George D; Economides, Aris; Coppola, Giovanni; Shuldiner, Alan R; Balasubramanian, Suganthi; Cantor, Michael; Nelson, Matthew R; Whittaker, John; Reid, Jeffrey G; Marchini, Jonathan; Overton, John D; Scott, Robert A; Abecasis, Gonçalo R; Yerges-Armstrong, Laura.

Nature ; 586(7831): 749-756, 2020 10.

Article in English | MEDLINE | ID: mdl-33087929

ABSTRACT

The UK Biobank is a prospective study of 502,543 individuals, combining extensive phenotypic and genotypic data with streamlined access for researchers around the world (ref. 1). Here we describe the release of exome-sequence data for the first 49,960 study participants, revealing approximately 4 million coding variants (of which around 98.6% have a frequency of less than 1%). The data include 198,269 autosomal predicted loss-of-function (LOF) variants, a more than 14-fold increase compared to the imputed sequence. Nearly all genes (more than 97%) had at least one carrier with a LOF variant, and most genes (more than 69%) had at least ten carriers with a LOF variant. We illustrate the power of characterizing LOF variants in this population through association analyses across 1,730 phenotypes. In addition to replicating established associations, we found novel LOF variants with large effects on disease traits, including PIEZO1 on varicose veins, COL6A1 on corneal resistance, MEPE on bone density, and IQGAP2 and GMPR on blood cell traits. We further demonstrate the value of exome sequencing by surveying the prevalence of pathogenic variants of clinical importance, and show that 2% of this population has a medically actionable variant. Furthermore, we characterize the penetrance of cancer in carriers of pathogenic BRCA1 and BRCA2 variants. Exome sequences from the first 49,960 participants highlight the promise of genome sequencing in large population-based studies and are now accessible to the scientific community.

Subject(s)

Databases, Genetic , Exome Sequencing , Exome/genetics , Loss of Function Mutation/genetics , Phenotype , Aged , Bone Density/genetics , Collagen Type VI/genetics , Demography , Female , Genes, BRCA1 , Genes, BRCA2 , Genotype , Humans , Ion Channels/genetics , Male , Middle Aged , Neoplasms/genetics , Penetrance , Peptide Fragments/genetics , United Kingdom , Varicose Veins/genetics , ras GTPase-Activating Proteins/genetics

11.

Identification of Neuropsychiatric Copy Number Variants in a Health Care System Population.

Martin, Christa Lese; Wain, Karen E; Oetjens, Matthew T; Tolwinski, Kasia; Palen, Emily; Hare-Harris, Abby; Habegger, Lukas; Maxwell, Evan K; Reid, Jeffrey G; Walsh, Lauren Kasparson; Myers, Scott M; Ledbetter, David H.

JAMA Psychiatry ; 77(12): 1276-1285, 2020 12 01.

Article in English | MEDLINE | ID: mdl-32697297

ABSTRACT

Importance: Population screening for medically relevant genomic variants that cause diseases such as hereditary cancer and cardiovascular disorders is increasing to facilitate early disease detection or prevention. Neuropsychiatric disorders (NPDs) are common, complex disorders with clear genetic causes; yet, access to genetic diagnosis is limited. We explored whether inclusion of NPD in population-based genomic screening programs is warranted by assessing 3 key factors: prevalence, penetrance, and personal utility. Objective: To evaluate the suitability of including pathogenic copy number variants (CNVs) associated with NPD in population screening by determining their prevalence and penetrance and exploring the personal utility of disclosing results. Design, Setting, and Participants: In this cohort study, the frequency of 31 NPD CNVs was determined in patient-participants via exome data. Associated clinical phenotypes were assessed using linked electronic health records. Nine CNVs were selected for disclosure by licensed genetic counselors, and participants' psychosocial reactions were evaluated using a mixed-methods approach. A primarily adult population receiving medical care at Geisinger, a large integrated health care system in the United States with the only population-based genomic screening program approved for medically relevant results disclosure, was included. The cohort was identified from the Geisinger MyCode Community Health Initiative. Exome and linked electronic health record data were available for this cohort, which was recruited from February 2007 to April 2017. Data were collected for the qualitative analysis April 2017 through February 2018. Analysis began February 2018 and ended December 2019. 
Main Outcomes and Measures: The planned outcomes of this study include (1) prevalence estimate of NPD-associated CNVs in an unselected health care system population; (2) penetrance estimate of NPD diagnoses in CNV-positive individuals; and (3) qualitative themes that describe participants' responses to receiving NPD-associated genomic results. Results: Of 90 595 participants with CNV data, a pathogenic CNV was identified in 708 (0.8%; 436 women [61.6%]; mean [SD] age, 50.04 [18.74] years). Seventy percent (n = 494) had at least 1 associated clinical symptom. Of these, 28.8% (204) of CNV-positive individuals had an NPD code in their electronic health record, compared with 13.3% (11 835 of 89 887) of CNV-negative individuals (odds ratio, 2.21; 95% CI, 1.86-2.61; P < .001); 66.4% (470) of CNV-positive individuals had a history of depression and anxiety compared with 54.6% (49 118 of 89 887) of CNV-negative individuals (odds ratio, 1.53; 95% CI, 1.31-1.80; P < .001). 16p13.11 (71 [0.078%]) and 22q11.2 (108 [0.119%]) were the most prevalent deletions and duplications, respectively. Only 5.8% of individuals (41 of 708) had a previously known genetic diagnosis. Results disclosure was completed for 141 individuals. Positive participant responses included poignant reactions to learning a medical reason for lifelong cognitive and psychiatric disabilities. Conclusions and Relevance: This study informs critical factors central to the development of population-based genomic screening programs and supports the inclusion of NPD in future designs to promote equitable access to clinically useful genomic information.
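
As a reading aid, the crude percentages quoted in the Results above can be recomputed from the counts given in the abstract. The following is a minimal sketch in Python; the variable and function names are illustrative assumptions and do not come from the study. The odds ratios are not recomputed, because the abstract does not say how they were estimated or whether they were adjusted.

```python
# Crude proportions recomputed from counts quoted in the abstract above.
# All names here are illustrative; none are taken from the study itself.
TOTAL_WITH_CNV_DATA = 90_595
CNV_POSITIVE = 708
CNV_NEGATIVE = TOTAL_WITH_CNV_DATA - CNV_POSITIVE  # matches the 89 887 quoted


def percent(count: int, total: int, digits: int = 1) -> float:
    """Return count/total as a percentage rounded to `digits` decimals."""
    return round(100 * count / total, digits)


# Each entry maps a label to (numerator, denominator) as quoted in the abstract.
reported_counts = {
    "pathogenic CNV prevalence": (CNV_POSITIVE, TOTAL_WITH_CNV_DATA),
    "women among CNV-positive": (436, CNV_POSITIVE),
    "NPD code, CNV-positive": (204, CNV_POSITIVE),
    "depression/anxiety, CNV-positive": (470, CNV_POSITIVE),
    "depression/anxiety, CNV-negative": (49_118, CNV_NEGATIVE),
    "previously known genetic diagnosis": (41, CNV_POSITIVE),
}

crude_percentages = {
    label: percent(numerator, denominator)
    for label, (numerator, denominator) in reported_counts.items()
}

if __name__ == "__main__":
    for label, value in crude_percentages.items():
        print(f"{label}: {value}%")
```

The per-variant frequencies can be checked the same way with three decimals, for example `percent(71, TOTAL_WITH_CNV_DATA, 3)` for the 16p13.11 count.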

Subject(s)

DNA Copy Number Variations/genetics , Delivery of Health Care, Integrated , Genetic Testing , Mass Screening , Mental Disorders/genetics , Neurocognitive Disorders/genetics , Patient Satisfaction , Penetrance , Adult , Cohort Studies , Electronic Health Records , Female , Humans , Male , Mass Screening/standards , Mental Disorders/epidemiology , Middle Aged , Neurocognitive Disorders/epidemiology , Pennsylvania/epidemiology , Prevalence , Exome Sequencing

12.

Genetic inactivation of ANGPTL4 improves glucose homeostasis and is associated with reduced risk of diabetes.

Gusarova, Viktoria; O'Dushlaine, Colm; Teslovich, Tanya M; Benotti, Peter N; Mirshahi, Tooraj; Gottesman, Omri; Van Hout, Cristopher V; Murray, Michael F; Mahajan, Anubha; Nielsen, Jonas B; Fritsche, Lars; Wulff, Anders Berg; Gudbjartsson, Daniel F; Sjögren, Marketa; Emdin, Connor A; Scott, Robert A; Lee, Wen-Jane; Small, Aeron; Kwee, Lydia C; Dwivedi, Om Prakash; Prasad, Rashmi B; Bruse, Shannon; Lopez, Alexander E; Penn, John; Marcketta, Anthony; Leader, Joseph B; Still, Christopher D; Kirchner, H Lester; Mirshahi, Uyenlinh L; Wardeh, Amr H; Hartle, Cassandra M; Habegger, Lukas; Fetterolf, Samantha N; Tusie-Luna, Teresa; Morris, Andrew P; Holm, Hilma; Steinthorsdottir, Valgerdur; Sulem, Patrick; Thorsteinsdottir, Unnur; Rotter, Jerome I; Chuang, Lee-Ming; Damrauer, Scott; Birtwell, David; Brummett, Chad M; Khera, Amit V; Natarajan, Pradeep; Orho-Melander, Marju; Flannick, Jason; Lotta, Luca A; Willer, Cristen J.

Nat Commun ; 9(1): 2252, 2018 06 13.

Article in English | MEDLINE | ID: mdl-29899519

ABSTRACT

Angiopoietin-like 4 (ANGPTL4) is an endogenous inhibitor of lipoprotein lipase that modulates lipid levels, coronary atherosclerosis risk, and nutrient partitioning. We hypothesize that loss of ANGPTL4 function might improve glucose homeostasis and decrease risk of type 2 diabetes (T2D). We investigate protein-altering variants in ANGPTL4 among 58,124 participants in the DiscovEHR human genetics study, with follow-up studies in 82,766 T2D cases and 498,761 controls. Carriers of p.E40K, a variant that abolishes ANGPTL4 ability to inhibit lipoprotein lipase, have lower odds of T2D (odds ratio 0.89, 95% confidence interval 0.85-0.92, p = 6.3 × 10⁻¹⁰), lower fasting glucose, and greater insulin sensitivity. Predicted loss-of-function variants are associated with lower odds of T2D among 32,015 cases and 84,006 controls (odds ratio 0.71, 95% confidence interval 0.49-0.99, p = 0.041). Functional studies in Angptl4-deficient mice confirm improved insulin sensitivity and glucose homeostasis. In conclusion, genetic inactivation of ANGPTL4 is associated with improved glucose homeostasis and reduced risk of T2D.

Subject(s)

Angiopoietin-Like Protein 4/deficiency , Angiopoietin-Like Protein 4/genetics , Diabetes Mellitus, Type 2/genetics , Diabetes Mellitus, Type 2/metabolism , Amino Acid Substitution , Angiopoietin-Like Protein 4/metabolism , Animals , Blood Glucose/metabolism , Case-Control Studies , Diabetes Mellitus, Type 2/etiology , Female , Gene Silencing , Genetic Association Studies , Genetic Variation , Heterozygote , Homeostasis , Humans , Insulin Resistance/genetics , Lipoprotein Lipase/metabolism , Male , Mice , Mice, Inbred C57BL , Mice, Knockout , Risk Factors , Exome Sequencing

13.

Profiling and Leveraging Relatedness in a Precision Medicine Cohort of 92,455 Exomes.

Staples, Jeffrey; Maxwell, Evan K; Gosalia, Nehal; Gonzaga-Jauregui, Claudia; Snyder, Christopher; Hawes, Alicia; Penn, John; Ulloa, Ricardo; Bai, Xiaodong; Lopez, Alexander E; Van Hout, Cristopher V; O'Dushlaine, Colm; Teslovich, Tanya M; McCarthy, Shane E; Balasubramanian, Suganthi; Kirchner, H Lester; Leader, Joseph B; Murray, Michael F; Ledbetter, David H; Shuldiner, Alan R; Yancoupolos, George D; Dewey, Frederick E; Carey, David J; Overton, John D; Baras, Aris; Habegger, Lukas; Reid, Jeffrey G.

Am J Hum Genet ; 102(5): 874-889, 2018 05 03.

Article in English | MEDLINE | ID: mdl-29727688

ABSTRACT

Large-scale human genetics studies are ascertaining increasing proportions of populations as they continue growing in both number and scale. As a result, the amount of cryptic relatedness within these study cohorts is growing rapidly and has significant implications on downstream analyses. We demonstrate this growth empirically among the first 92,455 exomes from the DiscovEHR cohort and, via a custom simulation framework we developed called SimProgeny, show that these measures are in line with expectations given the underlying population and ascertainment approach. For example, within DiscovEHR we identified ~66,000 close (first- and second-degree) relationships, involving 55.6% of study participants. Our simulation results project that >70% of the cohort will be involved in these close relationships, given that DiscovEHR scales to 250,000 recruited individuals. We reconstructed 12,574 pedigrees by using these relationships (including 2,192 nuclear families) and leveraged them for multiple applications. The pedigrees substantially improved the phasing accuracy of 20,947 rare, deleterious compound heterozygous mutations. Reconstructed nuclear families were critical for identifying 3,415 de novo mutations in ~1,783 genes. Finally, we demonstrate the segregation of known and suspected disease-causing mutations, including a tandem duplication that occurs in LDLR and causes familial hypercholesterolemia, through reconstructed pedigrees. In summary, this work highlights the prevalence of cryptic relatedness expected among large healthcare population-genomic studies and demonstrates several analyses that are uniquely enabled by large amounts of cryptic relatedness.

Subject(s)

Exome/genetics , Precision Medicine , Cohort Studies , Computer Simulation , Electronic Health Records , Exons/genetics , Family , Female , Genetics, Population , Geography , Heterozygote , Humans , Male , Mutation/genetics , Pedigree , Phenotype , Reproducibility of Results

14.

Genetic and Pharmacologic Inactivation of ANGPTL3 and Cardiovascular Disease.

Dewey, Frederick E; Gusarova, Viktoria; Dunbar, Richard L; O'Dushlaine, Colm; Schurmann, Claudia; Gottesman, Omri; McCarthy, Shane; Van Hout, Cristopher V; Bruse, Shannon; Dansky, Hayes M; Leader, Joseph B; Murray, Michael F; Ritchie, Marylyn D; Kirchner, H Lester; Habegger, Lukas; Lopez, Alex; Penn, John; Zhao, An; Shao, Weiping; Stahl, Neil; Murphy, Andrew J; Hamon, Sara; Bouzelmat, Aurelie; Zhang, Rick; Shumel, Brad; Pordy, Robert; Gipe, Daniel; Herman, Gary A; Sheu, Wayne H H; Lee, I-Te; Liang, Kae-Woei; Guo, Xiuqing; Rotter, Jerome I; Chen, Yii-Der I; Kraus, William E; Shah, Svati H; Damrauer, Scott; Small, Aeron; Rader, Daniel J; Wulff, Anders Berg; Nordestgaard, Børge G; Tybjærg-Hansen, Anne; van den Hoek, Anita M; Princen, Hans M G; Ledbetter, David H; Carey, David J; Overton, John D; Reid, Jeffrey G; Sasiela, William J; Banerjee, Poulabi.

N Engl J Med ; 377(3): 211-221, 2017 07 20.

Article in English | MEDLINE | ID: mdl-28538136

ABSTRACT

BACKGROUND: Loss-of-function variants in the angiopoietin-like 3 gene (ANGPTL3) have been associated with decreased plasma levels of triglycerides, low-density lipoprotein (LDL) cholesterol, and high-density lipoprotein (HDL) cholesterol. It is not known whether such variants or therapeutic antagonism of ANGPTL3 are associated with a reduced risk of atherosclerotic cardiovascular disease. METHODS: We sequenced the exons of ANGPTL3 in 58,335 participants in the DiscovEHR human genetics study. We performed tests of association for loss-of-function variants in ANGPTL3 with lipid levels and with coronary artery disease in 13,102 case patients and 40,430 controls from the DiscovEHR study, with follow-up studies involving 23,317 case patients and 107,166 controls from four population studies. We also tested the effects of a human monoclonal antibody, evinacumab, against Angptl3 in dyslipidemic mice and against ANGPTL3 in healthy human volunteers with elevated levels of triglycerides or LDL cholesterol. RESULTS: In the DiscovEHR study, participants with heterozygous loss-of-function variants in ANGPTL3 had significantly lower serum levels of triglycerides, HDL cholesterol, and LDL cholesterol than participants without these variants. Loss-of-function variants were found in 0.33% of case patients with coronary artery disease and in 0.45% of controls (adjusted odds ratio, 0.59; 95% confidence interval, 0.41 to 0.85; P=0.004). These results were confirmed in the follow-up studies. In dyslipidemic mice, inhibition of Angptl3 with evinacumab resulted in a greater decrease in atherosclerotic lesion area and necrotic content than a control antibody. In humans, evinacumab caused a dose-dependent placebo-adjusted reduction in fasting triglyceride levels of up to 76% and LDL cholesterol levels of up to 23%. 
CONCLUSIONS: Genetic and therapeutic antagonism of ANGPTL3 in humans and of Angptl3 in mice was associated with decreased levels of all three major lipid fractions and decreased odds of atherosclerotic cardiovascular disease. (Funded by Regeneron Pharmaceuticals and others; ClinicalTrials.gov number, NCT01749878.)

Subject(s)

Angiopoietins/antagonists & inhibitors , Antibodies, Monoclonal/administration & dosage , Atherosclerosis/drug therapy , Coronary Artery Disease/genetics , Dyslipidemias/drug therapy , Lipids/blood , Mutation , Aged , Angiopoietin-Like Protein 3 , Angiopoietin-like Proteins , Angiopoietins/genetics , Animals , Antibodies, Monoclonal/adverse effects , Antibodies, Monoclonal/pharmacology , Atherosclerosis/metabolism , Cardiovascular Diseases/prevention & control , Coronary Artery Disease/metabolism , Disease Models, Animal , Dose-Response Relationship, Drug , Double-Blind Method , Dyslipidemias/blood , Female , Humans , Lipid Metabolism/drug effects , Male , Mice , Mice, Inbred Strains , Middle Aged

15.

Distribution and clinical impact of functional variants in 50,726 whole-exome sequences from the DiscovEHR study.

Dewey, Frederick E; Murray, Michael F; Overton, John D; Habegger, Lukas; Leader, Joseph B; Fetterolf, Samantha N; O'Dushlaine, Colm; Van Hout, Cristopher V; Staples, Jeffrey; Gonzaga-Jauregui, Claudia; Metpally, Raghu; Pendergrass, Sarah A; Giovanni, Monica A; Kirchner, H Lester; Balasubramanian, Suganthi; Abul-Husn, Noura S; Hartzel, Dustin N; Lavage, Daniel R; Kost, Korey A; Packer, Jonathan S; Lopez, Alexander E; Penn, John; Mukherjee, Semanti; Gosalia, Nehal; Kanagaraj, Manoj; Li, Alexander H; Mitnaul, Lyndon J; Adams, Lance J; Person, Thomas N; Praveen, Kavita; Marcketta, Anthony; Lebo, Matthew S; Austin-Tse, Christina A; Mason-Suares, Heather M; Bruse, Shannon; Mellis, Scott; Phillips, Robert; Stahl, Neil; Murphy, Andrew; Economides, Aris; Skelding, Kimberly A; Still, Christopher D; Elmore, James R; Borecki, Ingrid B; Yancopoulos, George D; Davis, F Daniel; Faucett, William A; Gottesman, Omri; Ritchie, Marylyn D; Shuldiner, Alan R.

Science ; 354(6319), 2016 Dec 23.

Article in English | MEDLINE | ID: mdl-28008009

ABSTRACT

The DiscovEHR collaboration between the Regeneron Genetics Center and Geisinger Health System couples high-throughput sequencing to an integrated health care system using longitudinal electronic health records (EHRs). We sequenced the exomes of 50,726 adult participants in the DiscovEHR study to identify ~4.2 million rare single-nucleotide variants and insertion/deletion events, of which ~176,000 are predicted to result in a loss of gene function. Linking these data to EHR-derived clinical phenotypes, we find clinical associations supporting therapeutic targets, including genes encoding drug targets for lipid lowering, and identify previously unidentified rare alleles associated with lipid levels and other blood level traits. About 3.5% of individuals harbor deleterious variants in 76 clinically actionable genes. The DiscovEHR data set provides a blueprint for large-scale precision medicine initiatives and genomics-guided therapeutic discovery.

Subject(s)

Delivery of Health Care, Integrated , Disease/genetics , Electronic Health Records , Exome/genetics , High-Throughput Nucleotide Sequencing , Adult , Drug Design , Gene Frequency , Genomics , Humans , Hypolipidemic Agents/pharmacology , INDEL Mutation , Lipids/blood , Molecular Targeted Therapy , Polymorphism, Single Nucleotide , Sequence Analysis, DNA

16.

Inactivating Variants in ANGPTL4 and Risk of Coronary Artery Disease.

Dewey, Frederick E; Gusarova, Viktoria; O'Dushlaine, Colm; Gottesman, Omri; Trejos, Jesus; Hunt, Charleen; Van Hout, Cristopher V; Habegger, Lukas; Buckler, David; Lai, Ka-Man V; Leader, Joseph B; Murray, Michael F; Ritchie, Marylyn D; Kirchner, H Lester; Ledbetter, David H; Penn, John; Lopez, Alexander; Borecki, Ingrid B; Overton, John D; Reid, Jeffrey G; Carey, David J; Murphy, Andrew J; Yancopoulos, George D; Baras, Aris; Gromada, Jesper; Shuldiner, Alan R.

N Engl J Med ; 374(12): 1123-33, 2016 Mar 24.

Article in English | MEDLINE | ID: mdl-26933753

ABSTRACT

BACKGROUND: Higher-than-normal levels of circulating triglycerides are a risk factor for ischemic cardiovascular disease. Activation of lipoprotein lipase, an enzyme that is inhibited by angiopoietin-like 4 (ANGPTL4), has been shown to reduce levels of circulating triglycerides. METHODS: We sequenced the exons of ANGPTL4 in samples obtained from 42,930 participants of predominantly European ancestry in the DiscovEHR human genetics study. We performed tests of association between lipid levels and the missense E40K variant (which has been associated with reduced plasma triglyceride levels) and other inactivating mutations. We then tested for associations between coronary artery disease and the E40K variant and other inactivating mutations in 10,552 participants with coronary artery disease and 29,223 controls. We also tested the effect of a human monoclonal antibody against ANGPTL4 on lipid levels in mice and monkeys. RESULTS: We identified 1661 heterozygotes and 17 homozygotes for the E40K variant and 75 participants who had 13 other monoallelic inactivating mutations in ANGPTL4. The levels of triglycerides were 13% lower and the levels of high-density lipoprotein (HDL) cholesterol were 7% higher among carriers of the E40K variant than among noncarriers. Carriers of the E40K variant were also significantly less likely than noncarriers to have coronary artery disease (odds ratio, 0.81; 95% confidence interval, 0.70 to 0.92; P=0.002). K40 homozygotes had markedly lower levels of triglycerides and higher levels of HDL cholesterol than did heterozygotes. Carriers of other inactivating mutations also had lower triglyceride levels and higher HDL cholesterol levels and were less likely to have coronary artery disease than were noncarriers. Monoclonal antibody inhibition of Angptl4 in mice and monkeys reduced triglyceride levels.
CONCLUSIONS: Carriers of E40K and other inactivating mutations in ANGPTL4 had lower levels of triglycerides and a lower risk of coronary artery disease than did noncarriers. The inhibition of Angptl4 in mice and monkeys also resulted in corresponding reductions in these values. (Funded by Regeneron Pharmaceuticals.)

Subject(s)

Angiopoietins/genetics , Coronary Artery Disease/genetics , Gene Silencing , Mutation , Aged , Angiopoietin-Like Protein 4 , Angiopoietins/antagonists & inhibitors , Animals , Cholesterol/blood , Disease Models, Animal , Female , Heterozygote , Humans , Macaca mulatta , Male , Mice , Middle Aged , Risk Factors , Triglycerides/blood

17.

CLAMMS: a scalable algorithm for calling common and rare copy number variants from exome sequencing data.

Packer, Jonathan S; Maxwell, Evan K; O'Dushlaine, Colm; Lopez, Alexander E; Dewey, Frederick E; Chernomorsky, Rostislav; Baras, Aris; Overton, John D; Habegger, Lukas; Reid, Jeffrey G.

Bioinformatics ; 32(1): 133-5, 2016 Jan 01.

Article in English | MEDLINE | ID: mdl-26382196

ABSTRACT

MOTIVATION: Several algorithms exist for detecting copy number variants (CNVs) from human exome sequencing read depth, but previous tools have not been well suited for large population studies on the order of tens or hundreds of thousands of exomes. Their limitations include being difficult to integrate into automated variant-calling pipelines and being ill-suited for detecting common variants. To address these issues, we developed a new algorithm--Copy number estimation using Lattice-Aligned Mixture Models (CLAMMS)--which is highly scalable and suitable for detecting CNVs across the whole allele frequency spectrum. RESULTS: In this note, we summarize the methods and intended use-case of CLAMMS, compare it to previous algorithms and briefly describe results of validation experiments. We evaluate the adherence of CNV calls from CLAMMS and four other algorithms to Mendelian inheritance patterns on a pedigree; we compare calls from CLAMMS and other algorithms to calls from SNP genotyping arrays for a set of 3164 samples; and we use TaqMan quantitative polymerase chain reaction to validate CNVs predicted by CLAMMS at 39 loci (95% of rare variants validate; across 19 common variant loci, the mean precision and recall are 99% and 94%, respectively). In the Supplementary Materials (available at the CLAMMS Github repository), we present our methods and validation results in greater detail. AVAILABILITY AND IMPLEMENTATION: https://github.com/rgcgithub/clamms (implemented in C). CONTACT: jeffrey.reid@regeneron.com SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Algorithms , DNA Copy Number Variations/genetics , Exome/genetics , Sequence Analysis, DNA/methods , Humans , Markov Chains , Reproducibility of Results

18.

Analysis of variable retroduplications in human populations suggests coupling of retrotransposition to cell division.

Abyzov, Alexej; Iskow, Rebecca; Gokcumen, Omer; Radke, David W; Balasubramanian, Suganthi; Pei, Baikang; Habegger, Lukas; Lee, Charles; Gerstein, Mark.

Genome Res ; 23(12): 2042-52, 2013 Dec.

Article in English | MEDLINE | ID: mdl-24026178

ABSTRACT

In primates and other animals, reverse transcription of mRNA followed by genomic integration creates retroduplications. Expressed retroduplications are either "retrogenes" coding for functioning proteins, or expressed "processed pseudogenes," which can function as noncoding RNAs. To date, little is known about the variation in retroduplications in terms of their presence or absence across individuals in the human population. We have developed new methodologies that allow us to identify "novel" retroduplications (i.e., those not present in the reference genome), to find their insertion points, and to genotype them. Using these methods, we catalogued and analyzed 174 retroduplication variants in almost one thousand humans, which were sequenced as part of Phase 1 of The 1000 Genomes Project Consortium. The accuracy of our data set was corroborated by (1) multiple lines of sequencing evidence for retroduplication (e.g., depth of coverage in exons vs. introns), (2) experimental validation, and (3) the fact that we can reconstruct a correct phylogenetic tree of human subpopulations based solely on retroduplications. We also show that parent genes of retroduplication variants tend to be expressed at the M-to-G1 transition in the cell cycle and that M-to-G1 expressed genes have more copies of fixed retroduplications than genes expressed at other times. These findings suggest that cell division is coupled to retrotransposition and, perhaps, is even a requirement for it.

Subject(s)

Cell Division/genetics , Gene Duplication , Retroelements/genetics , Computational Biology/methods , Evolution, Molecular , Genome, Human , Genotype , Humans , Phylogeny , Pseudogenes , Reproducibility of Results , Sequence Analysis, DNA

19.

Accurate identification and analysis of human mRNA isoforms using deep long read sequencing.

Tilgner, Hagen; Raha, Debasish; Habegger, Lukas; Mohiuddin, Mohammed; Gerstein, Mark; Snyder, Michael.

G3 (Bethesda) ; 3(3): 387-97, 2013 Mar.

Article in English | MEDLINE | ID: mdl-23450794

ABSTRACT

Precise identification of RNA-coding regions and transcriptomes of eukaryotes is a significant problem in biology. Currently, eukaryote transcriptomes are analyzed using deep short-read sequencing experiments of complementary DNAs. The resulting short-reads are then aligned against a genome and annotated junctions to infer biological meaning. Here we use long-read complementary DNA datasets for the analysis of a eukaryotic transcriptome and generate two large datasets in the human K562 and HeLa S3 cell lines. Both data sets comprised at least 4 million reads and had median read lengths greater than 500 bp. We show that annotation-independent alignments of these reads provide partial gene structures that are very much in-line with annotated gene structures, 15% of which have not been obtained in a previous de novo analysis of short reads. For long-noncoding RNAs (i.e., lncRNA) genes, however, we find an increased fraction of novel gene structures among our alignments. Other important aspects of transcriptome analysis, such as the description of cell type-specific splicing, can be performed in an accurate, reliable and completely annotation-free manner, making it ideal for the analysis of transcriptomes of newly sequenced genomes. Furthermore, we demonstrate that long read sequence can be assembled into full-length transcripts with considerable success. Our method is applicable to all long read sequencing technologies.

Subject(s)

Gene Expression Profiling/methods , RNA Isoforms/analysis , RNA, Long Noncoding/analysis , Alternative Splicing , Chromosome Mapping , DNA, Complementary/analysis , DNA, Complementary/genetics , Exons , Gene Library , Genome, Human , HeLa Cells , Humans , Introns , K562 Cells , Molecular Sequence Annotation , RNA Isoforms/genetics , RNA, Long Noncoding/genetics , Reproducibility of Results , Ribosomes/genetics , Sensitivity and Specificity , Sequence Analysis, RNA , Transcriptome

20.

The GENCODE pseudogene resource.

Pei, Baikang; Sisu, Cristina; Frankish, Adam; Howald, Cédric; Habegger, Lukas; Mu, Xinmeng Jasmine; Harte, Rachel; Balasubramanian, Suganthi; Tanzer, Andrea; Diekhans, Mark; Reymond, Alexandre; Hubbard, Tim J; Harrow, Jennifer; Gerstein, Mark B.

Genome Biol ; 13(9): R51, 2012 Sep 26.

Article in English | MEDLINE | ID: mdl-22951037

ABSTRACT

BACKGROUND: Pseudogenes have long been considered as nonfunctional genomic sequences. However, recent evidence suggests that many of them might have some form of biological activity, and the possibility of functionality has increased interest in their accurate annotation and integration with functional genomics data. RESULTS: As part of the GENCODE annotation of the human genome, we present the first genome-wide pseudogene assignment for protein-coding genes, based on both large-scale manual annotation and in silico pipelines. A key aspect of this coupled approach is that it allows us to identify pseudogenes in an unbiased fashion as well as untangle complex events through manual evaluation. We integrate the pseudogene annotations with the extensive ENCODE functional genomics information. In particular, we determine the expression level, transcription-factor and RNA polymerase II binding, and chromatin marks associated with each pseudogene. Based on their distribution, we develop simple statistical models for each type of activity, which we validate with large-scale RT-PCR-Seq experiments. Finally, we compare our pseudogenes with conservation and variation data from primate alignments and the 1000 Genomes project, producing lists of pseudogenes potentially under selection. CONCLUSIONS: At one extreme, some pseudogenes possess conventional characteristics of functionality; these may represent genes that have recently died. On the other hand, we find interesting patterns of partial activity, which may suggest that dead genes are being resurrected as functioning non-coding RNAs. The activity data of each pseudogene are stored in an associated resource, psiDR, which will be useful for the initial identification of potentially functional pseudogenes.

Subject(s)

Genome, Human , Pseudogenes , Transcription, Genetic , Animals , Binding Sites , Chromatin/chemistry , Chromatin/genetics , Humans , Models, Genetic , Models, Statistical , Molecular Sequence Annotation , Phylogeny , Primates , RNA Polymerase II/metabolism , Regulatory Sequences, Nucleic Acid , Selection, Genetic , Sequence Analysis, DNA , Transcription Factors/metabolism
