RESUMEN
SARS-CoV-2 has infected over 340 million people, prompting therapeutic research. While genetic studies can highlight potential drug targets, understanding the heritability of SARS-CoV-2 susceptibility and COVID-19 severity can contextualize their results. To date, loci from meta-analyses explain 1.2% and 5.8% of variation in susceptibility and severity respectively. Here we estimate the importance of shared environment and additive genetic variation to SARS-CoV-2 susceptibility and COVID-19 severity using pedigree data, PCR results, and hospitalization information. The relative importance of genetics and shared environment for susceptibility shifted during the study, with heritability ranging from 33% (95% CI: 20%-46%) to 70% (95% CI: 63%-74%). Heritability was greater for days hospitalized with COVID-19 (41%, 95% CI: 33%-57%) compared to shared environment (33%, 95% CI: 24%-38%). While our estimates suggest these genetic architectures are not fully understood, the shift in susceptibility estimates highlights the challenge of estimation during a pandemic, given environmental fluctuations and vaccine introduction.
Asunto(s)
COVID-19 , SARS-CoV-2 , Humanos , SARS-CoV-2/genética , COVID-19/genética , Sistemas de Liberación de Medicamentos , Hospitalización , PandemiasRESUMEN
Important tasks in biomedical discovery such as predicting gene functions, gene-disease associations, and drug repurposing opportunities are often framed as network edge prediction. The number of edges connecting to a node, termed degree, can vary greatly across nodes in real biomedical networks, and the distribution of degrees varies between networks. If degree strongly influences edge prediction, then imbalance or bias in the distribution of degrees could lead to nonspecific or misleading predictions. We introduce a network permutation framework to quantify the effects of node degree on edge prediction. Our framework decomposes performance into the proportions attributable to degree and the network's specific connections using network permutation to generate features that depend only on degree. We discover that performance attributable to factors other than degree is often only a small portion of overall performance. Researchers seeking to predict new or missing edges in biological networks should use our permutation approach to obtain a baseline for performance that may be nonspecific because of degree. We released our methods as an open-source Python package (https://github.com/hetio/xswap/).
Asunto(s)
Algoritmos , ProbabilidadRESUMEN
A major obstacle hindering the broad adoption of polygenic scores (PGS) is their lack of "portability" to people that differ-in genetic ancestry or other characteristics-from the GWAS samples in which genetic effects were estimated. Here, we use the UK Biobank to measure the change in PGS prediction accuracy as a continuous function of individuals' genome-wide genetic dissimilarity to the GWAS sample ("genetic distance"). Our results highlight three gaps in our understanding of PGS portability. First, prediction accuracy is extremely noisy at the individual level and not well predicted by genetic distance. In fact, variance in prediction accuracy is explained comparably well by socioeconomic measures. Second, trends of portability vary across traits. For several immunity-related traits, prediction accuracy drops near zero quickly even at intermediate levels of genetic distance. This quick drop may reflect GWAS associations being more ancestry-specific in immunity-related traits than in other traits. Third, we show that even qualitative trends of portability can depend on the measure of prediction accuracy used. For instance, for white blood cell count, a measure of prediction accuracy at the individual level (reduction in mean squared error) increases with genetic distance. Together, our results show that portability cannot be understood through global ancestry groupings alone. There are other, understudied factors influencing portability, such as the specifics of the evolution of the trait and its genetic architecture, social context, and the construction of the polygenic score. Addressing these gaps can aid in the development and application of PGS and inform more equitable genomic research.
RESUMEN
Important tasks in biomedical discovery such as predicting gene functions, gene-disease associations, and drug repurposing opportunities are often framed as network edge prediction. The number of edges connecting to a node, termed degree, can vary greatly across nodes in real biomedical networks, and the distribution of degrees varies between networks. If degree strongly influences edge prediction, then imbalance or bias in the distribution of degrees could lead to nonspecific or misleading predictions. We introduce a network permutation framework to quantify the effects of node degree on edge prediction. Our framework decomposes performance into the proportions attributable to degree and the network's specific connections. We discover that performance attributable to factors other than degree is often only a small portion of overall performance. Degree's predictive performance diminishes when the networks used for training and testing-despite measuring the same biological relationships-were generated using distinct techniques and hence have large differences in degree distribution. We introduce the permutation-derived edge prior as the probability that an edge exists based only on degree. The edge prior shows excellent discrimination and calibration for 20 biomedical networks (16 bipartite, 3 undirected, 1 directed), with AUROCs frequently exceeding 0.85. Researchers seeking to predict new or missing edges in biological networks should use the edge prior as a baseline to identify the fraction of performance that is nonspecific because of degree. We released our methods as an open-source Python package (https://github.com/hetio/xswap/).
RESUMEN
Hetnets, short for "heterogeneous networks", contain multiple node and relationship types and offer a way to encode biomedical knowledge. One such example, Hetionet connects 11 types of nodes - including genes, diseases, drugs, pathways, and anatomical structures - with over 2 million edges of 24 types. Previous work has demonstrated that supervised machine learning methods applied to such networks can identify drug repurposing opportunities. However, a training set of known relationships does not exist for many types of node pairs, even when it would be useful to examine how nodes of those types are meaningfully connected. For example, users may be curious not only how metformin is related to breast cancer, but also how the GJA1 gene might be involved in insomnia. We developed a new procedure, termed hetnet connectivity search, that proposes important paths between any two nodes without requiring a supervised gold standard. The algorithm behind connectivity search identifies types of paths that occur more frequently than would be expected by chance (based on node degree alone). We find that predictions are broadly similar to those from previously described supervised approaches for certain node type pairs. Scoring of individual paths is based on the most specific paths of a given type. Several optimizations were required to precompute significant instances of node connectivity at the scale of large knowledge graphs. We implemented the method on Hetionet and provide an online interface at https://het.io/search . We provide an open source implementation of these methods in our new Python package named hetmatpy .
RESUMEN
BACKGROUND: Hetnets, short for "heterogeneous networks," contain multiple node and relationship types and offer a way to encode biomedical knowledge. One such example, Hetionet, connects 11 types of nodes-including genes, diseases, drugs, pathways, and anatomical structures-with over 2 million edges of 24 types. Previous work has demonstrated that supervised machine learning methods applied to such networks can identify drug repurposing opportunities. However, a training set of known relationships does not exist for many types of node pairs, even when it would be useful to examine how nodes of those types are meaningfully connected. For example, users may be curious about not only how metformin is related to breast cancer but also how a given gene might be involved in insomnia. FINDINGS: We developed a new procedure, termed hetnet connectivity search, that proposes important paths between any 2 nodes without requiring a supervised gold standard. The algorithm behind connectivity search identifies types of paths that occur more frequently than would be expected by chance (based on node degree alone). Several optimizations were required to precompute significant instances of node connectivity at the scale of large knowledge graphs. CONCLUSION: We implemented the method on Hetionet and provide an online interface at https://het.io/search. We provide an open-source implementation of these methods in our new Python package named hetmatpy.
Asunto(s)
Algoritmos , ProbabilidadRESUMEN
In less than nine months, the Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) killed over a million people, including >25,000 in New York City (NYC) alone. The COVID-19 pandemic caused by SARS-CoV-2 highlights clinical needs to detect infection, track strain evolution, and identify biomarkers of disease course. To address these challenges, we designed a fast (30-minute) colorimetric test (LAMP) for SARS-CoV-2 infection from naso/oropharyngeal swabs and a large-scale shotgun metatranscriptomics platform (total-RNA-seq) for host, viral, and microbial profiling. We applied these methods to clinical specimens gathered from 669 patients in New York City during the first two months of the outbreak, yielding a broad molecular portrait of the emerging COVID-19 disease. We find significant enrichment of a NYC-distinctive clade of the virus (20C), as well as host responses in interferon, ACE, hematological, and olfaction pathways. In addition, we use 50,821 patient records to find that renin-angiotensin-aldosterone system inhibitors have a protective effect for severe COVID-19 outcomes, unlike similar drugs. Finally, spatial transcriptomic data from COVID-19 patient autopsy tissues reveal distinct ACE2 expression loci, with macrophage and neutrophil infiltration in the lungs. These findings can inform public health and may help develop and drive SARS-CoV-2 diagnostic, prevention, and treatment strategies.
Asunto(s)
COVID-19/genética , COVID-19/virología , SARS-CoV-2/genética , Adulto , Anciano , Antagonistas de Receptores de Angiotensina/farmacología , Inhibidores de la Enzima Convertidora de Angiotensina/farmacología , Antivirales/farmacología , COVID-19/epidemiología , Prueba de Ácido Nucleico para COVID-19 , Interacciones Farmacológicas , Femenino , Perfilación de la Expresión Génica , Genoma Viral , Antígenos HLA/genética , Interacciones Microbiota-Huesped/efectos de los fármacos , Interacciones Microbiota-Huesped/genética , Humanos , Masculino , Persona de Mediana Edad , Técnicas de Diagnóstico Molecular , Ciudad de Nueva York/epidemiología , Técnicas de Amplificación de Ácido Nucleico , Pandemias , RNA-Seq , SARS-CoV-2/clasificación , SARS-CoV-2/efectos de los fármacos , Tratamiento Farmacológico de COVID-19RESUMEN
The rapid global spread of the novel coronavirus SARS-CoV-2 has strained healthcare and testing resources, making the identification and prioritization of individuals most at-risk a critical challenge. Recent evidence suggests blood type may affect risk of severe COVID-19. Here, we use observational healthcare data on 14,112 individuals tested for SARS-CoV-2 with known blood type in the New York Presbyterian (NYP) hospital system to assess the association between ABO and Rh blood types and infection, intubation, and death. We find slightly increased infection prevalence among non-O types. Risk of intubation was decreased among A and increased among AB and B types, compared with type O, while risk of death was increased for type AB and decreased for types A and B. We estimate Rh-negative blood type to have a protective effect for all three outcomes. Our results add to the growing body of evidence suggesting blood type may play a role in COVID-19.
Asunto(s)
Sistema del Grupo Sanguíneo ABO , Betacoronavirus , Infecciones por Coronavirus/sangre , Neumonía Viral/sangre , Sistema del Grupo Sanguíneo Rh-Hr , Adulto , COVID-19 , Prueba de COVID-19 , Técnicas de Laboratorio Clínico , Infecciones por Coronavirus/diagnóstico , Infecciones por Coronavirus/epidemiología , Infecciones por Coronavirus/mortalidad , Humanos , Intubación Intratraqueal , Pandemias , Neumonía Viral/diagnóstico , Neumonía Viral/epidemiología , Neumonía Viral/mortalidad , Prevalencia , Factores de Riesgo , SARS-CoV-2 , Análisis de SupervivenciaRESUMEN
The rapid global spread of the novel coronavirus SARS-CoV-2 has strained healthcare and testing resources, making the identification and prioritization of individuals most at-risk a critical challenge. Recent evidence suggests blood type may affect risk of severe COVID-19. We used observational healthcare data on 14,112 individuals tested for SARS-CoV-2 with known blood type in the New York Presbyterian (NYP) hospital system to assess the association between ABO and Rh blood types and infection, intubation, and death. We found slightly increased infection prevalence among non-O types. Risk of intubation was decreased among A and increased among AB and B types, compared with type O, while risk of death was increased for type AB and decreased for types A and B. We estimated Rh-negative blood type to have a protective effect for all three outcomes. Our results add to the growing body of evidence suggesting blood type may play a role in COVID-19.
RESUMEN
BACKGROUND: Unsupervised compression algorithms applied to gene expression data extract latent or hidden signals representing technical and biological sources of variation. However, these algorithms require a user to select a biologically appropriate latent space dimensionality. In practice, most researchers fit a single algorithm and latent dimensionality. We sought to determine the extent by which selecting only one fit limits the biological features captured in the latent representations and, consequently, limits what can be discovered with subsequent analyses. RESULTS: We compress gene expression data from three large datasets consisting of adult normal tissue, adult cancer tissue, and pediatric cancer tissue. We train many different models across a large range of latent space dimensionalities and observe various performance differences. We identify more curated pathway gene sets significantly associated with individual dimensions in denoising autoencoder and variational autoencoder models trained using an intermediate number of latent dimensionalities. Combining compressed features across algorithms and dimensionalities captures the most pathway-associated representations. When trained with different latent dimensionalities, models learn strongly associated and generalizable biological representations including sex, neuroblastoma MYCN amplification, and cell types. Stronger signals, such as tumor type, are best captured in models trained at lower dimensionalities, while more subtle signals such as pathway activity are best identified in models trained with more latent dimensionalities. CONCLUSIONS: There is no single best latent dimensionality or compression algorithm for analyzing gene expression data. Instead, using features derived from different compression models across multiple latent space dimensionalities enhances biological representations.
Asunto(s)
Compresión de Datos/métodos , Expresión Génica , Modelos Biológicos , Adulto , Niño , Humanos , Neoplasias/metabolismo , Aprendizaje Automático SupervisadoRESUMEN
The Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) has caused thousands of deaths worldwide, including >18,000 in New York City (NYC) alone. The sudden emergence of this pandemic has highlighted a pressing clinical need for rapid, scalable diagnostics that can detect infection, interrogate strain evolution, and identify novel patient biomarkers. To address these challenges, we designed a fast (30-minute) colorimetric test (LAMP) for SARS-CoV-2 infection from naso/oropharyngeal swabs, plus a large-scale shotgun metatranscriptomics platform (total-RNA-seq) for host, bacterial, and viral profiling. We applied both technologies across 857 SARS-CoV-2 clinical specimens and 86 NYC subway samples, providing a broad molecular portrait of the COVID-19 NYC outbreak. Our results define new features of SARS-CoV-2 evolution, nominate a novel, NYC-enriched viral subclade, reveal specific host responses in interferon, ACE, hematological, and olfaction pathways, and examine risks associated with use of ACE inhibitors and angiotensin receptor blockers. Together, these findings have immediate applications to SARS-CoV-2 diagnostics, public health, and new therapeutic targets.
RESUMEN
Deep learning describes a class of machine learning algorithms that are capable of combining raw inputs into layers of intermediate features. These algorithms have recently shown impressive results across a variety of domains. Biology and medicine are data-rich disciplines, but the data are complex and often ill-understood. Hence, deep learning techniques may be particularly well suited to solve problems of these fields. We examine applications of deep learning to a variety of biomedical problems-patient classification, fundamental biological processes and treatment of patients-and discuss whether deep learning will be able to transform these tasks or if the biomedical sphere poses unique challenges. Following from an extensive literature review, we find that deep learning has yet to revolutionize biomedicine or definitively resolve any of the most pressing challenges in the field, but promising advances have been made on the prior state of the art. Even though improvements over previous baselines have been modest in general, the recent progress indicates that deep learning methods will provide valuable means for speeding up or aiding human investigation. Though progress has been made linking a specific neural network's prediction to input features, understanding how users should interpret these models to make testable hypotheses about the system under study remains an open challenge. Furthermore, the limited amount of labelled data for training presents problems in some domains, as do legal and privacy constraints on work with sensitive health records. Nonetheless, we foresee deep learning enabling changes at both bench and bedside with the potential to transform several areas of biology and medicine.