Enhancing Hi-C contact matrices for loop detection with Capricorn: a multiview diffusion model.

Fang, Tangqi; Liu, Yifeng; Woicik, Addie; Lu, Minsi; Jha, Anupama; Wang, Xiao; Li, Gang; Hristov, Borislav; Liu, Zixuan; Xu, Hanwen; Noble, William S; Wang, Sheng.

Bioinformatics ; 40(Supplement_1): i471-i480, 2024 Jun 28.

Article in English | MEDLINE | ID: mdl-38940142

ABSTRACT

MOTIVATION: High-resolution Hi-C contact matrices reveal the detailed three-dimensional architecture of the genome, but high-coverage experimental Hi-C data are expensive to generate. Simultaneously, chromatin structure analyses struggle with extremely sparse contact matrices. To address this problem, computational methods to enhance low-coverage contact matrices have been developed, but existing methods are largely based on resolution enhancement methods for natural images and hence often employ models that do not distinguish between biologically meaningful contacts, such as loops, and other stochastic contacts.
RESULTS: We present Capricorn, a machine learning model for Hi-C resolution enhancement that incorporates small-scale chromatin features as additional views of the input Hi-C contact matrix and leverages a diffusion probability model backbone to generate a high-coverage matrix. We show that Capricorn outperforms the state of the art in a cross-cell-line setting, improving on existing methods by 17% in mean squared error and 26% in F1 score for chromatin loop identification from the generated high-coverage data. We also demonstrate that Capricorn performs well in the cross-chromosome setting and the cross-chromosome, cross-cell-line setting, improving the downstream loop F1 score by 14% relative to existing methods. We further show that our multiview idea can also be used to improve several existing methods, HiCARN and HiCNN, indicating the wide applicability of this approach. Finally, we use DNA sequence to validate discovered loops and find that the fraction of CTCF-supported loops from Capricorn is similar to that of loops identified from the high-coverage data. Capricorn is a powerful Hi-C resolution enhancement method that enables scientists to find chromatin features that cannot be identified in the low-coverage contact matrix.
AVAILABILITY AND IMPLEMENTATION: Implementation of Capricorn and source code for reproducing all figures in this paper are available at https://github.com/CHNFTQ/Capricorn.

Subject(s)

Chromatin, Machine Learning, Chromatin/chemistry, Chromatin/metabolism, Humans, Computational Biology/methods, Algorithms, Software

Integrating socio-economic vulnerability factors improves neighborhood-scale wastewater-based epidemiology for public health applications.

Saingam, Prakit; Jain, Tanisha; Woicik, Addie; Li, Bo; Candry, Pieter; Redcorn, Raymond; Wang, Sheng; Himmelfarb, Jonathan; Bryan, Andrew; Winkler, Mari K H; Gattuso, Meghan.

Water Res ; 254: 121415, 2024 May 01.

Article in English | MEDLINE | ID: mdl-38479175

ABSTRACT

Wastewater-based epidemiology (WBE) of COVID-19 is a low-cost, non-invasive, and inclusive early warning tool for disease spread. Previous WBE studies focused on sampling at the wastewater treatment plant scale, limiting the level at which demographic and geographic variations in disease dynamics can be incorporated into the analysis of certain neighborhoods. This study demonstrates the integration of demographic mapping to improve the WBE of COVID-19 and associated post-COVID disease prediction (here kidney disease) at the neighborhood level using machine learning. WBE was conducted in six neighborhoods in Seattle from October 2020 to February 2022. Wastewater processing and RT-qPCR were performed to obtain SARS-CoV-2 RNA concentration. Census data, clinical data of COVID-19, as well as patient data of acute kidney injury (AKI) cases reported during the study period were collected and the distribution across the city was studied using Geographic Information System (GIS) mapping. Further, we analyzed the data set to better understand socioeconomic impacts on disease prevalence of COVID-19 and AKI per neighborhood. The heterogeneity of eleven demographic factors (such as education and age among others) was observed within neighborhoods across the city of Seattle. Dynamics of COVID-19 clinical cases and wastewater SARS-CoV-2 varied across neighborhoods with different demographic profiles. Machine learning models trained with data from the earlier stages of the pandemic were able to predict both COVID-19 and AKI incidence in the later stages of the pandemic (Spearman correlation coefficient of 0.546 to 0.904), with the most predictive model trained on the combination of wastewater data and demographics. The integration of demographics strengthened machine learning models' capabilities to predict prevalence of COVID-19, and of AKI as a marker for post-COVID sequelae.
Demographic-based WBE presents an effective tool to monitor and manage public health beyond COVID-19 at the neighborhood level.

Subject(s)

Acute Kidney Injury, COVID-19, Humans, Public Health, RNA, Viral, Wastewater, Wastewater-Based Epidemiological Monitoring, COVID-19/epidemiology, Socioeconomic Factors

Gemini: memory-efficient integration of hundreds of gene networks with high-order pooling.

Woicik, Addie; Zhang, Mingxin; Xu, Hanwen; Mostafavi, Sara; Wang, Sheng.

Bioinformatics ; 39(Suppl 1): i504-i512, 2023 Jun 30.

Article in English | MEDLINE | ID: mdl-37387142

ABSTRACT

MOTIVATION: The exponential growth of genomic sequencing data has created ever-expanding repositories of gene networks. Unsupervised network integration methods are critical to learn informative representations for each gene, which are later used as features for downstream applications. However, these network integration methods must be scalable to account for the increasing number of networks and robust to an uneven distribution of network types within hundreds of gene networks.
RESULTS: To address these needs, we present Gemini, a novel network integration method that uses memory-efficient high-order pooling to represent and weight each network according to its uniqueness. Gemini then mitigates the uneven network distribution through mixing up existing networks to create many new networks. We find that Gemini leads to more than a 10% improvement in F1 score, 15% improvement in micro-AUPRC, and 63% improvement in macro-AUPRC for human protein function prediction by integrating hundreds of networks from BioGRID, and that Gemini's performance significantly improves when more networks are added to the input network collection, while Mashup and BIONIC embeddings' performance deteriorates. Gemini thereby enables memory-efficient and informative network integration for large gene networks and can be used to massively integrate and analyze networks in other domains.
AVAILABILITY AND IMPLEMENTATION: Gemini can be accessed at: https://github.com/MinxZ/Gemini.

Subject(s)

Gene Regulatory Networks, Genomics, Humans, Chromosome Mapping

Multilingual translation for zero-shot biomedical classification using BioTranslator.

Xu, Hanwen; Woicik, Addie; Poon, Hoifung; Altman, Russ B; Wang, Sheng.

Nat Commun ; 14(1): 738, 2023 Feb 10.

Article in English | MEDLINE | ID: mdl-36759510

ABSTRACT

Existing annotation paradigms rely on controlled vocabularies, where each data instance is classified into one term from a predefined set of controlled vocabularies. This paradigm restricts the analysis to concepts that are known and well-characterized. Here, we present the novel multilingual translation method BioTranslator to address this problem. BioTranslator takes a user-written textual description of a new concept and then translates this description to a non-text biological data instance. The key idea of BioTranslator is to develop a multilingual translation framework, where multiple modalities of biological data are all translated to text. We demonstrate how BioTranslator enables the identification of novel cell types using only a textual description and how BioTranslator can be further generalized to protein function prediction and drug target identification. Our tool frees scientists from limiting their analyses within predefined controlled vocabularies, enabling them to interact with biological data using free text.

Subject(s)

Multilingualism, Vocabulary, Controlled, Proteins
