Search | VHL Regional Portal

Interpretable and Predictive Deep Neural Network Modeling of the SARS-CoV-2 Spike Protein Sequence to Predict COVID-19 Disease Severity.

Sokhansanj, Bahrad A; Zhao, Zhengqiao; Rosen, Gail L.

Biology (Basel) ; 11(12)2022 Dec 08.

Article in English | MEDLINE | ID: mdl-36552295

ABSTRACT

Through the COVID-19 pandemic, SARS-CoV-2 has gained and lost multiple mutations in novel or unexpected combinations. Predicting how complex mutations affect COVID-19 disease severity is critical in planning public health responses as the virus continues to evolve. This paper presents a novel computational framework to complement conventional lineage classification and applies it to predict the severe disease potential of viral genetic variation. The transformer-based neural network model architecture has additional layers that provide sample embeddings and sequence-wide attention for interpretation and visualization. First, training a model to predict SARS-CoV-2 taxonomy validates the architecture's interpretability. Second, an interpretable predictive model of disease severity is trained on spike protein sequence and patient metadata from GISAID. Confounding effects of changing patient demographics, increasing vaccination rates, and improving treatment over time are addressed by including demographics and case date as independent input to the neural network model. The resulting model can be interpreted to identify potentially significant virus mutations and proves to be a robust predctive tool. Although trained on sequence data obtained entirely before the availability of empirical data for Omicron, the model can predict the Omicron's reduced risk of severe disease, in accord with epidemiological and experimental data.

Learning, visualizing and exploring 16S rRNA structure using an attention-based deep neural network.

Zhao, Zhengqiao; Woloszynek, Stephen; Agbavor, Felix; Mell, Joshua Chang; Sokhansanj, Bahrad A; Rosen, Gail L.

PLoS Comput Biol ; 17(9): e1009345, 2021 09.

Article in English | MEDLINE | ID: mdl-34550967

ABSTRACT

Recurrent neural networks with memory and attention mechanisms are widely used in natural language processing because they can capture short and long term sequential information for diverse tasks. We propose an integrated deep learning model for microbial DNA sequence data, which exploits convolutional neural networks, recurrent neural networks, and attention mechanisms to predict taxonomic classifications and sample-associated attributes, such as the relationship between the microbiome and host phenotype, on the read/sequence level. In this paper, we develop this novel deep learning approach and evaluate its application to amplicon sequences. We apply our approach to short DNA reads and full sequences of 16S ribosomal RNA (rRNA) marker genes, which identify the heterogeneity of a microbial community sample. We demonstrate that our implementation of a novel attention-based deep network architecture, Read2Pheno, achieves read-level phenotypic prediction. Training Read2Pheno models will encode sequences (reads) into dense, meaningful representations: learned embedded vectors output from the intermediate layer of the network model, which can provide biological insight when visualized. The attention layer of Read2Pheno models can also automatically identify nucleotide regions in reads/sequences which are particularly informative for classification. As such, this novel approach can avoid pre/post-processing and manual interpretation required with conventional approaches to microbiome sequence classification. We further show, as proof-of-concept, that aggregating read-level information can robustly predict microbial community properties, host phenotype, and taxonomic classification, with performance at least comparable to conventional approaches. An implementation of the attention-based deep learning network is available at https://github.com/EESI/sequence_attention (a python package) and https://github.com/EESI/seq2att (a command line tool).

Subject(s)

Deep Learning , Microbiota/genetics , Neural Networks, Computer , RNA, Ribosomal, 16S/genetics , Algorithms , Computational Biology , Databases, Genetic , Gastrointestinal Microbiome/genetics , Host Microbial Interactions/genetics , Humans , Inflammatory Bowel Diseases/microbiology , Natural Language Processing , Phenotype , Prevotella/classification , Prevotella/genetics , Prevotella/isolation & purification , Proof of Concept Study , RNA, Ribosomal, 16S/classification

Amino Acid k-mer Feature Extraction for Quantitative Antimicrobial Resistance (AMR) Prediction by Machine Learning and Model Interpretation for Biological Insights.

ValizadehAslani, Taha; Zhao, Zhengqiao; Sokhansanj, Bahrad A; Rosen, Gail L.

Biology (Basel) ; 9(11)2020 Oct 28.

Article in English | MEDLINE | ID: mdl-33126516

ABSTRACT

Machine learning algorithms can learn mechanisms of antimicrobial resistance from the data of DNA sequence without any a priori information. Interpreting a trained machine learning algorithm can be exploited for validating the model and obtaining new information about resistance mechanisms. Different feature extraction methods, such as SNP calling and counting nucleotide k-mers have been proposed for presenting DNA sequences to the model. However, there are trade-offs between interpretability, computational complexity and accuracy for different feature extraction methods. In this study, we have proposed a new feature extraction method, counting amino acid k-mers or oligopeptides, which provides easier model interpretation compared to counting nucleotide k-mers and reaches the same or even better accuracy in comparison with different methods. Additionally, we have trained machine learning algorithms using different feature extraction methods and compared the results in terms of accuracy, model interpretability and computational complexity. We have built a new feature selection pipeline for extraction of important features so that new AMR determinants can be discovered by analyzing these features. This pipeline allows the construction of models that only use a small number of features and can predict resistance accurately.

Genetic grouping of SARS-CoV-2 coronavirus sequences using informative subtype markers for pandemic spread visualization.

Zhao, Zhengqiao; Sokhansanj, Bahrad A; Malhotra, Charvi; Zheng, Kitty; Rosen, Gail L.

PLoS Comput Biol ; 16(9): e1008269, 2020 09.

Article in English | MEDLINE | ID: mdl-32941419

ABSTRACT

We propose an efficient framework for genetic subtyping of SARS-CoV-2, the novel coronavirus that causes the COVID-19 pandemic. Efficient viral subtyping enables visualization and modeling of the geographic distribution and temporal dynamics of disease spread. Subtyping thereby advances the development of effective containment strategies and, potentially, therapeutic and vaccine strategies. However, identifying viral subtypes in real-time is challenging: SARS-CoV-2 is a novel virus, and the pandemic is rapidly expanding. Viral subtypes may be difficult to detect due to rapid evolution; founder effects are more significant than selection pressure; and the clustering threshold for subtyping is not standardized. We propose to identify mutational signatures of available SARS-CoV-2 sequences using a population-based approach: an entropy measure followed by frequency analysis. These signatures, Informative Subtype Markers (ISMs), define a compact set of nucleotide sites that characterize the most variable (and thus most informative) positions in the viral genomes sequenced from different individuals. Through ISM compression, we find that certain distant nucleotide variants covary, including non-coding and ORF1ab sites covarying with the D614G spike protein mutation which has become increasingly prevalent as the pandemic has spread. ISMs are also useful for downstream analyses, such as spatiotemporal visualization of viral dynamics. By analyzing sequence data available in the GISAID database, we validate the utility of ISM-based subtyping by comparing spatiotemporal analyses using ISMs to epidemiological studies of viral transmission in Asia, Europe, and the United States. In addition, we show the relationship of ISMs to phylogenetic reconstructions of SARS-CoV-2 evolution, and therefore, ISMs can play an important complementary role to phylogenetic tree-based analysis, such as is done in the Nextstrain project. The developed pipeline dynamically generates ISMs for newly added SARS-CoV-2 sequences and updates the visualization of pandemic spatiotemporal dynamics, and is available on Github at https://github.com/EESI/ISM (Jupyter notebook), https://github.com/EESI/ncov_ism (command line tool) and via an interactive website at https://covid19-ism.coe.drexel.edu/.

Subject(s)

Betacoronavirus/classification , Betacoronavirus/genetics , Coronavirus Infections , Genomics/methods , Pandemics , Pneumonia, Viral , COVID-19 , Coronavirus Infections/epidemiology , Coronavirus Infections/transmission , Coronavirus Infections/virology , Evolution, Molecular , Genetic Markers/genetics , Genome, Viral/genetics , Humans , Mutation/genetics , Phylogeny , Pneumonia, Viral/epidemiology , Pneumonia, Viral/transmission , Pneumonia, Viral/virology , RNA, Viral/genetics , SARS-CoV-2 , Sequence Alignment , Sequence Analysis, RNA , Spatio-Temporal Analysis

Keeping up with the genomes: efficient learning of our increasing knowledge of the tree of life.

Zhao, Zhengqiao; Cristian, Alexandru; Rosen, Gail.

BMC Bioinformatics ; 21(1): 412, 2020 Sep 21.

Article in English | MEDLINE | ID: mdl-32957925

ABSTRACT

BACKGROUND: It is a computational challenge for current metagenomic classifiers to keep up with the pace of training data generated from genome sequencing projects, such as the exponentially-growing NCBI RefSeq bacterial genome database. When new reference sequences are added to training data, statically trained classifiers must be rerun on all data, resulting in a highly inefficient process. The rich literature of "incremental learning" addresses the need to update an existing classifier to accommodate new data without sacrificing much accuracy compared to retraining the classifier with all data. RESULTS: We demonstrate how classification improves over time by incrementally training a classifier on progressive RefSeq snapshots and testing it on: (a) all known current genomes (as a ground truth set) and (b) a real experimental metagenomic gut sample. We demonstrate that as a classifier model's knowledge of genomes grows, classification accuracy increases. The proof-of-concept naïve Bayes implementation, when updated yearly, now runs in 1/4th of the non-incremental time with no accuracy loss. CONCLUSIONS: It is evident that classification improves by having the most current knowledge at its disposal. Therefore, it is of utmost importance to make classifiers computationally tractable to keep up with the data deluge. The incremental learning classifier can be efficiently updated without the cost of reprocessing nor the access to the existing database and therefore save storage as well as computation resources.

Subject(s)

Gastrointestinal Microbiome/genetics , Genome, Bacterial , Machine Learning , Metagenomics/methods , Algorithms , Bacteria/genetics , Bayes Theorem , Humans , Metagenome , Sequence Analysis, DNA/methods

Emerging Priorities for Microbiome Research.

Cullen, Chad M; Aneja, Kawalpreet K; Beyhan, Sinem; Cho, Clara E; Woloszynek, Stephen; Convertino, Matteo; McCoy, Sophie J; Zhang, Yanyan; Anderson, Matthew Z; Alvarez-Ponce, David; Smirnova, Ekaterina; Karstens, Lisa; Dorrestein, Pieter C; Li, Hongzhe; Sen Gupta, Ananya; Cheung, Kevin; Powers, Jennifer Gloeckner; Zhao, Zhengqiao; Rosen, Gail L.

Front Microbiol ; 11: 136, 2020.

Article in English | MEDLINE | ID: mdl-32140140

ABSTRACT

Microbiome research has increased dramatically in recent years, driven by advances in technology and significant reductions in the cost of analysis. Such research has unlocked a wealth of data, which has yielded tremendous insight into the nature of the microbial communities, including their interactions and effects, both within a host and in an external environment as part of an ecological community. Understanding the role of microbiota, including their dynamic interactions with their hosts and other microbes, can enable the engineering of new diagnostic techniques and interventional strategies that can be used in a diverse spectrum of fields, spanning from ecology and agriculture to medicine and from forensics to exobiology. From June 19-23 in 2017, the NIH and NSF jointly held an Innovation Lab on Quantitative Approaches to Biomedical Data Science Challenges in our Understanding of the Microbiome. This review is inspired by some of the topics that arose as priority areas from this unique, interactive workshop. The goal of this review is to summarize the Innovation Lab's findings by introducing the reader to emerging challenges, exciting potential, and current directions in microbiome research. The review is broken into five key topic areas: (1) interactions between microbes and the human body, (2) evolution and ecology of microbes, including the role played by the environment and microbe-microbe interactions, (3) analytical and mathematical methods currently used in microbiome research, (4) leveraging knowledge of microbial composition and interactions to develop engineering solutions, and (5) interventional approaches and engineered microbiota that may be enabled by selectively altering microbial composition. As such, this review seeks to arm the reader with a broad understanding of the priorities and challenges in microbiome research today and provide inspiration for future investigation and multi-disciplinary collaboration.

Exploring thematic structure and predicted functionality of 16S rRNA amplicon data.

Woloszynek, Stephen; Mell, Joshua Chang; Zhao, Zhengqiao; Simpson, Gideon; O'Connor, Michael P; Rosen, Gail L.

PLoS One ; 14(12): e0219235, 2019.

Article in English | MEDLINE | ID: mdl-31825995

ABSTRACT

Analysis of microbiome data involves identifying co-occurring groups of taxa associated with sample features of interest (e.g., disease state). Elucidating such relations is often difficult as microbiome data are compositional, sparse, and have high dimensionality. Also, the configuration of co-occurring taxa may represent overlapping subcommunities that contribute to sample characteristics such as host status. Preserving the configuration of co-occurring microbes rather than detecting specific indicator species is more likely to facilitate biologically meaningful interpretations. Additionally, analyses that use taxonomic relative abundances to predict the abundances of different gene functions aggregate predicted functional profiles across taxa. This precludes straightforward identification of predicted functional components associated with subsets of co-occurring taxa. We provide an approach to explore co-occurring taxa using "topics" generated via a topic model and link these topics to specific sample features (e.g., disease state). Rather than inferring predicted functional content based on overall taxonomic relative abundances, we instead focus on inference of functional content within topics, which we parse by estimating interactions between topics and pathways through a multilevel, fully Bayesian regression model. We apply our methods to three publicly available 16S amplicon sequencing datasets: an inflammatory bowel disease dataset, an oral cancer dataset, and a time-series dataset. Using our topic model approach to uncover latent structure in 16S rRNA amplicon surveys, investigators can (1) capture groups of co-occurring taxa termed topics; (2) uncover within-topic functional potential; (3) link taxa co-occurrence, gene function, and environmental/host features; and (4) explore the way in which sets of co-occurring taxa behave and evolve over time. These methods have been implemented in a freely available R package: https://cran.r-project.org/package=themetagenomics, https://github.com/EESI/themetagenomics.

Subject(s)

Bacteria/classification , Bacteria/genetics , Biodiversity , Crohn Disease/microbiology , Metagenomics/methods , Mouth Neoplasms/microbiology , RNA, Ribosomal, 16S/genetics , Humans , Microbiota , Phylogeny , Sequence Analysis, DNA , Software , Time Factors

16S rRNA sequence embeddings: Meaningful numeric feature representations of nucleotide sequences that are convenient for downstream analyses.

Woloszynek, Stephen; Zhao, Zhengqiao; Chen, Jian; Rosen, Gail L.

PLoS Comput Biol ; 15(2): e1006721, 2019 02.

Article in English | MEDLINE | ID: mdl-30807567

ABSTRACT

Advances in high-throughput sequencing have increased the availability of microbiome sequencing data that can be exploited to characterize microbiome community structure in situ. We explore using word and sentence embedding approaches for nucleotide sequences since they may be a suitable numerical representation for downstream machine learning applications (especially deep learning). This work involves first encoding ("embedding") each sequence into a dense, low-dimensional, numeric vector space. Here, we use Skip-Gram word2vec to embed k-mers, obtained from 16S rRNA amplicon surveys, and then leverage an existing sentence embedding technique to embed all sequences belonging to specific body sites or samples. We demonstrate that these representations are meaningful, and hence the embedding space can be exploited as a form of feature extraction for exploratory analysis. We show that sequence embeddings preserve relevant information about the sequencing data such as k-mer context, sequence taxonomy, and sample class. Specifically, the sequence embedding space resolved differences among phyla, as well as differences among genera within the same family. Distances between sequence embeddings had similar qualities to distances between alignment identities, and embedding multiple sequences can be thought of as generating a consensus sequence. In addition, embeddings are versatile features that can be used for many downstream tasks, such as taxonomic and sample classification. Using sample embeddings for body site classification resulted in negligible performance loss compared to using OTU abundance data, and clustering embeddings yielded high fidelity species clusters. Lastly, the k-mer embedding space captured distinct k-mer profiles that mapped to specific regions of the 16S rRNA gene and corresponded with particular body sites. Together, our results show that embedding sequences results in meaningful representations that can be used for exploratory analyses or for downstream machine learning applications that require numeric data. Moreover, because the embeddings are trained in an unsupervised manner, unlabeled data can be embedded and used to bolster supervised machine learning tasks.

Subject(s)

RNA, Ribosomal, 16S/genetics , RNA, Ribosomal, 16S/physiology , Sequence Analysis, RNA/methods , Algorithms , Cluster Analysis , Computational Biology/methods , High-Throughput Nucleotide Sequencing/methods , Microbiota/genetics

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL