Search | Virtual Health Library

SomaticSiMu: a mutational signature simulator.

Chen, David; Randhawa, Gurjit S; Soltysiak, Maximillian P M; de Souza, Camila P E; Kari, Lila; Singh, Shiva M; Hill, Kathleen A.

Bioinformatics ; 38(9): 2619-2620, 2022 04 28.

Article in English | MEDLINE | ID: mdl-35258549

ABSTRACT

SUMMARY: SomaticSiMu is an in silico simulator of single and double base substitutions, and single base insertions and deletions in an input genomic sequence to mimic mutational signatures. SomaticSiMu outputs simulated DNA sequences and mutational catalogues with imposed mutational signatures. The tool is the first mutational signature simulator featuring a graphical user interface, control of mutation rates and built-in visualization tools of the simulated mutations. Simulated datasets are useful as a ground truth to test the accuracy and sensitivity of DNA sequence classification tools and mutational signature extraction tools under different experimental scenarios. The reliability of SomaticSiMu was affirmed by (i) supervised machine learning classification of simulated sequences with different mutation types and burdens, and (ii) mutational signature extraction from simulated mutational catalogues. AVAILABILITY AND IMPLEMENTATION: SomaticSiMu is written in Python 3.8.3. The open-source code, documentation and tutorials are available at https://github.com/HillLab/SomaticSiMu under the terms of the Creative Commons Attribution 4.0 International License. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
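The idea of imposing single base substitutions on an input sequence and recording them in a catalogue can be illustrated with a minimal sketch. This is not SomaticSiMu's code or algorithm: the function name `simulate_sbs`, the uniform choice of the alternate base, and the single per-base `rate` parameter are all assumptions made for illustration, and no signature-weighted sampling is attempted.

```python
import random

BASES = "ACGT"

def simulate_sbs(sequence, rate, seed=0):
    """Apply single base substitutions at a fixed per-base rate.

    Hypothetical helper: each A/C/G/T position mutates with probability
    `rate`, and the alternate base is drawn uniformly from the other three.
    Returns the mutated sequence and a catalogue of (position, ref, alt).
    """
    rng = random.Random(seed)  # seeded so the simulation is reproducible
    mutated = []
    catalogue = []
    for pos, ref in enumerate(sequence):
        if ref in BASES and rng.random() < rate:
            alt = rng.choice([b for b in BASES if b != ref])
            mutated.append(alt)
            catalogue.append((pos, ref, alt))
        else:
            mutated.append(ref)  # unchanged (or non-ACGT) position
    return "".join(mutated), catalogue
```

Because the catalogue records every imposed change, it can serve as the ground truth when checking what a downstream classifier or extraction tool recovers from the mutated sequence.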

Subject(s)

Genomics , Software , Reproducibility of Results , Mutation , Genome

MLDSP-GUI: an alignment-free standalone tool with an interactive graphical user interface for DNA sequence comparison and analysis.

Randhawa, Gurjit S; Hill, Kathleen A; Kari, Lila.

Bioinformatics ; 36(7): 2258-2259, 2020 04 01.

Article in English | MEDLINE | ID: mdl-31834361

ABSTRACT

SUMMARY: Machine Learning with Digital Signal Processing and Graphical User Interface (MLDSP-GUI) is an open-source, alignment-free, ultrafast, computationally lightweight, and standalone software tool with an interactive GUI for comparison and analysis of DNA sequences. MLDSP-GUI is a general-purpose tool that can be used for a variety of applications such as taxonomic classification, disease classification, virus subtype classification, evolutionary analyses, among others. AVAILABILITY AND IMPLEMENTATION: MLDSP-GUI is open-source, cross-platform compatible, and is available under the terms of the Creative Commons Attribution 4.0 International license (http://creativecommons.org/licenses/by/4.0/). The executable and dataset files are available at https://sourceforge.net/projects/mldsp-gui/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Software , User-Computer Interface , Base Sequence , Machine Learning , Signal Processing, Computer-Assisted

ML-DSP: Machine Learning with Digital Signal Processing for ultrafast, accurate, and scalable genome classification at all taxonomic levels.

Randhawa, Gurjit S; Hill, Kathleen A; Kari, Lila.

BMC Genomics ; 20(1): 267, 2019 Apr 03.

Article in English | MEDLINE | ID: mdl-30943897

ABSTRACT

BACKGROUND: Although software tools abound for the comparison, analysis, identification, and classification of genomic sequences, taxonomic classification remains challenging due to the magnitude of the datasets and the intrinsic problems associated with classification. The need exists for an approach and software tool that addresses the limitations of existing alignment-based methods, as well as the challenges of recently proposed alignment-free methods. RESULTS: We propose a novel combination of supervised Machine Learning with Digital Signal Processing, resulting in ML-DSP: an alignment-free software tool for ultrafast, accurate, and scalable genome classification at all taxonomic levels. We test ML-DSP by classifying 7396 full mitochondrial genomes at various taxonomic levels, from kingdom to genus, with an average classification accuracy of >97%. A quantitative comparison with state-of-the-art classification software tools is performed on two small benchmark datasets and one large dataset of 4322 vertebrate mtDNA genomes. Our results show that ML-DSP overwhelmingly outperforms the alignment-based software MEGA7 (alignment with MUSCLE or CLUSTALW) in terms of processing time, while having comparable classification accuracies for small datasets and superior accuracies for the large dataset. Compared with the alignment-free software FFP (Feature Frequency Profile), ML-DSP has significantly better classification accuracy, and is overall faster. We also provide preliminary experiments indicating the potential of ML-DSP to be used for other datasets, by classifying 4271 complete dengue virus genomes into subtypes with 100% accuracy, and 4710 bacterial genomes into phyla with 95.5% accuracy. Lastly, our analysis shows that the "Purine/Pyrimidine", "Just-A" and "Real" numerical representations of DNA sequences outperform ten other such numerical representations used in the Digital Signal Processing literature for DNA classification purposes.
CONCLUSIONS: Due to its superior classification accuracy, speed, and scalability to large datasets, ML-DSP is highly relevant in the classification of newly discovered organisms, in distinguishing genomic signatures and identifying their mechanistic determinants, and in evaluating genome integrity.
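The signal-processing view of a DNA sequence mentioned in the abstract can be sketched in a few lines. The abstract names the "Purine/Pyrimidine" and "Just-A" representations but does not give their numerical values, so the mappings below (purines to -1, pyrimidines to +1; A to 1, other bases to 0) are assumptions, as are the helper names. The sketch encodes a sequence as a numeric signal, takes the magnitude of its discrete Fourier transform, and compares two spectra with the Pearson correlation coefficient; it is not the ML-DSP implementation.

```python
import cmath
import math

# Assumed numerical values; the abstract only names the representations.
PURINE_PYRIMIDINE = {"A": -1.0, "G": -1.0, "C": 1.0, "T": 1.0}
JUST_A = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}

def encode(sequence, mapping):
    """Turn a DNA string into a list of numbers using the given mapping."""
    return [mapping[base] for base in sequence.upper()]

def magnitude_spectrum(signal):
    """Magnitude of the discrete Fourier transform (direct O(n^2) form)."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n)
    ]

def pearson(u, v):
    """Pearson correlation of two equal-length, non-constant vectors."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)
```

The direct transform is quadratic in sequence length and is only suitable for short examples; a fast Fourier transform would be the natural choice for whole genomes. Sequences are assumed here to have equal length so that their spectra are directly comparable.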

Subject(s)

Genome, Bacterial , Genome, Mitochondrial , Genome, Viral , Genomics/methods , Machine Learning , Signal Processing, Computer-Assisted , Software , Algorithms , Animals , Computer Simulation , Dengue Virus/genetics , Humans , Vertebrates/classification , Vertebrates/genetics

Environment and taxonomy shape the genomic signature of prokaryotic extremophiles.

Arias, Pablo Millán; Butler, Joseph; Randhawa, Gurjit S; Soltysiak, Maximillian P M; Hill, Kathleen A; Kari, Lila.

Sci Rep ; 13(1): 16105, 2023 09 26.

Article in English | MEDLINE | ID: mdl-37752120

ABSTRACT

This study provides comprehensive quantitative evidence suggesting that adaptations to extreme temperatures and pH imprint a discernible environmental component in the genomic signature of microbial extremophiles. Both supervised and unsupervised machine learning algorithms were used to analyze genomic signatures, each computed as the k-mer frequency vector of a 500 kbp DNA fragment arbitrarily selected to represent a genome. Computational experiments classified/clustered genomic signatures extracted from a curated dataset of [Formula: see text] extremophile (temperature, pH) bacteria and archaea genomes, at multiple scales of analysis, [Formula: see text]. The supervised learning resulted in high accuracies for taxonomic classifications at [Formula: see text], and medium to medium-high accuracies for environment category classifications of the same datasets at [Formula: see text]. For [Formula: see text], our findings were largely consistent with amino acid compositional biases and codon usage patterns in coding regions, previously attributed to extreme environment adaptations. The unsupervised learning of unlabelled sequences identified several exemplars of hyperthermophilic organisms with large similarities in their genomic signatures, in spite of belonging to different domains in the Tree of Life.
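A genomic signature of the kind described above, the k-mer frequency vector of a DNA fragment, can be computed as follows. This is a minimal sketch under stated assumptions: the function name is hypothetical, k-mers are counted with a sliding window of step 1 on a single strand, and windows containing characters other than A, C, G, T are skipped. The study's own choices of k and of any preprocessing are not reproduced here.

```python
from itertools import product

def kmer_frequency_vector(fragment, k):
    """Relative frequencies of all 4**k k-mers, in lexicographic ACGT order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = fragment.upper()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip windows with ambiguous bases such as N
            counts[word] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in kmers]
```

The resulting vector has a fixed length of 4**k regardless of fragment length, which is what allows it to be passed directly to supervised or unsupervised learning algorithms.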

Subject(s)

Extremophiles , Extremophiles/genetics , Genomics/methods , Bacteria/genetics , Archaea/genetics , Genome, Archaeal/genetics

Mutational Patterns Observed in SARS-CoV-2 Genomes Sampled From Successive Epochs Delimited by Major Public Health Events in Ontario, Canada: Genomic Surveillance Study.

Chen, David; Randhawa, Gurjit S; Soltysiak, Maximillian P M; de Souza, Camila P E; Kari, Lila; Singh, Shiva M; Hill, Kathleen A.

JMIR Bioinform Biotechnol ; 3(1): e42243, 2022 Dec 22.

Article in English | MEDLINE | ID: mdl-38935965

ABSTRACT

BACKGROUND: The emergence of SARS-CoV-2 variants with mutations associated with increased transmissibility and virulence is a public health concern in Ontario, Canada. Characterizing how the mutational patterns of the SARS-CoV-2 genome have changed over time can shed light on the driving factors, including selection for increased fitness and host immune response, that may contribute to the emergence of novel variants. Moreover, the study of SARS-CoV-2 in the microcosm of Ontario, Canada can reveal how different province-specific public health policies over time may be associated with observed mutational patterns as a model system. OBJECTIVE: This study aimed to perform a comprehensive analysis of single base substitution (SBS) types, counts, and genomic locations observed in SARS-CoV-2 genomic sequences sampled in Ontario, Canada. Comparisons of mutational patterns were conducted between sequences sampled during 4 different epochs delimited by major public health events to track the evolution of the SARS-CoV-2 mutational landscape over 2 years. METHODS: In total, 24,244 SARS-CoV-2 genomic sequences and associated metadata sampled in Ontario, Canada from January 1, 2020, to December 31, 2021, were retrieved from the Global Initiative on Sharing All Influenza Data database. Sequences were assigned to 4 epochs delimited by major public health events based on the sampling date. SBSs from each SARS-CoV-2 sequence were identified relative to the MN996528.1 reference genome. Catalogues of SBS types and counts were generated to estimate the impact of selection in each open reading frame, and identify mutation clusters. The estimation of mutational fitness over time was performed using the Augur pipeline. RESULTS: The biases in SBS types and proportions observed support previous reports of host antiviral defense activity involving the SARS-CoV-2 genome. 
There was an increase in U>C substitutions associated with adenosine deaminase acting on RNA (ADAR) activity uniquely observed during Epoch 4. The burden of novel SBSs observed in SARS-CoV-2 genomic sequences was the greatest in Epoch 2 (median 5), followed by Epoch 3 (median 4). Clusters of SBSs were observed in the spike protein open reading frame, ORF1a, and ORF3a. The high proportion of nonsynonymous SBSs and increasing dN/dS metric (ratio of nonsynonymous to synonymous mutations in a given open reading frame) to above 1 in Epoch 4 indicate positive selection of the spike protein open reading frame. CONCLUSIONS: Quantitative analysis of the mutational patterns of the SARS-CoV-2 genome in the microcosm of Ontario, Canada within early consecutive epochs of the pandemic tracked the mutational dynamics in the context of public health events that instigate significant shifts in selection and mutagenesis. Continued genomic surveillance of emergent variants will be useful for the design of public health policies in response to the evolving COVID-19 pandemic.
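Two of the quantities described above, a catalogue of SBS types relative to a reference and the ratio of nonsynonymous to synonymous changes in an open reading frame, can be sketched as follows. The function names are hypothetical, the sequences are assumed to be already aligned, of equal length, and written in the DNA alphabet (T rather than U), and the coding sequence is assumed to be in frame. The second function returns raw codon-level counts, matching the abstract's parenthetical description; it is not the site-normalized dN/dS estimate and not the study's pipeline.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def sbs_catalogue(reference, sample):
    """Count substitution types such as 'C>T' between two aligned sequences."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return Counter(
        f"{r}>{s}"
        for r, s in zip(reference, sample)
        if r != s and r in BASES and s in BASES  # ignore gaps and ambiguity codes
    )

def syn_nonsyn_counts(ref_cds, sample_cds):
    """Count changed codons that keep (synonymous) or alter (nonsynonymous) the amino acid."""
    syn = nonsyn = 0
    for i in range(0, len(ref_cds) - 2, 3):
        ref_codon, alt_codon = ref_cds[i:i + 3], sample_cds[i:i + 3]
        if ref_codon == alt_codon:
            continue
        if ref_codon not in CODON_TABLE or alt_codon not in CODON_TABLE:
            continue  # skip codons with gaps or ambiguous bases
        if CODON_TABLE[ref_codon] == CODON_TABLE[alt_codon]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn
```

A codon carrying more than one substitution is counted once in this sketch, so the counts are a simplification when mutations cluster within a codon.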

Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study.

Randhawa, Gurjit S; Soltysiak, Maximillian P M; El Roz, Hadi; de Souza, Camila P E; Hill, Kathleen A; Kari, Lila.

PLoS One ; 15(4): e0232391, 2020.

Article in English | MEDLINE | ID: mdl-32330208

ABSTRACT

The 2019 novel coronavirus (renamed SARS-CoV-2, and generally referred to as the COVID-19 virus) has spread to 184 countries with over 1.5 million confirmed cases. Such major viral outbreaks demand early elucidation of taxonomic classification and origin of the virus genomic sequence, for strategic planning, containment, and treatment. This paper identifies an intrinsic COVID-19 virus genomic signature and uses it together with a machine learning-based alignment-free approach for an ultra-fast, scalable, and highly accurate classification of whole COVID-19 virus genomes. The proposed method combines supervised machine learning with digital signal processing (MLDSP) for genome analyses, augmented by a decision tree approach to the machine learning component, and a Spearman's rank correlation coefficient analysis for result validation. These tools are used to analyze a large dataset of over 5000 unique viral genomic sequences, totalling 61.8 million bp, including the 29 COVID-19 virus sequences available on January 27, 2020. Our results support a hypothesis of a bat origin and classify the COVID-19 virus as Sarbecovirus, within Betacoronavirus. Our method achieves 100% accurate classification of the COVID-19 virus sequences, and discovers the most relevant relationships among over 5000 viral genomes within a few minutes, ab initio, using raw DNA sequence data alone, and without any specialized biological knowledge, training, gene or genome annotations. This suggests that, for novel viral and pathogen genome sequences, this alignment-free whole-genome machine-learning approach can provide a reliable real-time option for taxonomic classification.

Subject(s)

Betacoronavirus/genetics , Coronavirus Infections/virology , Genome, Viral , Machine Learning , Pneumonia, Viral/virology , Betacoronavirus/classification , COVID-19 , Coronavirus Infections/epidemiology , Genomics , Humans , Pandemics , Pneumonia, Viral/epidemiology , SARS-CoV-2
