Search | BVS Regional Portal

1.

Environment and taxonomy shape the genomic signature of prokaryotic extremophiles.

Arias, Pablo Millán; Butler, Joseph; Randhawa, Gurjit S; Soltysiak, Maximillian P M; Hill, Kathleen A; Kari, Lila.

Sci Rep ; 13(1): 16105, 2023 09 26.

Article in English | MEDLINE | ID: mdl-37752120

ABSTRACT

This study provides comprehensive quantitative evidence suggesting that adaptations to extreme temperatures and pH imprint a discernible environmental component in the genomic signature of microbial extremophiles. Both supervised and unsupervised machine learning algorithms were used to analyze genomic signatures, each computed as the k-mer frequency vector of a 500 kbp DNA fragment arbitrarily selected to represent a genome. Computational experiments classified/clustered genomic signatures extracted from a curated dataset of [Formula: see text] extremophile (temperature, pH) bacteria and archaea genomes, at multiple scales of analysis, [Formula: see text]. The supervised learning resulted in high accuracies for taxonomic classifications at [Formula: see text], and medium to medium-high accuracies for environment category classifications of the same datasets at [Formula: see text]. For [Formula: see text], our findings were largely consistent with amino acid compositional biases and codon usage patterns in coding regions, previously attributed to extreme environment adaptations. The unsupervised learning of unlabelled sequences identified several exemplars of hyperthermophilic organisms with large similarities in their genomic signatures, in spite of belonging to different domains in the Tree of Life.
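
The genomic signature described in this abstract, a k-mer frequency vector, can be illustrated with a minimal sketch. This is not the authors' code: the function name `kmer_frequency_vector`, the lexicographic ordering of k-mers, the choice to skip windows containing non-ACGT symbols, and the value k = 2 in the usage line are all assumptions made for illustration (the values of k used in the study are not recoverable from this record).

```python
from itertools import product


def kmer_frequency_vector(sequence, k):
    """Return the normalized k-mer frequency vector of a DNA sequence.

    Entries follow lexicographic order over the alphabet ACGT.
    Windows containing any other symbol (for example N) are skipped.
    """
    alphabet = "ACGT"
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = sequence.upper()
    # Slide a window of length k over the sequence and count valid k-mers.
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(kmers)
    # Normalize counts so that the vector sums to 1.
    return [counts[kmer] / total for kmer in kmers]


# Hypothetical toy input; a real signature would use a long genomic fragment.
signature = kmer_frequency_vector("ACGTACGTNACGT", k=2)
```

Vectors of this kind have 4^k entries, so they can be fed directly to standard classifiers or clustering algorithms regardless of sequence length.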

Subjects

Extremophiles , Extremophiles/genetics , Genomics/methods , Bacteria/genetics , Archaea/genetics , Archaeal Genome/genetics

2.

iDeLUCS: a deep learning interactive tool for alignment-free clustering of DNA sequences.

Millan Arias, Pablo; Hill, Kathleen A; Kari, Lila.

Bioinformatics ; 39(9), 2023 09 02.

Article in English | MEDLINE | ID: mdl-37589603

ABSTRACT

SUMMARY: We present an interactive Deep Learning-based software tool for Unsupervised Clustering of DNA Sequences (iDeLUCS), that detects genomic signatures and uses them to cluster DNA sequences, without the need for sequence alignment or taxonomic identifiers. iDeLUCS is scalable and user-friendly: its graphical user interface, with support for hardware acceleration, allows the practitioner to fine-tune the different hyper-parameters involved in the training process without requiring extensive knowledge of deep learning. The performance of iDeLUCS was evaluated on a diverse set of datasets: several real genomic datasets from organisms in kingdoms Animalia, Protista, Fungi, Bacteria, and Archaea, three datasets of viral genomes, a dataset of simulated metagenomic reads from microbial genomes, and multiple datasets of synthetic DNA sequences. The performance of iDeLUCS was compared to that of two classical clustering algorithms (k-means++ and GMM) and two clustering algorithms specialized in DNA sequences (MeShClust v3.0 and DeLUCS), using both intrinsic cluster evaluation metrics and external evaluation metrics. In terms of unsupervised clustering accuracy, iDeLUCS outperforms the two classical algorithms by an average of ~20%, and the two specialized algorithms by an average of ~12%, on the datasets of real DNA sequences analyzed. Overall, our results indicate that iDeLUCS is a robust clustering method suitable for the clustering of large and diverse datasets of unlabeled DNA sequences. AVAILABILITY AND IMPLEMENTATION: iDeLUCS is available at https://github.com/Kari-Genomics-Lab/iDeLUCS under the terms of the MIT licence.

Subjects

Deep Learning , Base Sequence , Algorithms , Archaea , Cluster Analysis

3.

MT-MAG: Accurate and interpretable machine learning for complete or partial taxonomic assignments of metagenome-assembled genomes.

Li, Wanxin; Kari, Lila; Yu, Yaoliang; Hug, Laura A.

PLoS One ; 18(8): e0283536, 2023.

Article in English | MEDLINE | ID: mdl-37594964

ABSTRACT

We propose MT-MAG, a novel machine learning-based software tool for the complete or partial hierarchically-structured taxonomic classification of metagenome-assembled genomes (MAGs). MT-MAG is alignment-free, with k-mer frequencies being the only feature used to distinguish a DNA sequence from another (herein k = 7). MT-MAG is capable of classifying large and diverse metagenomic datasets: a total of 245.68 Gbp in the training sets, and 9.6 Gbp in the test sets analyzed in this study. In addition to complete classifications, MT-MAG offers a "partial classification" option, whereby a classification at a higher taxonomic level is provided for MAGs that cannot be classified to the Species level. MT-MAG outputs complete or partial classification paths, and interpretable numerical classification confidences of its classifications, at all taxonomic ranks. To assess the performance of MT-MAG, we define a "weighted classification accuracy," with a weighting scheme reflecting the fact that partial classifications at different ranks are not equally informative. For the two benchmarking datasets analyzed (genomes from human gut microbiome species, and bacterial and archaeal genomes assembled from cow rumen metagenomic sequences), MT-MAG achieves an average of 87.32% in weighted classification accuracy. At the Species level, MT-MAG outperforms DeepMicrobes, the only other comparable software tool, by an average of 34.79% in weighted classification accuracy. In addition, MT-MAG is able to completely classify an average of 67.70% of the sequences at the Species level, compared with DeepMicrobes, which classifies only 47.45%. Moreover, MT-MAG provides additional information for sequences that it could not classify at the Species level, resulting in the partial or complete classification of 95.13% of the genomes in the datasets analyzed. Lastly, unlike other taxonomic assignment tools (e.g., GTDB-Tk), MT-MAG is an alignment-free and genetic marker-free tool, able to provide additional bioinformatics analysis to confirm existing or tentative taxonomic assignments.

Subjects

Gastrointestinal Microbiome , Metagenome , Animals , Cattle , Female , Humans , Metagenome/genetics , Benchmarking , Computational Biology , Machine Learning

4.

Leveraging machine learning for taxonomic classification of emerging astroviruses.

Alipour, Fatemeh; Holmes, Connor; Lu, Yang Young; Hill, Kathleen A; Kari, Lila.

Front Mol Biosci ; 10: 1305506, 2023.

Article in English | MEDLINE | ID: mdl-38274100

ABSTRACT

Astroviruses are a family of genetically diverse viruses associated with disease in humans and birds with significant health effects and economic burdens. Astrovirus taxonomic classification includes two genera, Avastrovirus and Mamastrovirus. However, with next-generation sequencing, broader interspecies transmission has been observed, necessitating a reexamination of the current host-based taxonomic classification approach. In this study, a novel taxonomic classification method is presented for emergent and as yet unclassified astroviruses, based on whole genome sequence k-mer composition in addition to host information. An optional component responsible for identifying recombinant sequences was added to the method's pipeline, to counteract the impact of genetic recombination on viral classification. The proposed three-pronged classification method consists of a supervised machine learning method, an unsupervised machine learning method, and the consideration of host species. Using this three-pronged approach, we propose genus labels for 191 as yet unclassified astrovirus genomes. Genus labels are also suggested for an additional eight as yet unclassified astrovirus genomes for which incompatibility was observed with the host species, suggesting cross-species infection. Lastly, our machine learning-based approach augmented by a principal component analysis (PCA) provides evidence supporting the hypothesis of the existence of a human astrovirus (HAstV) subgenus of the genus Mamastrovirus, and a goose astrovirus (GoAstV) subgenus of the genus Avastrovirus. Overall, this multipronged machine learning approach provides a fast, reliable, and scalable prediction method of taxonomic labels, able to keep pace with emerging viruses and the exponential increase in the output of modern genome sequencing technologies.

5.

SomaticSiMu: a mutational signature simulator.

Chen, David; Randhawa, Gurjit S; Soltysiak, Maximillian P M; de Souza, Camila P E; Kari, Lila; Singh, Shiva M; Hill, Kathleen A.

Bioinformatics ; 38(9): 2619-2620, 2022 04 28.

Article in English | MEDLINE | ID: mdl-35258549

ABSTRACT

SUMMARY: SomaticSiMu is an in silico simulator of single and double base substitutions, and single base insertions and deletions in an input genomic sequence to mimic mutational signatures. SomaticSiMu outputs simulated DNA sequences and mutational catalogues with imposed mutational signatures. The tool is the first mutational signature simulator featuring a graphical user interface, control of mutation rates and built-in visualization tools of the simulated mutations. Simulated datasets are useful as a ground truth to test the accuracy and sensitivity of DNA sequence classification tools and mutational signature extraction tools under different experimental scenarios. The reliability of SomaticSiMu was affirmed by (i) supervised machine learning classification of simulated sequences with different mutation types and burdens, and (ii) mutational signature extraction from simulated mutational catalogues. AVAILABILITY AND IMPLEMENTATION: SomaticSiMu is written in Python 3.8.3. The open-source code, documentation and tutorials are available at https://github.com/HillLab/SomaticSiMu under the terms of the Creative Commons Attribution 4.0 International License. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Genomics , Software , Reproducibility of Results , Mutation , Genome

6.

DeLUCS: Deep learning for unsupervised clustering of DNA sequences.

Millán Arias, Pablo; Alipour, Fatemeh; Hill, Kathleen A; Kari, Lila.

PLoS One ; 17(1): e0261531, 2022.

Article in English | MEDLINE | ID: mdl-35061715

ABSTRACT

We present a novel Deep Learning method for the Unsupervised Clustering of DNA Sequences (DeLUCS) that does not require sequence alignment, sequence homology, or (taxonomic) identifiers. DeLUCS uses Frequency Chaos Game Representations (FCGR) of primary DNA sequences, and generates "mimic" sequence FCGRs to self-learn data patterns (genomic signatures) through the optimization of multiple neural networks. A majority voting scheme is then used to determine the final cluster assignment for each sequence. The clusters learned by DeLUCS match true taxonomic groups for large and diverse datasets, with accuracies ranging from 77% to 100%: 2,500 complete vertebrate mitochondrial genomes, at taxonomic levels from sub-phylum to genera; 3,200 randomly selected 400 kbp-long bacterial genome segments, into clusters corresponding to bacterial families; three viral genome and gene datasets, averaging 1,300 sequences each, into clusters corresponding to virus subtypes. DeLUCS significantly outperforms two classic clustering methods (K-means++ and Gaussian Mixture Models) for unlabelled data, by as much as 47%. DeLUCS is highly effective: it is able to cluster datasets of unlabelled primary DNA sequences totalling over 1 billion bp of data, and it bypasses common limitations to classification resulting from the lack of sequence homology, variation in sequence length, and the absence or instability of sequence annotations and taxonomic identifiers. Thus, DeLUCS offers fast and accurate DNA sequence clustering for previously intractable datasets.
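
A Frequency Chaos Game Representation, the input format named in this abstract, can be sketched as a 2^k x 2^k grid of k-mer counts. The code below is an illustration only and is not taken from DeLUCS: the function name `fcgr`, the corner assignment (A, C, G, T mapped to the bit pairs shown), and the rule of restarting after a non-ACGT symbol are assumptions. Integer bit shifts are used instead of floating-point midpoints so that cell indices stay exact on long sequences.

```python
def fcgr(sequence, k):
    """Return a 2^k x 2^k grid of k-mer counts laid out as a chaos game plot.

    Each nucleotide contributes one bit to the column index and one bit to
    the row index; the most recent nucleotide supplies the most significant
    bit, which mirrors the midpoint rule of the Chaos Game Representation.
    """
    # Assumed corner convention as (x_bit, y_bit); other conventions exist.
    corner_bits = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}
    size = 1 << k
    grid = [[0] * size for _ in range(size)]
    col = row = 0
    run = 0  # number of consecutive valid nucleotides seen so far
    for base in sequence.upper():
        if base not in corner_bits:
            # Restart after an ambiguous symbol so no k-mer spans it.
            col = row = run = 0
            continue
        x_bit, y_bit = corner_bits[base]
        col = (col >> 1) | (x_bit << (k - 1))
        row = (row >> 1) | (y_bit << (k - 1))
        run += 1
        if run >= k:
            grid[row][col] += 1
    return grid


# Hypothetical toy input with k = 2, giving a 4 x 4 grid.
grid = fcgr("ACGT", k=2)
```

The grid can be treated as a grayscale image or flattened into a vector, which is what makes representations of this kind convenient inputs for neural networks.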

Subjects

Deep Learning

7.

Mutational Patterns Observed in SARS-CoV-2 Genomes Sampled From Successive Epochs Delimited by Major Public Health Events in Ontario, Canada: Genomic Surveillance Study.

Chen, David; Randhawa, Gurjit S; Soltysiak, Maximillian P M; de Souza, Camila P E; Kari, Lila; Singh, Shiva M; Hill, Kathleen A.

JMIR Bioinform Biotechnol ; 3(1): e42243, 2022 Dec 22.

Article in English | MEDLINE | ID: mdl-38935965

ABSTRACT

BACKGROUND: The emergence of SARS-CoV-2 variants with mutations associated with increased transmissibility and virulence is a public health concern in Ontario, Canada. Characterizing how the mutational patterns of the SARS-CoV-2 genome have changed over time can shed light on the driving factors, including selection for increased fitness and host immune response, that may contribute to the emergence of novel variants. Moreover, the study of SARS-CoV-2 in the microcosm of Ontario, Canada can reveal how different province-specific public health policies over time may be associated with observed mutational patterns as a model system. OBJECTIVE: This study aimed to perform a comprehensive analysis of single base substitution (SBS) types, counts, and genomic locations observed in SARS-CoV-2 genomic sequences sampled in Ontario, Canada. Comparisons of mutational patterns were conducted between sequences sampled during 4 different epochs delimited by major public health events to track the evolution of the SARS-CoV-2 mutational landscape over 2 years. METHODS: In total, 24,244 SARS-CoV-2 genomic sequences and associated metadata sampled in Ontario, Canada from January 1, 2020, to December 31, 2021, were retrieved from the Global Initiative on Sharing All Influenza Data database. Sequences were assigned to 4 epochs delimited by major public health events based on the sampling date. SBSs from each SARS-CoV-2 sequence were identified relative to the MN996528.1 reference genome. Catalogues of SBS types and counts were generated to estimate the impact of selection in each open reading frame, and identify mutation clusters. The estimation of mutational fitness over time was performed using the Augur pipeline. RESULTS: The biases in SBS types and proportions observed support previous reports of host antiviral defense activity involving the SARS-CoV-2 genome. 
There was an increase in U>C substitutions associated with adenosine deaminase acting on RNA (ADAR) activity uniquely observed during Epoch 4. The burden of novel SBSs observed in SARS-CoV-2 genomic sequences was the greatest in Epoch 2 (median 5), followed by Epoch 3 (median 4). Clusters of SBSs were observed in the spike protein open reading frame, ORF1a, and ORF3a. The high proportion of nonsynonymous SBSs and increasing dN/dS metric (ratio of nonsynonymous to synonymous mutations in a given open reading frame) to above 1 in Epoch 4 indicate positive selection of the spike protein open reading frame. CONCLUSIONS: Quantitative analysis of the mutational patterns of the SARS-CoV-2 genome in the microcosm of Ontario, Canada within early consecutive epochs of the pandemic tracked the mutational dynamics in the context of public health events that instigate significant shifts in selection and mutagenesis. Continued genomic surveillance of emergent variants will be useful for the design of public health policies in response to the evolving COVID-19 pandemic.
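
The catalogue of SBS types and counts mentioned in the methods can be illustrated with a small sketch. It assumes the sample has already been aligned to the reference so that both strings have the same length, skips gaps and ambiguous symbols, and reports substitutions in RNA notation (U rather than T) to match the "U>C" style used above. The function name `sbs_catalogue` and the toy sequences are hypothetical; the study's own pipeline is not described at this level of detail.

```python
from collections import Counter


def sbs_catalogue(reference, sample):
    """Count single base substitutions (SBSs) by type, for example 'U>C'.

    Both sequences must be pre-aligned to the same length. Positions where
    either sequence holds a gap or an ambiguous symbol are ignored.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGU")
    # Convert to RNA notation so that T and U are treated as the same base.
    ref_rna = reference.upper().replace("T", "U")
    alt_rna = sample.upper().replace("T", "U")
    catalogue = Counter()
    for ref_base, alt_base in zip(ref_rna, alt_rna):
        if ref_base in valid and alt_base in valid and ref_base != alt_base:
            catalogue[f"{ref_base}>{alt_base}"] += 1
    return catalogue


# Hypothetical toy sequences, not real SARS-CoV-2 data.
reference = "ACGTTGCA"
sample = "ACATCGCA"
catalogue = sbs_catalogue(reference, sample)
```

Summing such catalogues over all sequences in an epoch gives the per-epoch SBS type proportions that the abstract compares.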

8.

Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study.

Randhawa, Gurjit S; Soltysiak, Maximillian P M; El Roz, Hadi; de Souza, Camila P E; Hill, Kathleen A; Kari, Lila.

PLoS One ; 15(4): e0232391, 2020.

Article in English | MEDLINE | ID: mdl-32330208

ABSTRACT

The 2019 novel coronavirus (renamed SARS-CoV-2, and generally referred to as the COVID-19 virus) has spread to 184 countries with over 1.5 million confirmed cases. Such major viral outbreaks demand early elucidation of taxonomic classification and origin of the virus genomic sequence, for strategic planning, containment, and treatment. This paper identifies an intrinsic COVID-19 virus genomic signature and uses it together with a machine learning-based alignment-free approach for an ultra-fast, scalable, and highly accurate classification of whole COVID-19 virus genomes. The proposed method combines supervised machine learning with digital signal processing (MLDSP) for genome analyses, augmented by a decision tree approach to the machine learning component, and a Spearman's rank correlation coefficient analysis for result validation. These tools are used to analyze a large dataset of over 5000 unique viral genomic sequences, totalling 61.8 million bp, including the 29 COVID-19 virus sequences available on January 27, 2020. Our results support a hypothesis of a bat origin and classify the COVID-19 virus as Sarbecovirus, within Betacoronavirus. Our method achieves 100% accurate classification of the COVID-19 virus sequences, and discovers the most relevant relationships among over 5000 viral genomes within a few minutes, ab initio, using raw DNA sequence data alone, and without any specialized biological knowledge, training, gene or genome annotations. This suggests that, for novel viral and pathogen genome sequences, this alignment-free whole-genome machine-learning approach can provide a reliable real-time option for taxonomic classification.

Subjects

Betacoronavirus/genetics , Coronavirus Infections/virology , Viral Genome , Machine Learning , Viral Pneumonia/virology , Betacoronavirus/classification , COVID-19 , Coronavirus Infections/epidemiology , Genomics , Humans , Pandemics , Viral Pneumonia/epidemiology , SARS-CoV-2

9.

MLDSP-GUI: an alignment-free standalone tool with an interactive graphical user interface for DNA sequence comparison and analysis.

Randhawa, Gurjit S; Hill, Kathleen A; Kari, Lila.

Bioinformatics ; 36(7): 2258-2259, 2020 04 01.

Article in English | MEDLINE | ID: mdl-31834361

ABSTRACT

SUMMARY: Machine Learning with Digital Signal Processing and Graphical User Interface (MLDSP-GUI) is an open-source, alignment-free, ultrafast, computationally lightweight, and standalone software tool with an interactive GUI for comparison and analysis of DNA sequences. MLDSP-GUI is a general-purpose tool that can be used for a variety of applications such as taxonomic classification, disease classification, virus subtype classification, evolutionary analyses, among others. AVAILABILITY AND IMPLEMENTATION: MLDSP-GUI is open-source, cross-platform compatible, and is available under the terms of the Creative Commons Attribution 4.0 International license (http://creativecommons.org/licenses/by/4.0/). The executable and dataset files are available at https://sourceforge.net/projects/mldsp-gui/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Software , User-Computer Interface , Base Sequence , Machine Learning , Computer-Assisted Signal Processing

10.

ML-DSP: Machine Learning with Digital Signal Processing for ultrafast, accurate, and scalable genome classification at all taxonomic levels.

Randhawa, Gurjit S; Hill, Kathleen A; Kari, Lila.

BMC Genomics ; 20(1): 267, 2019 Apr 03.

Article in English | MEDLINE | ID: mdl-30943897

ABSTRACT

BACKGROUND: Although software tools abound for the comparison, analysis, identification, and classification of genomic sequences, taxonomic classification remains challenging due to the magnitude of the datasets and the intrinsic problems associated with classification. The need exists for an approach and software tool that addresses the limitations of existing alignment-based methods, as well as the challenges of recently proposed alignment-free methods. RESULTS: We propose a novel combination of supervised Machine Learning with Digital Signal Processing, resulting in ML-DSP: an alignment-free software tool for ultrafast, accurate, and scalable genome classification at all taxonomic levels. We test ML-DSP by classifying 7396 full mitochondrial genomes at various taxonomic levels, from kingdom to genus, with an average classification accuracy of >97%. A quantitative comparison with state-of-the-art classification software tools is performed, on two small benchmark datasets and one large 4322 vertebrate mtDNA genomes dataset. Our results show that ML-DSP overwhelmingly outperforms the alignment-based software MEGA7 (alignment with MUSCLE or CLUSTALW) in terms of processing time, while having comparable classification accuracies for small datasets and superior accuracies for the large dataset. Compared with the alignment-free software FFP (Feature Frequency Profile), ML-DSP has significantly better classification accuracy, and is overall faster. We also provide preliminary experiments indicating the potential of ML-DSP to be used for other datasets, by classifying 4271 complete dengue virus genomes into subtypes with 100% accuracy, and 4,710 bacterial genomes into phyla with 95.5% accuracy. Lastly, our analysis shows that the "Purine/Pyrimidine", "Just-A" and "Real" numerical representations of DNA sequences outperform ten other such numerical representations used in the Digital Signal Processing literature for DNA classification purposes. 
CONCLUSIONS: Due to its superior classification accuracy, speed, and scalability to large datasets, ML-DSP is highly relevant in the classification of newly discovered organisms, in distinguishing genomic signatures and identifying their mechanistic determinants, and in evaluating genome integrity.
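
The idea of turning a DNA sequence into a numerical signal and then applying digital signal processing can be sketched as follows. This is not the ML-DSP implementation. The purine/pyrimidine mapping used here (A and G to -1, C and T to +1) is an assumption about what a "Purine/Pyrimidine" representation looks like, the function names are hypothetical, and the naive O(n^2) discrete Fourier transform is written with the standard library only to keep the example self-contained; a real pipeline would use an FFT routine.

```python
import cmath


def purine_pyrimidine(sequence):
    """Map a DNA sequence to a numeric signal (assumed convention).

    Purines (A, G) become -1 and pyrimidines (C, T) become +1.
    Any other symbol is dropped.
    """
    mapping = {"A": -1, "G": -1, "C": 1, "T": 1}
    return [mapping[base] for base in sequence.upper() if base in mapping]


def magnitude_spectrum(signal):
    """Return the magnitudes of the discrete Fourier transform of a signal."""
    n = len(signal)
    spectrum = []
    for f in range(n):
        # Direct evaluation of the DFT sum for frequency index f.
        coefficient = sum(
            value * cmath.exp(-2j * cmath.pi * f * t / n)
            for t, value in enumerate(signal)
        )
        spectrum.append(abs(coefficient))
    return spectrum


# Hypothetical toy input: strictly alternating purine/pyrimidine pattern.
signal = purine_pyrimidine("ACGTACGT")
spectrum = magnitude_spectrum(signal)
```

For this alternating toy signal all of the energy falls at the highest frequency index (n/2), which shows how periodic structure in a sequence surfaces as peaks in the spectrum that can then be compared between sequences.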

Subjects

Bacterial Genome , Mitochondrial Genome , Viral Genome , Genomics/methods , Machine Learning , Computer-Assisted Signal Processing , Software , Algorithms , Animals , Computer Simulation , Dengue Virus/genetics , Humans , Vertebrates/classification , Vertebrates/genetics

11.

An open-source k-mer based machine learning tool for fast and accurate subtyping of HIV-1 genomes.

Solis-Reyes, Stephen; Avino, Mariano; Poon, Art; Kari, Lila.

PLoS One ; 13(11): e0206409, 2018.

Article in English | MEDLINE | ID: mdl-30427878

ABSTRACT

For many disease-causing virus species, global diversity is clustered into a taxonomy of subtypes with clinical significance. In particular, the classification of infections among the subtypes of human immunodeficiency virus type 1 (HIV-1) is a routine component of clinical management, and there are now many classification algorithms available for this purpose. Although several of these algorithms are similar in accuracy and speed, the majority are proprietary and require laboratories to transmit HIV-1 sequence data over the network to remote servers. This potentially exposes sensitive patient data to unauthorized access, and makes it impossible to determine how classifications are made and to maintain the data provenance of clinical bioinformatic workflows. We propose an open-source supervised and alignment-free subtyping method (Kameris) that operates on k-mer frequencies in HIV-1 sequences. We performed a detailed study of the accuracy and performance of subtype classification in comparison to four state-of-the-art programs. Based on our testing data set of manually curated real-world HIV-1 sequences (n = 2,784), Kameris obtained an overall accuracy of 97%, which matches or exceeds all other tested software, with a processing rate of over 1,500 sequences per second. Furthermore, our fully standalone general-purpose software provides key advantages in terms of data security and privacy, transparency and reproducibility. Finally, we show that our method is readily adaptable to subtype classification of other viruses including dengue, influenza A, and hepatitis B and C virus.

Subjects

Viral Genome/genetics , Genomics/methods , HIV-1/genetics , Machine Learning , Software , Time Factors

12.

MoDMaps3D: an interactive webtool for the quantification and 3D visualization of interrelationships in a dataset of DNA sequences.

Karamichalis, Rallis; Kari, Lila.

Bioinformatics ; 33(19): 3091-3093, 2017 Oct 01.

Article in English | MEDLINE | ID: mdl-28605460

ABSTRACT

SUMMARY: MoDMaps3D (Molecular Distance Maps 3D) is an alignment-free, fast, computationally lightweight webtool for computing and visualizing the interrelationships within any dataset of DNA sequences, based on pairwise comparisons between their oligomer compositions. MoDMaps3D is a general-purpose interactive webtool that is free of any requirements on sequence composition, position of the sequences in their respective genomes, presence or absence of similarity or homology, sequence length, or even sequence origin (biological or computer-generated). AVAILABILITY AND IMPLEMENTATION: MoDMaps3D is open source, cross-platform compatible, and is available under the MIT license at http://moleculardistancemaps.github.io/MoDMaps3D/. The source code is available at https://github.com/moleculardistancemaps/MoDMaps3D/. CONTACT: lila@uwaterloo.ca. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

DNA/chemistry , DNA Sequence Analysis/methods , Software , Computer Graphics , Internet

13.

Additive methods for genomic signatures.

Karamichalis, Rallis; Kari, Lila; Konstantinidis, Stavros; Kopecki, Steffen; Solis-Reyes, Stephen.

BMC Bioinformatics ; 17(1): 313, 2016 Aug 22.

Article in English | MEDLINE | ID: mdl-27549194

ABSTRACT

BACKGROUND: Studies exploring the potential of Chaos Game Representations (CGR) of genomic sequences to act as "genomic signatures" (to be species- and genome-specific) showed that CGR patterns of nuclear and organellar DNA sequences of the same organism can be very different. While the hypothesis that CGRs of mitochondrial DNA sequences can act as genomic signatures was validated for a snapshot of all sequenced mitochondrial genomes available in the NCBI GenBank sequence database, to our knowledge no such extensive analysis of CGRs of nuclear DNA sequences exists to date. RESULTS: We analyzed an extensive dataset, totalling 1.45 gigabase pairs, of nuclear/nucleoid genomic sequences (nDNA) from 42 different organisms, spanning all major kingdoms of life. Our computational experiments indicate that CGR signatures of nDNA of two different origins cannot always be differentiated, especially if they originate from closely-related species such as H. sapiens and P. troglodytes or E. coli and E. fergusonii. To address this issue, we propose the general concept of additive DNA signature of a set (collection) of DNA sequences. One particular instance, the composite DNA signature, combines information from nDNA fragments and organellar (mitochondrial, chloroplast, or plasmid) genomes. We demonstrate that, in this dataset, composite DNA signatures originating from two different organisms can be differentiated in all cases, including those where the use of CGR signatures of nDNA failed or was inconclusive. Another instance, the assembled DNA signature, combines information from many short DNA subfragments (e.g., 100 basepairs) of a given DNA fragment, to produce its signature. We show that an assembled DNA signature has the same distinguishing power as a conventionally computed CGR signature, while using shorter contiguous sequences and potentially less sequence information. 
CONCLUSIONS: Our results suggest that, while CGR signatures of nDNA cannot always play the role of genomic signatures, composite and assembled DNA signatures (separately or in combination) could potentially be used instead. Such additive signatures could be used, e.g., with raw unassembled next-generation sequencing (NGS) read data, when high-quality sequencing data is not available, or to complement information obtained by other methods of species identification or classification.

Subjects

Bacterial Genome , Genomics/methods , Animals , Bacteria/classification , Bacteria/genetics , Bacterial DNA/genetics , Mitochondrial DNA/genetics , High-Throughput Nucleotide Sequencing , Humans

14.

An investigation into inter- and intragenomic variations of graphic genomic signatures.

Karamichalis, Rallis; Kari, Lila; Konstantinidis, Stavros; Kopecki, Steffen.

BMC Bioinformatics ; 16: 246, 2015 Aug 07.

Article in English | MEDLINE | ID: mdl-26249837

ABSTRACT

BACKGROUND: Motivated by the general need to identify and classify species based on molecular evidence, genome comparisons have been proposed that are based on measuring mostly Euclidean distances between Chaos Game Representation (CGR) patterns of genomic DNA sequences. RESULTS: We provide, on an extensive dataset and using several different distances, confirmation of the hypothesis that CGR patterns are preserved along a genomic DNA sequence, and are different for DNA sequences originating from genomes of different species. This finding lends support to the theory that CGRs of genomic sequences can act as graphic genomic signatures. In particular, we compare the CGR patterns of over five hundred different 150,000 bp genomic sequences spanning one complete chromosome from each of six organisms, representing all kingdoms of life: H. sapiens (Animalia; chromosome 21), S. cerevisiae (Fungi; chromosome 4), A. thaliana (Plantae; chromosome 1), P. falciparum (Protista; chromosome 14), E. coli (Bacteria - full genome), and P. furiosus (Archaea - full genome). To maximize the diversity within each species, we also analyze the interrelationships within a set of over five hundred 150,000 bp genomic sequences sampled from the entire aforementioned genomes. Lastly, we provide some preliminary evidence of this method's ability to classify genomic DNA sequences at lower taxonomic levels by comparing sequences sampled from the entire genome of H. sapiens (class Mammalia, order Primates) and of M. musculus (class Mammalia, order Rodentia), for a total length of approximately 174 million basepairs analyzed. We compute pairwise distances between CGRs of these genomic sequences using six different distances, and construct Molecular Distance Maps, which visualize all sequences as points in a two-dimensional or three-dimensional space, to simultaneously display their interrelationships. 
CONCLUSION: Our analysis confirms, for this dataset, that CGR patterns of DNA sequences from the same genome are in general quantitatively similar, while being different for DNA sequences from genomes of different species. Our assessment of the performance of the six distances analyzed uses three different quality measures and suggests that several distances outperform the Euclidean distance, which has so far been almost exclusively used for such studies.

Subjects

Algorithms , Computer Graphics , DNA/chemistry , DNA/genetics , Genome , Genomics/methods , DNA Sequence Analysis/methods , Animals , Arabidopsis/genetics , Escherichia coli/genetics , Humans , Plasmodium falciparum/genetics , Saccharomyces cerevisiae/genetics

15.

Mapping the space of genomic signatures.

Kari, Lila; Hill, Kathleen A; Sayem, Abu S; Karamichalis, Rallis; Bryans, Nathaniel; Davis, Katelyn; Dattani, Nikesh S.

PLoS One ; 10(5): e0119815, 2015.

Article in English | MEDLINE | ID: mdl-26000734

ABSTRACT

We propose a computational method to measure and visualize interrelationships among any number of DNA sequences allowing, for example, the examination of hundreds or thousands of complete mitochondrial genomes. An "image distance" is computed for each pair of graphical representations of DNA sequences, and the distances are visualized as a Molecular Distance Map: Each point on the map represents a DNA sequence, and the spatial proximity between any two points reflects the degree of structural similarity between the corresponding sequences. The graphical representation of DNA sequences utilized, Chaos Game Representation (CGR), is genome- and species-specific and can thus act as a genomic signature. Consequently, Molecular Distance Maps could inform species identification, taxonomic classifications and, to a certain extent, evolutionary history. The image distance employed, Structural Dissimilarity Index (DSSIM), implicitly compares the occurrences of oligomers of length up to k (herein k = 9) in DNA sequences. We computed DSSIM distances for more than 5 million pairs of complete mitochondrial genomes, and used Multi-Dimensional Scaling (MDS) to obtain Molecular Distance Maps that visually display the sequence relatedness in various subsets, at different taxonomic levels. This general-purpose method does not require DNA sequence alignment and can thus be used to compare similar or vastly different DNA sequences, genomic or computer-generated, of the same or different lengths. We illustrate potential uses of this approach by applying it to several taxonomic subsets: phylum Vertebrata, (super)kingdom Protista, classes Amphibia-Insecta-Mammalia, class Amphibia, and order Primates. This analysis of an extensive dataset confirms that the oligomer composition of full mtDNA sequences can be a source of taxonomic information. 
This method also correctly identifies the mtDNA sequences most closely related to that of the anatomically modern human (the Neanderthal, the Denisovan, and the chimp), and finds that the sequence most different from it in this dataset belongs to a cucumber.
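As a rough illustration of the Chaos Game Representation mentioned in the abstract, the sketch below counts how many k-mers of a sequence fall into each cell of a 2^k x 2^k CGR grid. The corner assignment in `CORNERS`, the function name `fcgr`, and the choice to skip k-mers containing symbols other than A, C, G, T are assumptions made for this example; the abstract does not specify these details, and the DSSIM image distance and MDS steps are not reproduced here.

```python
# Hypothetical corner assignment for the four nucleotides (one common
# convention; the abstract does not state which one the authors use).
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(seq, k):
    """Return a 2^k x 2^k grid of k-mer counts laid out as a CGR image.

    In the chaos game each new point is the midpoint between the previous
    point and the corner of the current base, so after k moves the point
    lies in a grid cell determined by the last k bases. The cell index is
    computed here with integer bit operations, where the most recent base
    contributes the most significant bit.
    """
    n = 1 << k
    grid = [[0] * n for _ in range(n)]
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Skip windows containing ambiguous symbols such as N.
        if any(base not in CORNERS for base in kmer):
            continue
        col = row = 0
        for j, base in enumerate(kmer):
            cx, cy = CORNERS[base]
            col |= cx << j
            row |= cy << j
        grid[row][col] += 1
    return grid


if __name__ == "__main__":
    for line in fcgr("ACGTACGTTTGA", 2):
        print(line)
```

Two such grids, computed with the same `k`, could then be compared by any image or vector distance; which distance is appropriate is a separate design choice.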

Subjects

Mitochondrial DNA/genetics , Theoretical Models , Animals

16.

Geometrical tile design for complex neighborhoods.

Czeizler, Eugen; Kari, Lila.

Front Comput Neurosci ; 3: 20, 2009.

Article in English | MEDLINE | ID: mdl-19956398

ABSTRACT

Recent research has shown that tile systems are one of the most suitable theoretical frameworks for the spatial study and modeling of self-assembly processes, such as the formation of DNA and protein oligomeric structures. A Wang tile is a unit square, with glues on its edges, attaching to other tiles and forming larger and larger structures. Although quite intuitive, the idea of glues placed on the edges of a tile is not always natural for simulating the interactions occurring in some real systems. For example, when considering protein self-assembly, the shape of a protein is the main determinant of its functions and its interactions with other proteins. Our goal is to use geometric tiles, i.e., square tiles with geometrical protrusions on their edges, for simulating tiled paths (zippers) with complex neighborhoods, by ribbons of geometric tiles with simple, local neighborhoods. This paper is a step toward solving the general case of an arbitrary neighborhood, by proposing geometric tile designs that solve the case of a "tall" von Neumann neighborhood, the case of the f-shaped neighborhood, and the case of a 3 x 5 "filled" rectangular neighborhood. The techniques can be combined and generalized to solve the problem in the case of any neighborhood, centered at the tile of reference, and included in a 3 x (2k + 1) rectangle.
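The classical Wang tile described in the abstract can be sketched as a small data structure. The example below is a simplified illustration only: the `Tile` type, the string glue labels, and the rule that two adjacent tiles fit when their touching glues are equal are assumptions made for this sketch, and it does not model the geometric protrusions or the ribbon constructions proposed in the paper.

```python
from typing import List, NamedTuple, Optional


class Tile(NamedTuple):
    """A unit-square tile with one glue label per edge (hypothetical model)."""
    north: str
    east: str
    south: str
    west: str


def is_valid_assembly(grid: List[List[Optional[Tile]]]) -> bool:
    """Check that every pair of adjacent tiles has matching glues.

    Rows are listed from top to bottom; None marks an empty position.
    Only right and down neighbors are checked, which covers each shared
    edge exactly once.
    """
    for r, row in enumerate(grid):
        for c, tile in enumerate(row):
            if tile is None:
                continue
            # Right neighbor: this tile's east glue meets its west glue.
            if c + 1 < len(row) and row[c + 1] is not None:
                if tile.east != row[c + 1].west:
                    return False
            # Lower neighbor: this tile's south glue meets its north glue.
            if r + 1 < len(grid) and c < len(grid[r + 1]):
                below = grid[r + 1][c]
                if below is not None and tile.south != below.north:
                    return False
    return True


if __name__ == "__main__":
    a = Tile(north="x", east="g", south="y", west="z")
    b = Tile(north="x", east="h", south="y", west="g")
    print(is_valid_assembly([[a, b]]))
    print(is_valid_assembly([[b, a]]))
```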

17.

The spectrum of genomic signatures: from dinucleotides to chaos game representation.

Wang, Yingwei; Hill, Kathleen; Singh, Shiva; Kari, Lila.

Gene ; 346: 173-85, 2005 Feb 14.

Article in English | MEDLINE | ID: mdl-15716010

ABSTRACT

In the post-genomic era, access to complete genome sequence data for numerous diverse species has opened multiple avenues for examining and comparing primary DNA sequence organization of entire genomes. Previously, the concept of a genomic signature was introduced with the observation of species-type specific Dinucleotide Relative Abundance Profiles (DRAPs); dinucleotides were identified as the subsequences with the greatest bias in representation in a majority of genomes. Herein, we demonstrate that DRAP is one particular genomic signature contained within a broader spectrum of signatures. Within this spectrum, an alternative genomic signature, Chaos Game Representation (CGR), provides a unique visualization of patterns in sequence organization. A genomic signature is associated with a particular integer order or subsequence length that represents a measure of the resolution or granularity in the analysis of primary DNA sequence organization. We quantitatively explore the organizational information provided by genomic signatures of different orders through different distance measures, including a novel Image Distance. The Image Distance and other existing distance measures are evaluated by comparing the phylogenetic trees they generate for 26 complete mitochondrial genomes from a diversity of species. The phylogenetic tree generated by the Image Distance is compatible with the known relatedness of species. Quantitative evaluation of the spectrum of genomic signatures may be used to ultimately gain insight into the determinants and biological relevance of the genomic signatures.
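To make the notion of a dinucleotide relative abundance profile concrete, the sketch below computes the commonly used odds ratio, observed dinucleotide frequency divided by the product of the two mononucleotide frequencies. This is an illustration under stated assumptions: the function name `dinucleotide_relative_abundance` is hypothetical, frequencies are taken from a single strand (the strand-symmetrized variant is not implemented), and the abstract itself does not give the formula the authors used.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def dinucleotide_relative_abundance(seq):
    """Return {dinucleotide: f_xy / (f_x * f_y)} for a DNA sequence.

    Dinucleotides whose expected frequency is zero (because one of the
    bases never occurs) are left out of the result. Windows containing
    symbols other than A, C, G, T are ignored.
    """
    seq = seq.upper()
    mono = Counter(b for b in seq if b in BASES)
    di = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in BASES and seq[i + 1] in BASES
    )
    n_mono = sum(mono.values())
    n_di = sum(di.values())
    profile = {}
    if n_mono == 0 or n_di == 0:
        return profile
    for x, y in product(BASES, repeat=2):
        expected = (mono[x] / n_mono) * (mono[y] / n_mono)
        if expected == 0:
            continue
        profile[x + y] = (di[x + y] / n_di) / expected
    return profile


if __name__ == "__main__":
    for pair, rho in sorted(dinucleotide_relative_abundance("AACCGGTTACGT").items()):
        print(pair, round(rho, 3))
```

A value above 1 indicates a dinucleotide that occurs more often than its base composition alone would predict, and a value below 1 indicates one that occurs less often.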

Subjects

Nucleotides/genetics , Base Sequence , DNA/genetics , DNA Primers , Phylogeny
