Alignment-Free Sequence Comparison: A Systematic Survey From a Machine Learning Perspective.

Bohnsack, Katrin Sophie; Kaden, Marika; Abel, Julia; Villmann, Thomas.

IEEE/ACM Trans Comput Biol Bioinform ; 20(1): 119-135, 2023.

Article in English | MEDLINE | ID: mdl-34990369

ABSTRACT

The encounter of large amounts of biological sequence data generated during the last decades and the algorithmic and hardware improvements have offered the possibility to apply machine learning techniques in bioinformatics. While the machine learning community is aware of the necessity to rigorously distinguish data transformation from data comparison and adopt reasonable combinations thereof, this awareness is often lacking in the field of comparative sequence analysis. With the realization of the disadvantages of alignments for sequence comparison, some typical applications increasingly use so-called alignment-free approaches. In light of this development, we present a conceptual framework for alignment-free sequence comparison, which highlights the delineation of: 1) the sequence data transformation, comprising adequate mathematical sequence coding and feature generation, from 2) the subsequent (dis-)similarity evaluation of the transformed data by means of problem-specific but mathematically consistent proximity measures. We consider coding to be a lossless data transformation that yields an appropriate representation, whereas feature generation is inevitably lossy, with the intention of extracting just the task-relevant information. This distinction sheds light on the plethora of methods available and assists in identifying suitable methods in machine learning and data analysis to compare the sequences under these premises.

Subjects

Algorithms, Machine Learning, Sequence Alignment, Sequence Analysis, Mathematics
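
The distinction drawn in the abstract above (lossless coding, lossy feature generation, then a proximity measure) can be illustrated with a minimal Python sketch. It is not taken from the survey: one-hot coding, relative k-mer frequencies, and the Euclidean distance are assumed stand-ins for the three steps, and all function names are hypothetical.

```python
from itertools import product
import math

ALPHABET = "ACGT"  # assumed nucleotide alphabet for this illustration


def onehot_code(seq, alphabet=ALPHABET):
    """Coding step (lossless): map each symbol to a one-hot vector."""
    return [[1 if s == a else 0 for a in alphabet] for s in seq]


def onehot_decode(code, alphabet=ALPHABET):
    """Invert the coding, showing that no information was lost."""
    return "".join(alphabet[row.index(1)] for row in code)


def kmer_features(seq, k=2, alphabet=ALPHABET):
    """Feature generation (lossy): relative k-mer frequencies."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip words containing unknown symbols
            counts[word] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in kmers]


def euclidean(u, v):
    """Proximity measure: Euclidean distance between feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


if __name__ == "__main__":
    s1, s2 = "ACGTACGT", "ACGTTGCA"
    print(onehot_decode(onehot_code(s1)) == s1)  # coding is invertible
    print(euclidean(kmer_features(s1), kmer_features(s2)))
```

Two different sequences such as "AC" and "CA" receive identical 1-mer frequency vectors, which is the sense in which feature generation discards information while coding does not.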

Learning vector quantization as an interpretable classifier for the detection of SARS-CoV-2 types based on their RNA sequences.

Kaden, Marika; Bohnsack, Katrin Sophie; Weber, Mirko; Kudla, Mateusz; Gutowska, Kaja; Blazewicz, Jacek; Villmann, Thomas.

Neural Comput Appl ; 34(1): 67-78, 2022.

Article in English | MEDLINE | ID: mdl-33935376

ABSTRACT

We present an approach to discriminate SARS-CoV-2 virus types based on their RNA sequence descriptions while avoiding a sequence alignment. For that purpose, sequences are preprocessed by feature extraction and the resulting feature vectors are analyzed by prototype-based classification to remain interpretable. In particular, we propose to use variants of learning vector quantization (LVQ) based on dissimilarity measures for RNA sequence data. The respective matrix LVQ provides additional knowledge about the classification decisions, such as discriminant feature correlations, and can additionally be equipped with easy-to-realize reject options for uncertain data. Those options provide self-controlled evidence, i.e., the model refuses to make a classification decision if the model evidence for the presented data is not sufficient. This model is first trained using a GISAID dataset with given virus types detected according to the molecular differences in coronavirus populations by phylogenetic tree clustering. In a second step, we apply the trained model to another, unlabeled SARS-CoV-2 virus dataset. For these data, we can either assign a virus type to the sequences or reject atypical samples. Those rejected sequences allow speculation about new virus types with respect to nucleotide base mutations in the viral sequences. Moreover, this rejection analysis improves model robustness. Last but not least, the presented approach has lower computational complexity compared to methods based on (multiple) sequence alignment. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1007/s00521-021-06018-2.
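
The reject option described in the abstract above can be sketched as a nearest-prototype decision that abstains when the evidence is weak. This is a generic illustration, not the authors' implementation: it assumes already trained prototypes, a plain squared Euclidean distance instead of the learned matrix distance, and a hypothetical relative-distance certainty with an arbitrary threshold.

```python
def squared_distance(x, w):
    """Squared Euclidean distance between a sample and a prototype."""
    return sum((a - b) ** 2 for a, b in zip(x, w))


def classify_with_reject(x, prototypes, labels, threshold=0.1):
    """Return the label of the nearest prototype, or None to reject.

    The certainty (d_other - d_best) / (d_other + d_best) lies in [0, 1];
    d_best is the distance to the nearest prototype and d_other the
    distance to the nearest prototype of a different class. The sample
    is rejected when the certainty falls below the assumed threshold.
    """
    ranked = sorted(
        ((squared_distance(x, w), c) for w, c in zip(prototypes, labels)),
        key=lambda pair: pair[0],
    )
    d_best, c_best = ranked[0]
    d_other = next((d for d, c in ranked[1:] if c != c_best), None)
    if d_other is None:  # only one class present, nothing to compare
        return c_best
    denom = d_best + d_other
    certainty = (d_other - d_best) / denom if denom > 0 else 0.0
    return c_best if certainty >= threshold else None


if __name__ == "__main__":
    prototypes = [[0.0, 0.0], [1.0, 1.0]]  # hypothetical trained prototypes
    labels = ["type A", "type B"]          # hypothetical class names
    print(classify_with_reject([0.1, 0.0], prototypes, labels))
    print(classify_with_reject([0.5, 0.5], prototypes, labels))
```

A sample close to one prototype is assigned that prototype's label, while a sample equidistant from prototypes of two classes has certainty zero and is rejected.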

The Resolved Mutual Information Function as a Structural Fingerprint of Biomolecular Sequences for Interpretable Machine Learning Classifiers.

Bohnsack, Katrin Sophie; Kaden, Marika; Abel, Julia; Saralajew, Sascha; Villmann, Thomas.

Entropy (Basel) ; 23(10), 2021 Oct 17.

Article in English | MEDLINE | ID: mdl-34682081

ABSTRACT

In the present article, we propose the application of variants of the mutual information function as characteristic fingerprints of biomolecular sequences for classification analysis. In particular, we consider the resolved mutual information functions based on Shannon, Rényi, and Tsallis entropy. In combination with interpretable machine learning classifier models based on generalized learning vector quantization, a powerful methodology for sequence classification is achieved which allows substantial knowledge extraction in addition to the high classification ability due to the model-inherent robustness. Any potential (slightly) inferior performance of the used classifier is compensated by the additional knowledge provided by interpretable models. This knowledge may assist the user in the analysis and understanding of the used data and considered task. After theoretical justification of the concepts, we demonstrate the approach for various example data sets covering different areas in biomolecular sequence analysis.
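
As background for the abstract above, the following Python sketch computes the classical Shannon mutual information function of a sequence, i.e. the mutual information between symbols separated by a lag k, for k = 1 up to a maximum lag. It does not reproduce the resolved variants or the Rényi and Tsallis versions proposed in the article; the function name and the choice of log base 2 are assumptions.

```python
import math
from collections import Counter


def mutual_information_function(seq, max_lag):
    """Shannon mutual information (in bits) for lags 1..max_lag.

    For each lag k, the joint distribution of symbol pairs
    (seq[i], seq[i + k]) is estimated from counts, and the marginals
    are taken from the same pairs so that every value is non-negative.
    """
    values = []
    for k in range(1, max_lag + 1):
        pairs = [(seq[i], seq[i + k]) for i in range(len(seq) - k)]
        n = len(pairs)
        if n == 0:  # lag exceeds the sequence length
            values.append(0.0)
            continue
        joint = Counter(pairs)
        left = Counter(a for a, _ in pairs)
        right = Counter(b for _, b in pairs)
        mi = 0.0
        for (a, b), count in joint.items():
            p_ab = count / n
            mi += p_ab * math.log2(p_ab / ((left[a] / n) * (right[b] / n)))
        values.append(mi)
    return values


if __name__ == "__main__":
    # The resulting vector can serve as a fixed-length sequence fingerprint.
    print(mutual_information_function("ACGTACGTTTGACA", max_lag=4))
```

The output has one entry per lag regardless of sequence length, which is what makes such a function usable as a feature vector for a vector-based classifier.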

Virxicon: a lexicon of viral sequences.

Kudla, Mateusz; Gutowska, Kaja; Synak, Jaroslaw; Weber, Mirko; Bohnsack, Katrin Sophie; Lukasiak, Piotr; Villmann, Thomas; Blazewicz, Jacek; Szachniuk, Marta.

Bioinformatics ; 36(22-23): 5507-5513, 2021 Apr 01.

Article in English | MEDLINE | ID: mdl-33367605

ABSTRACT

MOTIVATION: Viruses are the most abundant biological entities and constitute a large reservoir of genetic diversity. In recent years, knowledge about them has increased significantly as a result of dynamic development in life sciences and rapid technological progress. This knowledge is scattered across various data repositories, making a comprehensive analysis of viral data difficult. RESULTS: In response to the need for gathering a comprehensive knowledge of viruses and viral sequences, we developed Virxicon, a lexicon of all experimentally acquired sequences for RNA and DNA viruses. The ability to quickly obtain data for entire viral groups, searching sequences by levels of taxonomic hierarchy (according to the Baltimore classification and ICTV taxonomy), and tracking the distribution of viral data and its growth over time are unique features of our database compared to the other tools. AVAILABILITY AND IMPLEMENTATION: Virxicon is a publicly available resource, updated weekly. It has an intuitive web interface and can be freely accessed at http://virxicon.cs.put.poznan.pl/.