Search | BVS Regional Portal

Learning vector quantization as an interpretable classifier for the detection of SARS-CoV-2 types based on their RNA sequences.

Kaden, Marika; Bohnsack, Katrin Sophie; Weber, Mirko; Kudla, Mateusz; Gutowska, Kaja; Blazewicz, Jacek; Villmann, Thomas.

Neural Comput Appl ; 34(1): 67-78, 2022.

Article in English | MEDLINE | ID: mdl-33935376

ABSTRACT

We present an approach to discriminate SARS-CoV-2 virus types based on their RNA sequence descriptions while avoiding a sequence alignment. For that purpose, sequences are preprocessed by feature extraction, and the resulting feature vectors are analyzed by prototype-based classification so that the model remains interpretable. In particular, we propose to use variants of learning vector quantization (LVQ) based on dissimilarity measures for RNA sequence data. The respective matrix LVQ provides additional knowledge about the classification decisions, such as discriminant feature correlations, and can additionally be equipped with easy-to-implement reject options for uncertain data. Those options provide self-controlled evidence, i.e., the model refuses to make a classification decision if the model evidence for the presented data is not sufficient. This model is first trained using a GISAID dataset with given virus types detected according to the molecular differences in coronavirus populations by phylogenetic tree clustering. In a second step, we apply the trained model to another, unlabeled SARS-CoV-2 virus dataset. For these data, we can either assign a virus type to the sequences or reject atypical samples. The rejected sequences allow speculation about new virus types with respect to nucleotide base mutations in the viral sequences. Moreover, this rejection analysis improves model robustness. Finally, the presented approach has lower computational complexity than methods based on (multiple) sequence alignment. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1007/s00521-021-06018-2.
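As a rough illustration of the pipeline this abstract describes (alignment-free features, nearest-prototype classification, reject option), the following Python sketch may help. It rests on several assumptions that are not taken from the abstract: k-mer frequencies as the feature extraction, plain Euclidean distance instead of the adaptive matrix distance of matrix LVQ, prototypes that are already given rather than trained, and a relative-distance certainty score with a hypothetical `threshold` as the reject rule. The function names are illustrative only.

```python
from itertools import product
import math


def kmer_frequencies(seq, k=3, alphabet="ACGU"):
    """Relative k-mer frequencies in a fixed order (assumed feature choice)."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows with ambiguous bases such as N
            counts[window] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


def euclidean(a, b):
    """Plain Euclidean distance (matrix LVQ would learn the metric instead)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify_with_reject(x, prototypes, threshold=0.1):
    """Return the label of the nearest prototype, or None to reject.

    prototypes: list of (label, vector) pairs, assumed to be already trained.
    Certainty is (d_other - d_best) / (d_other + d_best), which lies in [0, 1];
    values near 0 mean the sample sits close to a class border.
    """
    best_label, d_best = None, math.inf
    for label, w in prototypes:
        d = euclidean(x, w)
        if d < d_best:
            best_label, d_best = label, d
    # Distance to the closest prototype of any other class.
    d_other = min(
        (euclidean(x, w) for label, w in prototypes if label != best_label),
        default=math.inf,
    )
    if math.isinf(d_other):  # only one class present: nothing to compare
        return best_label
    denom = d_other + d_best
    certainty = (d_other - d_best) / denom if denom > 0 else 0.0
    return best_label if certainty >= threshold else None
```

A sample whose feature vector coincides with a prototype is assigned that prototype's label, whereas a sample equidistant from two classes has certainty 0 and is rejected. In practice the threshold would have to be tuned on held-out data.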

Virxicon: a lexicon of viral sequences.

Kudla, Mateusz; Gutowska, Kaja; Synak, Jaroslaw; Weber, Mirko; Bohnsack, Katrin Sophie; Lukasiak, Piotr; Villmann, Thomas; Blazewicz, Jacek; Szachniuk, Marta.

Bioinformatics ; 36(22-23): 5507-5513, 2021 Apr 01.

Article in English | MEDLINE | ID: mdl-33367605

ABSTRACT

MOTIVATION: Viruses are the most abundant biological entities and constitute a large reservoir of genetic diversity. In recent years, knowledge about them has increased significantly as a result of dynamic development in life sciences and rapid technological progress. This knowledge is scattered across various data repositories, making a comprehensive analysis of viral data difficult. RESULTS: In response to the need for gathering comprehensive knowledge of viruses and viral sequences, we developed Virxicon, a lexicon of all experimentally acquired sequences for RNA and DNA viruses. The ability to quickly obtain data for entire viral groups, to search sequences by levels of taxonomic hierarchy (according to the Baltimore classification and ICTV taxonomy), and to track the distribution of viral data and its growth over time are unique features of our database compared to other tools. AVAILABILITY AND IMPLEMENTATION: Virxicon is a publicly available resource, updated weekly. It has an intuitive web interface and can be freely accessed at http://virxicon.cs.put.poznan.pl/.

Detecting life signatures with RNA sequence similarity measures.

Wasik, Szymon; Szostak, Natalia; Kudla, Mateusz; Wachowiak, Michal; Krawiec, Krzysztof; Blazewicz, Jacek.

J Theor Biol ; 463: 110-120, 2019 02 21.

Article in English | MEDLINE | ID: mdl-30562502

ABSTRACT

The RNA World is currently the most plausible hypothesis for explaining the origins of life on Earth. The supporting body of evidence is growing and comes from multiple areas, including astrobiology, chemistry, biology, mathematics, and, in particular, computer simulations. Such methods frequently assume the existence of a hypothetical species on Earth, around three billion years ago, with a base sequence probably dissimilar from any in known genomes. However, it is often hard to verify whether or not a hypothetical sequence has the characteristics of biological sequences, and is thus likely to be functional. The primary objective of the presented research was to verify the possibility of building a computational 'life probe' for determining whether a given genetic sequence is biological, and to assess the sensitivity of such probes to the signatures of life present in known biological sequences. We have proposed decision algorithms based on the normalized compression distance (NCD) and Levenshtein distance (LD). We have validated the proposed method in the context of the RNA World hypothesis using genetic sequences shorter than the error threshold value (i.e., 100 nucleotides). We have demonstrated that both measures can be successfully used to construct life probes that are significantly better than a random decision procedure, while differing from each other in their detailed characteristics. We also observed that fragments of sequences related to replication have better discriminatory power than sequences having other molecular functions. In a broader context, this shows that the signatures of life in short RNA samples can be effectively detected using relatively simple means.
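The two similarity measures named in this abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes zlib as the compressor for NCD (the abstract does not say which compressor was used) and unit costs for insertions, deletions, and substitutions in the Levenshtein distance. The decision rule that turns these distances into a 'life probe' is not reproduced here.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data (the compressor is an assumption)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance:
    (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    bx, by = x.encode(), y.encode()
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit costs, computed row by row."""
    if len(a) < len(b):
        a, b = b, a  # keep the stored row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion from a
                current[j - 1] + 1,      # insertion into a
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]
```

For sequences under 100 nucleotides, compressor overhead makes up a noticeable share of the compressed size, so NCD values for such inputs are best compared with each other rather than read as absolute distances.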

Subjects

Origin of Life , RNA/genetics , Algorithms , Base Sequence , Computer Simulation , RNA/physiology , Reproduction/genetics
