1.
Comput Struct Biotechnol J ; 23: 2326-2336, 2024 Dec.
Article in English | MEDLINE | ID: mdl-38867722

ABSTRACT

Molecular encodings and their usage in machine learning models have demonstrated significant breakthroughs in biomedical applications, particularly in the classification of peptides and proteins. To this end, we propose a new encoding method: Interpretable Carbon-based Array of Neighborhoods (iCAN). Designed to address machine learning models' need for more structured and less flexible input, it captures the neighborhoods of carbon atoms in a counting array and improves the utility of the resulting encodings for machine learning models. The iCAN method provides interpretable molecular encodings and representations, enabling the comparison of molecular neighborhoods, identification of repeating patterns, and visualization of relevance heat maps for a given data set. When reproducing a large biomedical peptide classification study, it outperforms its predecessor encoding. When extended to proteins, it outperforms a lead structure-based encoding on 71% of the data sets. Our method offers interpretable encodings that can be applied to all organic molecules, including exotic amino acids, cyclic peptides, and larger proteins, making it highly versatile across various domains and data sets. This work establishes a promising new direction for machine learning in peptide and protein classification in biomedicine and healthcare, potentially accelerating advances in drug discovery and disease diagnosis.

2.
Comput Methods Programs Biomed ; 242: 107843, 2023 Dec.
Article in English | MEDLINE | ID: mdl-37832432

ABSTRACT

OBJECTIVE: Evaluating the performance of multiple complex models, such as those found in biology, medicine, climatology, and machine learning, using conventional approaches is often challenging when using various evaluation metrics simultaneously. The traditional approach, which relies on presenting multi-model evaluation scores in a table, presents an obstacle when determining the similarities between the models and the order of performance. METHODS: By combining statistics, information theory, and data visualization, juxtaposed Taylor and Mutual Information Diagrams permit users to track and summarize the performance of one model or a collection of different models. To uncover linear and nonlinear relationships between models, users may visualize one or both charts. RESULTS: Our library presents the first publicly available implementation of the Mutual Information Diagram and its new interactive capabilities, as well as the first publicly available implementation of an interactive Taylor Diagram. Extensions have been implemented so that both diagrams can display temporality, multimodality, and multivariate data sets, and feature one scalar model property such as uncertainty. Our library, named polar-diagrams, supports both continuous and categorical attributes. CONCLUSION: The library can be used to quickly and easily assess the performances of complex models, such as those found in machine learning, climate, or biomedical domains.
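
As an illustration of what a Taylor Diagram summarizes, the following minimal sketch computes the three statistics the diagram relates geometrically: the standard deviations of a reference and a model series, their Pearson correlation, and the centered root-mean-square difference. It uses only the Python standard library and does not show the polar-diagrams API; the function name `taylor_statistics` is a hypothetical helper chosen for this example.

```python
import math


def taylor_statistics(reference, model):
    """Return (std_ref, std_model, correlation, centered_rms_difference).

    Hypothetical helper for illustration; population statistics (divide by n).
    """
    n = len(reference)
    if n == 0 or n != len(model):
        raise ValueError("series must be non-empty and of equal length")

    mean_r = sum(reference) / n
    mean_m = sum(model) / n
    dev_r = [x - mean_r for x in reference]  # deviations from the mean
    dev_m = [x - mean_m for x in model]

    std_r = math.sqrt(sum(d * d for d in dev_r) / n)
    std_m = math.sqrt(sum(d * d for d in dev_m) / n)
    if std_r == 0 or std_m == 0:
        raise ValueError("correlation is undefined for a constant series")

    covariance = sum(a * b for a, b in zip(dev_r, dev_m)) / n
    correlation = covariance / (std_r * std_m)

    # Centered RMS difference: RMS of the difference between the deviations.
    crmsd = math.sqrt(sum((b - a) ** 2 for a, b in zip(dev_r, dev_m)) / n)
    return std_r, std_m, correlation, crmsd


std_r, std_m, corr, crmsd = taylor_statistics([1, 2, 3, 4], [2, 4, 6, 8])
# The diagram rests on the law-of-cosines identity
# crmsd^2 = std_r^2 + std_m^2 - 2 * std_r * std_m * corr.
print(crmsd ** 2, std_r ** 2 + std_m ** 2 - 2 * std_r * std_m * corr)
```

Because the three quantities satisfy this identity, a single point in polar coordinates (radius = standard deviation, angle = arccos of the correlation) encodes all of them at once.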

3.
Comput Struct Biotechnol J ; 21: 1448-1460, 2023.
Article in English | MEDLINE | ID: mdl-36851917

ABSTRACT

With an ever-growing need for data storage capacity, the Deoxyribonucleic Acid (DNA) molecule is gaining traction as a new storage medium with a larger capacity, higher density, and a longer lifespan than conventional storage media. To effectively use DNA for data storage, it is important to understand the different methods of encoding information in DNA and compare their effectiveness. This requires evaluating which decoded DNA sequences carry the most encoded information based on various attributes. However, navigating the field of coding theory requires years of experience and domain expertise. For instance, domain experts rely on various mathematical functions and attributes to score and evaluate their encodings. To enable such analytical tasks, we provide an interactive and visual analytical framework for multi-attribute ranking in DNA storage systems. Our framework follows a three-step view with user-settable parameters. It enables users to find the optimal en-/de-coding approaches by setting different weights and combining multiple attributes. We assess the validity of our work through a task-specific user study with domain experts, relying on three tasks. Results indicate that all participants completed their tasks successfully in under two minutes and then rated the framework for design choices, perceived usefulness, and intuitiveness. In addition, two real-world use cases are shared and analyzed as direct applications of the proposed tool. DNAsmart enables the ranking of decoded sequences based on multiple attributes. In sum, this work makes the evaluation of en-/de-coding approaches accessible and tractable through visualization and interactivity to solve comparison and ranking tasks.
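
The idea of ranking candidates by user-weighted attributes can be sketched as a weighted sum over min-max normalized attribute values. This is a generic illustration only and is not taken from DNAsmart: the function name, the attribute names (`similarity`, `gc_balance`), and the assumption that higher values are better are all hypothetical.

```python
def rank_by_weighted_score(items, weights):
    """Rank items by a weighted sum of min-max normalized attributes.

    items:   dict mapping item name -> dict of attribute -> numeric value
             (assumption: higher values are better for every attribute)
    weights: dict mapping attribute -> non-negative weight
    Returns a list of (name, score) pairs, best first, scores in [0, 1].
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("at least one weight must be positive")

    # Per-attribute value range, used for min-max normalization.
    bounds = {}
    for attribute in weights:
        values = [attrs[attribute] for attrs in items.values()]
        bounds[attribute] = (min(values), max(values))

    scores = {}
    for name, attrs in items.items():
        score = 0.0
        for attribute, weight in weights.items():
            low, high = bounds[attribute]
            normalized = 0.0 if high == low else (attrs[attribute] - low) / (high - low)
            score += weight * normalized
        scores[name] = score / total_weight

    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


candidates = {
    "seq_a": {"similarity": 0.9, "gc_balance": 0.2},
    "seq_b": {"similarity": 0.5, "gc_balance": 0.8},
}
print(rank_by_weighted_score(candidates, {"similarity": 3, "gc_balance": 1}))
print(rank_by_weighted_score(candidates, {"similarity": 1, "gc_balance": 3}))
```

Changing the weights reverses the order of the two candidates, which is the kind of trade-off an interactive ranking view lets a user explore.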

4.
NAR Genom Bioinform ; 5(1): lqac103, 2023 Mar.
Article in English | MEDLINE | ID: mdl-36632611

ABSTRACT

Exploring new ways to represent and discover organic molecules is critical to the development of new therapies. Fingerprinting algorithms are used to encode or machine-read organic molecules. Molecular encodings facilitate the computation of distance and similarity measurements to support tasks such as similarity search or virtual screening. Motivated by the ubiquity of carbon and the emerging structured patterns, we propose a parametric approach for molecular encodings using carbon-based multilevel atomic neighborhoods. It implements a walk along the carbon chain of a molecule to compute different representations of the neighborhoods in the form of a binary or numerical array that can later be exported into an image. Applied to the task of binary peptide classification, the evaluation was performed by using forty-nine encodings of twenty-nine data sets from various biomedical fields, resulting in well over 1421 machine learning models. By design, the parametric approach is domain- and task-agnostic and scopes all organic molecules including unnatural and exotic amino acids as well as cyclic peptides. Applied to peptide classification, our results point to a number of promising applications and extensions. The parametric approach was developed as a Python package (cmangoes), the source code and documentation of which can be found at https://github.com/ghattab/cmangoes and https://doi.org/10.5281/zenodo.7483771.

5.
Mater Today Bio ; 15: 100306, 2022 Jun.
Article in English | MEDLINE | ID: mdl-35677811

ABSTRACT

Deoxyribonucleic acid (DNA) is increasingly emerging as a serious medium for long-term archival data storage because of its remarkable high-capacity, high-storage-density characteristics and its lasting ability to store data for thousands of years. Various encoding algorithms are generally required to store digital information in DNA and to maintain data integrity. Indeed, since DNA is the information carrier, its performance under different processing and storage conditions significantly impacts the capabilities of the data storage system. Therefore, the design of a DNA storage system must meet specific design considerations to be less error-prone, robust and reliable. In this work, we summarize the general processes and technologies employed when using synthetic DNA as a storage medium. We also share the design considerations for sustainable engineering to include viability. We expect this work to provide insight into how sustainable design can be used to develop an efficient and robust synthetic DNA-based storage system for long-term archiving.

6.
Front Genet ; 13: 891240, 2022.
Article in English | MEDLINE | ID: mdl-35664339

ABSTRACT

Sustained efforts in next-generation sequencing technologies are changing the field of taxonomy. The increase in the number of resolved genomes has made the traditional taxonomy of species antiquated. With phylogeny-based methods, taxonomies are being updated and refined. Although such methods bridge the gap between phylogeny and taxonomy, phylogeny-based taxonomy currently lacks interactive visualization approaches. Motivated by enriching and increasing the consistency of evolutionary and taxonomic studies alike, we propose Context-Aware Phylogenetic Trees (CAPT) as an interactive web tool to support users in exploration- and validation-based tasks. To complement phylogenetic information with phylogeny-based taxonomy, we offer linking two interactive visualizations which compose two simultaneous views: the phylogenetic tree view and the taxonomic icicle view. Thanks to its space-filling properties, the icicle visualization follows the intuition behind taxonomies where different hierarchical rankings with equal number of child elements can be represented with same-sized rectangular areas. In other words, it provides partitions of different sizes depending on the number of elements they contain. The icicle view integrates seven taxonomic rankings: domain, phylum, class, order, family, genus, and species. CAPT enriches the clades in the phylogenetic tree view with context from the genomic data and supports interactive techniques such as linking and brushing to highlight correspondence between the two views. Four different use cases, extracted from the Genome Taxonomy DataBase, were employed to create four scenarios using our approach. CAPT was successfully used to explore the phylogenetic trees as well as the taxonomic data by providing context and using the interaction techniques. This tool is essential to increase the accuracy of categorization of newly identified species and validate updated taxonomies. 
The source code and data are freely available at https://github.com/ghattab/CAPT.

7.
Comput Struct Biotechnol J ; 20: 1044-1055, 2022.
Article in English | MEDLINE | ID: mdl-35284047

ABSTRACT

Thanks to recent advances in sequencing and computational technologies, many researchers with biological and/or medical backgrounds are now producing multiple data sets with an embedded temporal dimension. Multi-modalities enable researchers to explore and investigate different biological and physico-chemical processes with various technologies. Motivated to explore multi-omics data and time-series multi-omics specifically, the exploration process has been hindered by the separation introduced by each omics-type. To effectively explore such temporal data sets, discover anomalies, find patterns, and better understand their intricacies, expertise in computer science and bioinformatics is required. Here we present MOVIS, a modular time-series multi-omics exploration tool with a user-friendly web interface that facilitates the data exploration of such data. It brings into equal participation each time-series omic-type for analysis and visualization. As of the time of writing, two time-series multi-omics data sets have been integrated and successfully reproduced. The resulting visualizations are task-specific, reproducible, and publication-ready. MOVIS is built on open-source software and is easily extendable to accommodate different analytical tasks. An online version of MOVIS is available under https://movis.mathematik.uni-marburg.de/ and on Docker Hub (https://hub.docker.com/r/aanzel/movis).

8.
Sci Total Environ ; 825: 153732, 2022 Jun 15.
Article in English | MEDLINE | ID: mdl-35157872

ABSTRACT

Microbes are essential for element cycling and ecosystem functioning. However, many questions central to understanding the role of microbes in ecology are still open. Here, we analyze the relationship between lake microbiomes and the lakes' land cover. By applying machine learning methods, we quantify the covariance between land cover categories and the microbial community composition recorded in the largest amplicon sequencing dataset of European lakes available to date. Our results show that the aggregation of environmental features or microbial taxa before analysis can obscure ecologically relevant patterns. We observe a comparatively high covariation of the lakes' microbial community with herbaceous and open spaces surrounding the lake; nevertheless, the microbial covariation with land cover categories is generally lower than the covariation with physico-chemical parameters. Combining land cover and physico-chemical bioindicators identified from the same amplicon sequencing dataset, we develop analytical data structures that facilitate insights into the ecology of the lake microbiome. Among these, a list of the environmental parameters sorted by the number of microbial bioindicators we have identified for them points towards apparent environmental drivers of the lake microbial community composition, such as the altitude, conductivity, and area covered by herbaceous vegetation surrounding the lake. Furthermore, the response map, a similarity matrix calculated from the Jaccard similarity of the environmental parameters' lists of bioindicators, allows us to study the ecosystem's structure from the standpoint of the microbiome. More specifically, we identify multiple clusters of highly similar and possibly functionally linked ecological parameters, including one that highlights the importance of the calcium-bicarbonate equilibrium for lake ecology.
Taken together, we demonstrate the use of machine learning approaches in studying the interplay between microbial diversity and environmental factors and introduce novel approaches to integrate environmental molecular diversity into monitoring and water quality assessments.
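
The response map described in this abstract is a matrix of pairwise Jaccard similarities between the bioindicator lists of environmental parameters. The sketch below shows that computation on toy data; the parameter names and OTU identifiers are invented for illustration, the function names are hypothetical, and treating two empty lists as having similarity 0.0 is an assumption of this example.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # assumed convention for two empty lists
    return len(a & b) / len(a | b)


def response_map(bioindicators):
    """Return (names, matrix) where matrix[i][j] is the Jaccard similarity
    between the bioindicator lists of parameters names[i] and names[j]."""
    names = sorted(bioindicators)
    matrix = [
        [jaccard(bioindicators[row], bioindicators[col]) for col in names]
        for row in names
    ]
    return names, matrix


# Toy data: made-up OTU identifiers, not results from the study.
toy = {
    "altitude": {"otu1", "otu2", "otu3"},
    "conductivity": {"otu2", "otu3", "otu4"},
    "area": {"otu9"},
}
names, matrix = response_map(toy)
for name, row in zip(names, matrix):
    print(name, row)
```

Parameters whose bioindicator lists overlap strongly receive values close to 1, so clustering the rows of such a matrix groups parameters to which the microbial community responds similarly.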


Subjects
Lakes, Microbiota, Environmental Biomarkers, Water Quality

9.
Nucleic Acids Res ; 50(5): e30, 2022 03 21.
Article in English | MEDLINE | ID: mdl-34908135

ABSTRACT

The use of complex biological molecules to solve computational problems is an emerging field at the interface between biology and computer science. There are two main categories in which biological molecules, especially DNA, are investigated as alternatives to silicon-based computer technologies. One is to use DNA as a storage medium, and the other is to use DNA for computing. Both strategies come with certain constraints. In the current study, we present a novel approach derived from chaos game representation for DNA to generate DNA code words that fulfill user-defined constraints, namely GC content, homopolymers, and undesired motifs, and thus, can be used to build codes for reliable DNA storage systems.
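
The three constraints named in this abstract (GC content, homopolymers, and undesired motifs) can be checked for a candidate code word as in the sketch below. This is only a validity check, not the chaos-game-representation approach of the paper; the function names and the default thresholds (GC content between 0.4 and 0.6, homopolymer runs of at most 3) are assumptions chosen for illustration.

```python
def gc_content(sequence):
    """Fraction of G and C bases in a non-empty DNA sequence."""
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def longest_homopolymer(sequence):
    """Length of the longest run of identical consecutive bases."""
    longest = run = 1
    for previous, current in zip(sequence, sequence[1:]):
        run = run + 1 if current == previous else 1
        longest = max(longest, run)
    return longest


def satisfies_constraints(sequence, gc_min=0.4, gc_max=0.6,
                          max_homopolymer=3, forbidden_motifs=()):
    """Check a code word against illustrative, user-defined constraints."""
    if not sequence:
        return False
    sequence = sequence.upper()
    if not gc_min <= gc_content(sequence) <= gc_max:
        return False
    if longest_homopolymer(sequence) > max_homopolymer:
        return False
    return not any(motif.upper() in sequence for motif in forbidden_motifs)


print(satisfies_constraints("ACGTACGT"))
print(satisfies_constraints("AAAACGCG"))
print(satisfies_constraints("ACGTACGT", forbidden_motifs=("GTAC",)))
```

A generator of code words would need to produce only sequences for which such a check succeeds; filtering random sequences with it is the naive baseline that constraint-aware constructions aim to improve on.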


Subjects
Computational Biology/methods, DNA, Fractals

10.
Comput Struct Biotechnol J ; 19: 5504-5509, 2021.
Article in English | MEDLINE | ID: mdl-34712396

ABSTRACT

Due to the rapidly growing amount of available genomic information, the need for accessible and easy-to-use analysis tools is increasing. To facilitate eukaryotic genome annotations, we created MOSGA. In this work, we show how MOSGA 2 is developed by including several advanced analyses for genomic data. Since the genomic data quality greatly impacts the annotation quality, we included multiple tools to validate and ensure high-quality user-submitted genome assemblies. Moreover, thanks to the integration of comparative genomics methods, users can benefit from a broader genomic view by analyzing multiple genomic data sets simultaneously. Further, we demonstrate the new functionalities of MOSGA 2 through different use cases and practical examples. MOSGA 2 extends the already established application to the quality control of the genomic data and integrates and analyzes multiple genomes in a larger context, e.g., by phylogenetics.

11.
Comput Struct Biotechnol J ; 19: 4904-4918, 2021.
Article in English | MEDLINE | ID: mdl-34527195

ABSTRACT

About fifty times more data has been created than there are stars in the observable universe. Current trends in data creation and consumption mean that the devices and storage media we use will require more physical space. Novel data storage media such as DNA are considered a viable alternative. Yet, the introduction of new storage technologies should be accompanied by an evaluation of user requirements. To assess such needs, we designed and conducted a survey to rank different storage properties adapted for visualization. That is, accessibility, capacity, usage, mutability, lifespan, addressability, and typology. Withal, we reported different storage devices over time while ranking them by their properties. Our results indicated a timeline of three distinct periods: magnetic, optical and electronic, and alternative media. Moreover, by investigating user interfaces across different operating systems, we observed a predominant presence of bar charts and tree maps for the usage of a medium and its file directory hierarchy, respectively. Taken together with the results of our survey, this allowed us to create a customized user interface that includes data visualizations that can be toggled for both user groups: Experts and Public.

12.
BMC Med Imaging ; 21(1): 119, 2021 08 05.
Article in English | MEDLINE | ID: mdl-34353290

ABSTRACT

BACKGROUND: Object detection and image segmentation of regions of interest provide the foundation for numerous pipelines across disciplines. Robust and accurate computer vision methods are needed to properly solve image-based tasks. Multiple algorithms have been developed to solely detect edges in images. Constrained to the problem of creating a thin, one-pixel wide, edge from a predicted object boundary, we require an algorithm that removes pixels while preserving the topology. Thanks to skeletonize algorithms, an object boundary is transformed into an edge; contrasting uncertainty with exact positions. METHODS: To extract edges from boundaries generated from different algorithms, we present a computational pipeline that relies on: a novel skeletonize algorithm, a non-exhaustive discrete parameter search to find the optimal parameter combination of a specific post-processing pipeline, and an extensive evaluation using three data sets from the medical and natural image domains (kidney boundaries, NYU-Depth V2, BSDS 500). While the skeletonize algorithm was compared to classical topological skeletons, the validity of our post-processing algorithm was evaluated by integrating the original post-processing methods from six different works. RESULTS: Using the state of the art metrics, precision and recall based Signed Distance Error (SDE) and the Intersection over Union bounding box (IOU-box), our results indicate that the SDE metric for these edges is improved up to 2.3 times. CONCLUSIONS: Our work provides guidance for parameter tuning and algorithm selection in the post-processing of predicted object boundaries.
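
For readers unfamiliar with the Intersection over Union measure mentioned among the metrics, the sketch below computes the standard IoU of two axis-aligned bounding boxes. It illustrates the general definition only; the exact IOU-box metric used in the study may differ in detail, and the `(x_min, y_min, x_max, y_max)` box layout and the function name are assumptions of this example.

```python
def box_iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes.

    Each box is (x_min, y_min, x_max, y_max) with x_min <= x_max and
    y_min <= y_max (assumed layout for this sketch).
    """
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b

    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    intersection = inter_w * inter_h

    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap area 1, union area 7
print(box_iou((0, 0, 2, 2), (0, 0, 2, 2)))  # identical boxes
print(box_iou((0, 0, 1, 1), (2, 2, 3, 3)))  # disjoint boxes
```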


Subjects
Algorithms, Diagnostic Imaging/methods, Computer-Assisted Image Processing/methods, Computer-Assisted Surgery, Humans

13.
Sci Rep ; 11(1): 13440, 2021 06 29.
Article in English | MEDLINE | ID: mdl-34188080

ABSTRACT

Recent technological advances have made Virtual Reality (VR) attractive in both research and real-world applications such as training, rehabilitation, and gaming. Although these other fields benefited from VR technology, it remains unclear whether VR contributes to better spatial understanding and training in the context of surgical planning. In this study, we evaluated the use of VR by comparing the recall of spatial information in two learning conditions: a head-mounted display (HMD) and a desktop screen (DT). Specifically, we explored (a) a scene understanding and then (b) a direction estimation task using two 3D models (i.e., a liver and a pyramid). In the scene understanding task, participants had to navigate the rendered 3D models by means of rotation, zoom and transparency in order to substantially identify the spatial relationships among their internal objects. In the subsequent direction estimation task, participants had to point at a previously identified target object, i.e., internal sphere, on a materialized 3D-printed version of the model using a tracked pointing tool. Results showed that the learning condition (HMD or DT) did not influence participants' memory and confidence ratings of the models. In contrast, the model type, that is, whether the model to be recalled was a liver or a pyramid, significantly affected participants' memory about the internal structure of the model. Furthermore, localizing the internal position of the target sphere was also unaffected by participants' previous experience of the model via HMD or DT. Overall, results provide novel insights on the use of VR in a surgical planning scenario and have paramount implications in medical learning by shedding light on the mental model we make to recall spatial structures.

14.
NAR Genom Bioinform ; 3(2): lqab039, 2021 Jun.
Article in English | MEDLINE | ID: mdl-34046590

ABSTRACT

Owing to the great variety of distinct peptide encodings, working on a biomedical classification task at hand is challenging. Researchers have to determine encodings capable of representing underlying patterns as numerical input for the subsequent machine learning. A general guideline is lacking in the literature; thus, we present here the first large-scale comprehensive study to investigate the performance of a wide range of encodings on multiple datasets from different biomedical domains. For the sake of completeness, we added additional sequence- and structure-based encodings. In particular, we collected 50 biomedical datasets and defined a fixed parameter space for 48 encoding groups, leading to a total of 397 700 encoded datasets. Our results demonstrate that none of the encodings are superior for all biomedical domains. Nevertheless, some encodings often outperform others, thus reducing the initial encoding selection substantially. Our work enables researchers to objectively compare novel encodings to the state of the art. Our findings pave the way for a more sophisticated encoding optimization, for example, as part of automated machine learning pipelines. The work presented here is implemented as a large-scale, end-to-end workflow designed for easy reproducibility and extensibility. All standardized datasets and results are available for download to comply with FAIR standards.

15.
Sci Rep ; 11(1): 8134, 2021 04 14.
Article in English | MEDLINE | ID: mdl-33854157

ABSTRACT

Predicting if a set of mushrooms is edible or not corresponds to the task of classifying them into two groups (edible or poisonous) on the basis of a classification rule. To support this binary task, we have collected the largest and most comprehensive attribute-based data available. In this work, we detail the creation, curation and simulation of a data set for binary classification. Thanks to natural language processing, the primary data are based on a textbook for mushroom identification and contain 173 species from 23 families. The secondary data comprise simulated or hypothetical entries that are structurally comparable to the 1987 data and serve as pilot data for classification tasks. We evaluated different machine learning algorithms, namely naive Bayes, logistic regression, linear discriminant analysis (LDA), and random forests (RF). We found that the RF provided the best results with a five-fold Cross-Validation accuracy and F2-score of 1.0 ([Formula: see text], [Formula: see text]), respectively. The results of our pilot are conclusive and indicate that our data were not linearly separable, unlike the 1987 data, which showed good results using a linear decision boundary with the LDA. Our data set contains 23 families and is the largest available. We further provide a fully reproducible workflow and provide the data under the FAIR principles.
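
The F2-score reported in this abstract is the F-beta score with beta = 2, which weights recall more heavily than precision (a sensible choice when missing a poisonous mushroom is the costlier error). The sketch below computes it from confusion-matrix counts; the function name and the example counts are hypothetical and are not results from the study.

```python
def f_beta(true_positives, false_positives, false_negatives, beta=2.0):
    """F-beta score from confusion-matrix counts.

    Equivalent to (1 + beta^2) * P * R / (beta^2 * P + R), where P is
    precision and R is recall. Returns 0.0 when the score is undefined.
    """
    beta_squared = beta * beta
    numerator = (1 + beta_squared) * true_positives
    denominator = numerator + beta_squared * false_negatives + false_positives
    return numerator / denominator if denominator else 0.0


# Made-up counts for illustration.
print(f_beta(8, 2, 2))   # precision = recall = 0.8
print(f_beta(6, 4, 0))   # precision 0.6, recall 1.0: recall dominates
print(f_beta(10, 0, 0))  # perfect classification
```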


Subjects
Agaricales/classification, Data Curation/methods, Algorithms, Bayes Theorem, Computer Simulation, Factual Databases, Discriminant Analysis, Logistic Models, Machine Learning, Pilot Projects

16.
PLoS Comput Biol ; 17(4): e1008901, 2021 Apr.
Article in English | MEDLINE | ID: mdl-33822781

ABSTRACT

[This corrects the article DOI: 10.1371/journal.pcbi.1008259.].

17.
Mol Ecol ; 30(9): 2131-2144, 2021 05.
Article in English | MEDLINE | ID: mdl-33682183

ABSTRACT

It is known that microorganisms are essential for the functioning of ecosystems, but the extent to which microorganisms respond to different environmental variables in their natural habitats is not clear. In the current study, we present a methodological framework to quantify the covariation of the microbial community of a habitat and environmental variables of this habitat. It is built on theoretical considerations of systems ecology, makes use of state-of-the-art machine learning techniques and can be used to identify bioindicators. We apply the framework to a data set containing operational taxonomic units (OTUs) as well as more than twenty physicochemical and geographic variables measured in a large-scale survey of European lakes. While a large part of variation (up to 61%) in many environmental variables can be explained by microbial community composition, some variables do not show significant covariation with the microbial lake community. Moreover, we have identified OTUs that act as "multitask" bioindicators, i.e., that are indicative for multiple environmental variables, and thus could be candidates for lake water monitoring schemes. Our results represent, for the first time, a quantification of the covariation of the lake microbiome and a wide array of environmental variables for lake ecosystems. Building on the results and methodology presented here, it will be possible to identify microbial taxa and processes that are essential for functioning and stability of lake ecosystems.


Subjects
Lakes, Microbiota, Ecology, Machine Learning, Microbiota/genetics

18.
Brief Bioinform ; 22(2): 642-663, 2021 03 22.
Article in English | MEDLINE | ID: mdl-33147627

ABSTRACT

SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) is a novel virus of the family Coronaviridae. The virus causes the infectious disease COVID-19. The biology of coronaviruses has been studied for many years. However, bioinformatics tools designed explicitly for SARS-CoV-2 have only recently been developed as a rapid reaction to the need for fast detection, understanding and treatment of COVID-19. To control the ongoing COVID-19 pandemic, it is of utmost importance to get insight into the evolution and pathogenesis of the virus. In this review, we cover bioinformatics workflows and tools for the routine detection of SARS-CoV-2 infection, the reliable analysis of sequencing data, the tracking of the COVID-19 pandemic and evaluation of containment measures, the study of coronavirus evolution, the discovery of potential drug targets and development of therapeutic strategies. For each tool, we briefly describe its use case and how it advances research specifically for SARS-CoV-2. All tools are free to use and available online, either through web applications or public code repositories. Contact: evbc@unj-jena.de.


Subjects
COVID-19/prevention & control, Computational Biology, SARS-CoV-2/isolation & purification, Biomedical Research, COVID-19/epidemiology, COVID-19/virology, Viral Genome, Humans, Pandemics, SARS-CoV-2/genetics

19.
Bioinformatics ; 36(22-23): 5514-5515, 2021 04 01.
Article in English | MEDLINE | ID: mdl-33258916

ABSTRACT

MOTIVATION: The generation of high-quality assemblies, even for large eukaryotic genomes, has become a routine task for many biologists thanks to recent advances in sequencing technologies. However, the annotation of these assemblies, a crucial step toward unlocking the biology of the organism of interest, has remained a complex challenge that often requires advanced bioinformatics expertise. RESULTS: Here, we present MOSGA (Modular Open-Source Genome Annotator), a genome annotation framework for eukaryotic genomes with a user-friendly web-interface that generates and integrates annotations from various tools. The aggregated results can be analyzed with a fully integrated genome browser and are provided in a format ready for submission to NCBI. MOSGA is built on a portable, customizable and easily extendible Snakemake backend, and thus, can be tailored to a wide range of users and projects. AVAILABILITY AND IMPLEMENTATION: We provide MOSGA as a web service at https://mosga.mathematik.uni-marburg.de and as a docker container at registry.gitlab.com/mosga/mosga:latest. Source code can be found at https://gitlab.com/mosga/mosga. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Genome, Software, Eukaryota