Results 1 - 20 of 128
1.
Clin Nutr ESPEN ; 63: 311-321, 2024 Jul 02.
Article in English | MEDLINE | ID: mdl-38964656

ABSTRACT

BACKGROUND AND AIMS: To investigate associations between Single Nucleotide Polymorphisms (SNPs) in the TAS1R and TAS2R taste receptors and diet quality, intake of alcohol, added sugar, and fat, using linear regression and machine learning techniques in a highly admixed population. METHODS: In the ISA-Capital health survey, 901 individuals were interviewed and had socioeconomic, demographic, health characteristics, along with dietary information obtained through two 24-h recalls. Data on 12 components related to food groups, nutrients, and calories was combined into a diet quality score (BHEI-R). BHEI-R, SoFAAs (calories from added sugar, saturated fat, and alcohol) and Alcohol use were tested for associations with 255 TAS2R SNPs and 73 TAS1R SNPs for 637 individuals with regression analysis and Random Forest. Significant SNPs were combined into Genetic taste scores (GTSs). RESULTS: Among 23 SNPs significantly associated either by stepwise linear/logistic regression or random forest with any possible biological functionality, the missense variants rs149217752 in TAS2R40, for SoFAAs, and rs2233997 in TAS2R4, were associated with both BHEI-R (under 4% increase in Mean Squared Error) and SoFAAs. GTSs increased the variance explanation of quantitative phenotypes and there was a moderately high AUC for alcohol use. CONCLUSIONS: The study provides insights into the genetic basis of human taste perception through the identification of missense variants in the TAS2R gene family. These findings may contribute to future strategies in precision nutrition aimed at improving food quality by reducing added sugar, saturated fat, and alcohol intake.

2.
Res Sq ; 2024 Jun 11.
Article in English | MEDLINE | ID: mdl-38947037

ABSTRACT

Effective prevention of cardiac malformations, a leading cause of infant morbidity, is constrained by limited understanding of etiology. The study objective was to screen for associations between maternal and paternal characteristics and cardiac malformations. We selected 720,381 pregnancies linked to live-born infants (n=9,076 cardiac malformations) in 2011-2021 MarketScan US insurance claims data. Odds ratios were estimated with clinical diagnostic and medication codes using logistic regression. Screening of 2,000 associations selected 81 associated codes at the 5% false discovery rate. Grouping of selected codes, using latent semantic analysis and the Apriori-SD algorithm, identified elevated risk with known risk factors, including maternal diabetes and chronic hypertension. Less recognized potential signals included maternal fingolimod or azathioprine use. Signals identified might be explained by confounding, measurement error, and selection bias and warrant further investigation. The screening methods employed identified known risk factors, suggesting potential utility for identifying novel risk factors for other pregnancy outcomes.
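
The abstract reports selecting codes at a 5% false discovery rate but does not name the procedure used. As a minimal sketch, assuming the Benjamini-Hochberg step-up procedure (an assumption, not confirmed by the source), the selection step could look like the following. The function name and the p-values are illustrative only.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return the sorted indices of hypotheses rejected at the given FDR.

    Assumption: Benjamini-Hochberg step-up procedure; the abstract does
    not state which FDR method was applied.
    """
    m = len(p_values)
    # Rank p-values from smallest to largest, remembering original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value satisfies p_(k) <= (k / m) * fdr.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff = rank
    # Reject every hypothesis ranked at or below that cutoff.
    return sorted(order[:cutoff])


# Made-up p-values, for illustration only.
example_p = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
selected = benjamini_hochberg(example_p, fdr=0.05)
print(selected)
```

The step-up rule rejects all hypotheses up to the largest passing rank, even if an intermediate rank fails its own threshold.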

3.
iScience ; 27(5): 109575, 2024 May 17.
Article in English | MEDLINE | ID: mdl-38638577

ABSTRACT

DNA, with its high storage density and long-term stability, is a potential candidate for a next-generation storage device. The DNA data storage channel, composed of synthesis, amplification, storage, and sequencing, exhibits error probabilities and error profiles specific to the components of the channel. Here, we present Autoturbo-DNA, a PyTorch framework for training error-correcting, overcomplete autoencoders specifically tailored for the DNA data storage channel. It allows training different architecture combinations and using a wide variety of channel component models for noise generation during training. It further supports training the encoder to generate DNA sequences that adhere to user-defined constraints. Autoturbo-DNA exhibits error-correction capabilities close to non-neural-network state-of-the-art error correction and constrained codes for DNA data storage. Our results indicate that neural-network-based codes can be a viable alternative to traditionally designed codes for the DNA data storage channel.

5.
Comput Biol Med ; 171: 108185, 2024 Mar.
Article in English | MEDLINE | ID: mdl-38401454

ABSTRACT

BACKGROUND: Streptococcus agalactiae, commonly known as Group B Streptococcus (GBS), exhibits a broad host range, manifesting as both a beneficial commensal and an opportunistic pathogen across various species. In humans, it poses significant risks, causing neonatal sepsis and meningitis, along with severe infections in adults. Additionally, it impacts livestock by inducing mastitis in bovines and contributing to epidemic mortality in fish populations. Despite its wide host spectrum, the mechanisms enabling GBS to adapt to specific hosts remain inadequately elucidated. Therefore, the development of a rapid and accurate method that differentiates GBS strains associated with particular animal hosts based on genome-wide information holds immense potential. Such a tool would not only bolster the identification and containment efforts during GBS outbreaks but also deepen our comprehension of the bacteria's host adaptations spanning humans, livestock, and other natural animal reservoirs. METHODS AND RESULTS: Here, we developed three machine learning models, random forest (RF), logistic regression (LR), and support vector machine (SVM), based on genome-wide mutation data. These models enabled precise prediction of the host origin of GBS, accurately distinguishing between human, bovine, fish, and pig hosts. Moreover, we conducted an interpretable machine learning analysis using SHapley Additive exPlanations (SHAP) and variant annotation to uncover the most influential genomic features and associated genes for each host. Additionally, by meticulously examining misclassified samples, we gained valuable insights into the dynamics of host transmission and the potential for zoonotic infections. CONCLUSIONS: Our study underscores the effectiveness of random forest (RF) and logistic regression (LR) models based on mutation data for accurately predicting GBS host origins. Additionally, we identify the key features associated with each GBS host, thereby enhancing our understanding of the bacteria's host-specific adaptations.


Subjects
Streptococcal Infections , Streptococcus agalactiae , Female , Adult , Animals , Humans , Cattle , Swine , Streptococcus agalactiae/genetics , Streptococcal Infections/veterinary , Genomics , Fishes , Machine Learning
6.
Comput Struct Biotechnol J ; 23: 732-741, 2024 Dec.
Article in English | MEDLINE | ID: mdl-38298179

ABSTRACT

The availability of high-throughput sequencing tools coupled with the declining costs in the production of DNA sequences has led to the generation of enormous amounts of omics data curated in several databases such as NCBI and EMBL. Identification of similar DNA sequences from these databases is one of the fundamental tasks in bioinformatics. It is essential for discovering homologous sequences in organisms, phylogenetic studies of evolutionary relationships among several biological entities, or detection of pathogens. Improving DNA similarity search is of utmost importance because of the increased complexity of the ever-growing repositories of sequences. Therefore, instead of using the conventional approach of comparing raw sequences, e.g., in FASTA format, a numerical representation of the sequences can be used to calculate their similarities and optimize the search process. In this study, we analyzed different approaches for numerical embeddings, including Chaos Game Representation, hashing, and neural networks, and compared them with classical approaches such as principal component analysis. It turned out that neural networks generate embeddings that are able to capture the similarity between DNA sequences as a distance measure and significantly outperform the other approaches on DNA similarity search.
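
To make the idea of a numerical embedding concrete, here is a minimal sketch of one of the approaches named above, a frequency Chaos Game Representation, followed by a Euclidean distance between two embeddings. The corner assignment, grid resolution, handling of ambiguous bases, and all function names are assumptions chosen for illustration; the abstract does not specify how the study implemented it.

```python
import math

# Corner assignment follows a common convention; it is an assumption here.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def fcgr_embedding(sequence, k=2):
    """Flattened frequency Chaos Game Representation on a 2^k x 2^k grid."""
    size = 2 ** k
    counts = [0] * (size * size)
    x, y = 0.5, 0.5  # start in the centre of the unit square
    n_points = 0
    for base in sequence.upper():
        if base not in CORNERS:
            continue  # simplification: skip ambiguous symbols such as N
        cx, cy = CORNERS[base]
        # Move halfway from the current point towards the base's corner.
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        # Clamp to the last cell in case rounding pushes a coordinate to 1.0.
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        counts[row * size + col] += 1
        n_points += 1
    if n_points == 0:
        return [0.0] * len(counts)
    # Normalise so sequences of different lengths are comparable.
    return [c / n_points for c in counts]


def embedding_distance(seq_a, seq_b, k=2):
    """Euclidean distance between the FCGR embeddings of two sequences."""
    return math.dist(fcgr_embedding(seq_a, k), fcgr_embedding(seq_b, k))


print(embedding_distance("ACGTACGTAC", "ACGTACGTTC"))
```

Each sequence becomes a fixed-length vector regardless of its length, so similarity search reduces to comparing vectors rather than aligning raw sequences.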

8.
Bioinformatics ; 39(11), 2023 Nov 01.
Article in English | MEDLINE | ID: mdl-37988152

ABSTRACT

SUMMARY: Federated learning enables collaboration in medicine, where data is scattered across multiple centers without the need to aggregate the data in a central cloud. While, in general, machine learning models can be applied to a wide range of data types, graph neural networks (GNNs) are particularly developed for graphs, which are very common in the biomedical domain. For instance, a patient can be represented by a protein-protein interaction (PPI) network where the nodes contain the patient-specific omics features. Here, we present our Ensemble-GNN software package, which can be used to deploy federated, ensemble-based GNNs in Python. Ensemble-GNN allows users to quickly build predictive models utilizing PPI networks consisting of various node features such as gene expression and/or DNA methylation. As an example, we show the results from a public dataset of 981 patients and 8469 genes from the Cancer Genome Atlas (TCGA). AVAILABILITY AND IMPLEMENTATION: The source code is available at https://github.com/pievos101/Ensemble-GNN, and the data at Zenodo (DOI: 10.5281/zenodo.8305122).


Subjects
DNA Methylation , Machine Learning , Humans , Neural Networks, Computer , Protein Interaction Maps , Software
9.
Infection ; 51(6): 1809-1818, 2023 Dec.
Article in English | MEDLINE | ID: mdl-37828369

ABSTRACT

PURPOSE AND METHODS: The emergence of coronavirus disease 2019 (COVID-19) has once again affirmed the significant threat of respiratory infections to global public health and the utmost importance of prompt diagnosis in managing and mitigating any pandemic. The nucleic acid amplification test (NAAT) is the primary detection method for most pathogens. Loop-mediated isothermal amplification (LAMP) is a rapid, simple, sensitive, and specific epitome of isothermal NAAT performed using a set of four to six primers. Primer design is a fundamental step in LAMP assays, with several complexities and experimental screening requirements. To address this challenge, an online database is presented here. Its workflow comprises three steps: literature aggregation, data curation, and database and website implementation. RESULTS: LAMPPrimerBank (https://lampprimerbank.mathematik.uni-marburg.de) is a manually curated database dedicated to experimentally validated LAMP primers, the peculiarities of their assays, and accompanying literature, with a primary emphasis on respiratory pathogens. LAMPPrimerBank, with its user-friendly web interface and an open application programming interface, enables the accelerated and facile exploration, comparison, and exportation of LAMP primer sequences and their respective information from the massively scattered literature. LAMPPrimerBank currently comprises LAMP primers for diagnosing viral, bacterial, and fungal respiratory pathogens. Additionally, to address the challenge of false-positive results generated by nonspecific amplifications, LAMPPrimerBank computationally predicted and visualized the sizes of LAMP products for recorded primer sets in the database. CONCLUSION: LAMPPrimerBank, as a pioneering database in the rapidly expanding field of isothermal NAAT, endeavors to confront the two challenges of LAMP: primer design and discrimination of false-positive results.


Subjects
COVID-19 , Molecular Diagnostic Techniques , Humans , Sensitivity and Specificity , Molecular Diagnostic Techniques/methods , COVID-19/diagnosis , Nucleic Acid Amplification Techniques/methods
10.
Comput Methods Programs Biomed ; 242: 107843, 2023 Dec.
Article in English | MEDLINE | ID: mdl-37832432

ABSTRACT

OBJECTIVE: Evaluating the performance of multiple complex models, such as those found in biology, medicine, climatology, and machine learning, using conventional approaches is often challenging when using various evaluation metrics simultaneously. The traditional approach, which relies on presenting multi-model evaluation scores in a table, presents an obstacle when determining the similarities between the models and the order of performance. METHODS: By combining statistics, information theory, and data visualization, juxtaposed Taylor and Mutual Information Diagrams permit users to track and summarize the performance of one model or a collection of different models. To uncover linear and nonlinear relationships between models, users may visualize one or both charts. RESULTS: Our library presents the first publicly available implementation of the Mutual Information Diagram and its new interactive capabilities, as well as the first publicly available implementation of an interactive Taylor Diagram. Extensions have been implemented so that both diagrams can display temporality, multimodality, and multivariate data sets, and feature one scalar model property such as uncertainty. Our library, named polar-diagrams, supports both continuous and categorical attributes. CONCLUSION: The library can be used to quickly and easily assess the performances of complex models, such as those found in machine learning, climate, or biomedical domains.
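
As background for readers unfamiliar with the Taylor Diagram, it summarizes a model against a reference with three linked statistics: the two standard deviations, their Pearson correlation, and the centered root-mean-square difference. The sketch below computes these in plain Python and checks the law-of-cosines relation that lets them share one polar plot. It does not use the polar-diagrams library, whose API is not described in the abstract; the function name and data are illustrative assumptions.

```python
import math


def taylor_statistics(reference, model):
    """Return (std_ref, std_model, correlation, centered_rms_difference).

    Population statistics (division by n) are used throughout so that the
    geometric relation between the four quantities holds exactly.
    """
    n = len(reference)
    if n == 0 or n != len(model):
        raise ValueError("series must be non-empty and of equal length")
    mean_r = sum(reference) / n
    mean_m = sum(model) / n
    dev_r = [r - mean_r for r in reference]
    dev_m = [m - mean_m for m in model]
    std_r = math.sqrt(sum(d * d for d in dev_r) / n)
    std_m = math.sqrt(sum(d * d for d in dev_m) / n)
    if std_r == 0 or std_m == 0:
        raise ValueError("correlation is undefined for a constant series")
    corr = sum(a * b for a, b in zip(dev_r, dev_m)) / (n * std_r * std_m)
    crmsd = math.sqrt(sum((b - a) ** 2 for a, b in zip(dev_r, dev_m)) / n)
    return std_r, std_m, corr, crmsd


# Made-up series, for illustration only.
ref = [1.0, 2.0, 3.0, 4.0, 5.0]
mod = [1.1, 1.9, 3.2, 3.8, 5.3]
std_r, std_m, corr, crmsd = taylor_statistics(ref, mod)

# Law of cosines: E'^2 = s_m^2 + s_r^2 - 2 * s_m * s_r * R
lhs = crmsd ** 2
rhs = std_m ** 2 + std_r ** 2 - 2 * std_m * std_r * corr
print(math.isclose(lhs, rhs, rel_tol=1e-9))
```

The Mutual Information Diagram mentioned in the abstract replaces these variance-based quantities with information-theoretic ones and is not covered by this sketch.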

11.
Front Genet ; 14: 1213829, 2023.
Article in English | MEDLINE | ID: mdl-37564874

ABSTRACT

Next-generation sequencing has revolutionized the field of microbiology research and greatly expanded our knowledge of complex bacterial communities. Nanopore sequencing provides distinct advantages, combining cost-effectiveness, ease of use, high throughput, and high taxonomic resolution through its ability to process long amplicons, such as the entire 16S rRNA gene. We examine the performance of the conventional 27F primer (27F-I) included in the 16S Barcoding Kit distributed by Oxford Nanopore Technologies (ONT) and that of a more degenerate 27F primer (27F-II) in the context of highly complex bacterial communities in 73 human fecal samples. The results show striking differences in both taxonomic diversity and relative abundance of a substantial number of taxa between the two primer sets. Primer 27F-I reveals a significantly lower biodiversity and, for example, at the phylum level, a dominance of Firmicutes and Proteobacteria as determined by relative abundances, as well as an unusually high Firmicutes/Bacteroidetes ratio when compared to the more degenerate primer set (27F-II). Considering the findings in the context of the gut microbiomes common in Western industrial societies, as reported in the American Gut Project, the more degenerate primer set (27F-II) reflects the composition and diversity of the fecal microbiome significantly better than the 27F-I primer. This study provides a fundamentally relevant comparative analysis of the in situ performance of two primer sets designed for sequencing of the entire 16S rRNA gene and suggests that the more degenerate primer set (27F-II) should be preferred for nanopore sequencing-based analyses of the human fecal microbiome.

12.
J Med Internet Res ; 25: e47540, 2023 Aug 29.
Article in English | MEDLINE | ID: mdl-37642995

ABSTRACT

Artificial intelligence (AI) and data sharing go hand in hand. In order to develop powerful AI models for medical and health applications, data need to be collected and brought together over multiple centers. However, due to various reasons, including data privacy, not all data can be made publicly available or shared with other parties. Federated and swarm learning can help in these scenarios. However, in the private sector, such as between companies, the incentive is limited, as the resulting AI models would be available for all partners irrespective of their individual contribution, including the amount of data provided by each party. Here, we explore a potential solution to this challenge as a viewpoint, aiming to establish a fairer approach that encourages companies to engage in collaborative data analysis and AI modeling. Within the proposed approach, each individual participant could gain a model commensurate with their respective data contribution, ultimately leading to better diagnostic tools for all participants in a fair manner.


Subjects
Artificial Intelligence , Data Analysis , Information Dissemination
14.
Front Genet ; 14: 1217860, 2023.
Article in English | MEDLINE | ID: mdl-37441549

ABSTRACT

Polygenic risk scores (PRS) calculate the risk for a specific disease based on the weighted sum of associated alleles from different genetic loci in the germline estimated by regression models. Recent advances in genetics made it possible to create polygenic predictors of complex human traits, including risks for many important complex diseases, such as cancer, diabetes, or cardiovascular diseases, typically influenced by many genetic variants, each of which has a negligible effect on overall risk. In the current study, we analyzed whether adding additional PRS from other diseases to the prediction models and replacing the regressions with machine learning models can improve overall predictive performance. Results showed that multi-PRS models outperform single-PRS models significantly on different diseases. Moreover, replacing regression models with machine learning models, i.e., deep learning, can also improve overall accuracy.
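
The weighted-sum definition above can be sketched in a few lines. In this illustration each variant contributes its effect-size weight multiplied by the number of risk alleles carried (0, 1, or 2). The variant identifiers, weights, and function names are hypothetical, and treating a missing genotype as zero risk alleles is a simplifying assumption, not the study's method.

```python
def polygenic_risk_score(dosages, weights):
    """Weighted sum of risk-allele dosages.

    dosages: variant id -> number of risk alleles carried (0, 1 or 2)
    weights: variant id -> effect size, e.g. a regression coefficient
    Simplification: variants missing from `dosages` count as 0.
    """
    return sum(w * dosages.get(variant, 0) for variant, w in weights.items())


def multi_prs_features(dosages, weights_by_disease):
    """One PRS per disease, usable as a feature vector for a downstream model."""
    return [
        polygenic_risk_score(dosages, weights_by_disease[disease])
        for disease in sorted(weights_by_disease)
    ]


# Hypothetical variants and weights, for illustration only.
weights_by_disease = {
    "disease_a": {"snp_1": 0.20, "snp_2": -0.10, "snp_3": 0.05},
    "disease_b": {"snp_2": 0.30, "snp_4": 0.15},
}
person = {"snp_1": 2, "snp_2": 1, "snp_3": 0}

print(multi_prs_features(person, weights_by_disease))
```

A multi-PRS model in the sense of the abstract would feed such a vector of scores, rather than a single score, into the prediction model.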

15.
J Med Internet Res ; 25: e42621, 2023 Jul 12.
Article in English | MEDLINE | ID: mdl-37436815

ABSTRACT

BACKGROUND: Machine learning and artificial intelligence have shown promising results in many areas and are driven by the increasing amount of available data. However, these data are often distributed across different institutions and cannot be easily shared owing to strict privacy regulations. Federated learning (FL) allows the training of distributed machine learning models without sharing sensitive data. However, its implementation is time-consuming and requires advanced programming skills and complex technical infrastructures. OBJECTIVE: Various tools and frameworks have been developed to simplify the development of FL algorithms and provide the necessary technical infrastructure. Although there are many high-quality frameworks, most focus only on a single application case or method. To our knowledge, there are no generic frameworks, meaning that the existing solutions are restricted to a particular type of algorithm or application field. Furthermore, most of these frameworks provide an application programming interface that needs programming knowledge. There is no collection of ready-to-use FL algorithms that are extendable and allow users (eg, researchers) without programming knowledge to apply FL. A central FL platform for both FL algorithm developers and users does not exist. This study aimed to address this gap and make FL available to everyone by developing FeatureCloud, an all-in-one platform for FL in biomedicine and beyond. METHODS: The FeatureCloud platform consists of 3 main components: a global frontend, a global backend, and a local controller. Our platform uses Docker to separate the local acting components of the platform from the sensitive data systems. We evaluated our platform using 4 different algorithms on 5 data sets for both accuracy and runtime. RESULTS: FeatureCloud removes the complexity of distributed systems for developers and end users by providing a comprehensive platform for executing multi-institutional FL analyses and implementing FL algorithms. Through its integrated artificial intelligence store, federated algorithms can easily be published and reused by the community. To secure sensitive raw data, FeatureCloud supports privacy-enhancing technologies to secure the shared local models and assures high standards in data privacy to comply with the strict General Data Protection Regulation. Our evaluation shows that applications developed in FeatureCloud can produce highly similar results compared with centralized approaches and scale well for an increasing number of participating sites. CONCLUSIONS: FeatureCloud provides a ready-to-use platform that integrates the development and execution of FL algorithms while reducing the complexity to a minimum and removing the hurdles of federated infrastructure. Thus, we believe that it has the potential to greatly increase the accessibility of privacy-preserving and distributed data analyses in biomedicine and beyond.


Subjects
Algorithms , Artificial Intelligence , Humans , Health Occupations , Software , Computer Communication Networks , Privacy
16.
Front Med (Lausanne) ; 10: 1180746, 2023.
Article in English | MEDLINE | ID: mdl-37342494

ABSTRACT

Introduction: Community-acquired pneumonia (CAP) and acute exacerbations of chronic obstructive pulmonary disease (AECOPD) result in high morbidity, mortality, and socio-economic burden. The usage of easily accessible biomarkers informing on disease entity, severity, prognosis, and pathophysiological endotypes is limited in clinical practice. Here, we have analyzed selected plasma markers for their value in differential diagnosis and severity grading in a clinical cohort. Methods: A pilot cohort of hospitalized patients suffering from CAP (n = 27), AECOPD (n = 10), and healthy subjects (n = 22) were characterized clinically. Clinical scores (PSI, CURB, CRB65, GOLD I-IV, and GOLD ABCD) were obtained, and interleukin-6 (IL-6), interleukin-8 (IL-8), interleukin-2-receptor (IL-2R), lipopolysaccharide-binding protein (LBP), resistin, thrombospondin-1 (TSP-1), lactotransferrin (LTF), neutrophil gelatinase-associated lipocalin (NGAL), neutrophil-elastase-2 (ELA2), hepatocyte growth factor (HGF), soluble Fas (sFas), as well as TNF-related apoptosis-inducing ligand (TRAIL) were measured in plasma. Results: In CAP patients and healthy volunteers, we found significantly different levels of ELA2, HGF, IL-2R, IL-6, IL-8, LBP, resistin, LTF, and TRAIL. The panel of LBP, sFas, and TRAIL could discriminate between uncomplicated and severe CAP. AECOPD patients showed significantly different levels of LTF and TRAIL compared to healthy subjects. Ensemble feature selection revealed that CAP and AECOPD can be discriminated by IL-6, resistin, together with IL-2R. These factors even allow the differentiation between COPD patients suffering from an exacerbation or pneumonia. Discussion: Taken together, we identified immune mediators in patient plasma that provide information on differential diagnosis and disease severity and can therefore serve as biomarkers. Further studies are required for validation in bigger cohorts.

17.
Bioinformatics ; 39(5), 2023 May 04.
Article in English | MEDLINE | ID: mdl-37195463

ABSTRACT

MOTIVATION: Identifying organellar DNA, such as mitochondrial or plastid sequences, inside a whole genome assembly, remains challenging and requires biological background knowledge. To address this, we developed ODNA based on genome annotation and machine learning to fulfill. RESULTS: ODNA is software that classifies organellar DNA sequences within a genome assembly by machine learning based on a predefined genome annotation workflow. We trained our model with 829 769 DNA sequences from 405 genome assemblies and achieved high predictive performance (e.g. Matthews correlation coefficient of 0.61 for mitochondria and 0.73 for chloroplasts) on independent validation data, thus outperforming existing approaches significantly. AVAILABILITY AND IMPLEMENTATION: Our software ODNA is freely accessible as a web service at https://odna.mathematik.uni-marburg.de and can also be run in a Docker container. The source code can be found at https://gitlab.com/mosga/odna and the processed data at Zenodo (DOI: 10.5281/zenodo.7506483).
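
The performance figures above are reported as Matthews correlation coefficients, a metric for binary classification that stays informative when classes are imbalanced, as organellar versus nuclear sequences typically are. The following is a minimal sketch of the standard formula; the function name and the labels are illustrative and are not taken from ODNA. Returning 0.0 when the denominator is zero is a common convention assumed here.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels coded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention when a row or column of the table is empty
    return (tp * tn - fp * fn) / denom


# Made-up labels: 1 = organellar, 0 = nuclear (illustration only).
truth = [1, 1, 0, 0]
predicted = [1, 0, 0, 0]
print(matthews_corrcoef(truth, predicted))
```

The coefficient ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction).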


Subjects
Mitochondria , Organelles , Sequence Analysis, DNA , Mitochondria/genetics , Software , Machine Learning , DNA
18.
EBioMedicine ; 92: 104616, 2023 Jun.
Article in English | MEDLINE | ID: mdl-37209533

ABSTRACT

BACKGROUND: Gastric cancer (GC) is clinically heterogeneous according to location (cardia/non-cardia) and histopathology (diffuse/intestinal). We aimed to characterize the genetic risk architecture of GC according to its subtypes. Another aim was to examine whether cardia GC and oesophageal adenocarcinoma (OAC) and its precursor lesion Barrett's oesophagus (BO), which are all located at the gastro-oesophageal junction (GOJ), share polygenic risk architecture. METHODS: We did a meta-analysis of ten European genome-wide association studies (GWAS) of GC and its subtypes. All patients had a histopathologically confirmed diagnosis of gastric adenocarcinoma. For the identification of risk genes among GWAS loci we did a transcriptome-wide association study (TWAS) and expression quantitative trait locus (eQTL) study from gastric corpus and antrum mucosa. To test whether cardia GC and OAC/BO share genetic aetiology we also used a European GWAS sample with OAC/BO. FINDINGS: Our GWAS consisting of 5816 patients and 10,999 controls highlights the genetic heterogeneity of GC according to its subtypes. We newly identified two and replicated five GC risk loci, all of them with subtype-specific association. The gastric transcriptome data consisting of 361 corpus and 342 antrum mucosa samples revealed that upregulated expression of MUC1, ANKRD50, PTGER4, and PSCA is a plausible GC pathomechanism at four GWAS loci. At another risk locus, we found that blood group O exerts protective effects for non-cardia and diffuse GC, while blood group A increases risk for both GC subtypes. Furthermore, our GWAS on cardia GC and OAC/BO (10,279 patients, 16,527 controls) showed that both cancer entities share genetic aetiology at the polygenic level and identified two new risk loci on the single-marker level. INTERPRETATION: Our findings show that the pathophysiology of GC is genetically heterogeneous according to location and histopathology. Moreover, our findings point to common molecular mechanisms underlying cardia GC and OAC/BO. FUNDING: German Research Foundation (DFG).


Subjects
Adenocarcinoma , Barrett Esophagus , Esophageal Neoplasms , Stomach Neoplasms , Humans , Stomach Neoplasms/genetics , Genome-Wide Association Study , Genetic Heterogeneity , Barrett Esophagus/genetics , Adenocarcinoma/pathology , Esophageal Neoplasms/genetics , Risk Factors
19.
Comput Struct Biotechnol J ; 21: 1573-1583, 2023.
Article in English | MEDLINE | ID: mdl-36874157

ABSTRACT

Loss of the Y chromosome (LoY) is frequently observed in somatic cells of elderly men. However, LoY is highly increased in tumor tissue and correlates with an overall worse prognosis. The underlying causes and downstream effects of LoY are widely unknown. Therefore, we analyzed genomic and transcriptomic data of 13 cancer types (2375 patients) and classified tumors of male patients according to loss or retention of the Y chromosome (LoY or RoY, average LoY fraction: 0.46). The frequencies of LoY ranged from near absence (glioblastoma, glioma, thyroid carcinoma) to 77% (kidney renal papillary cell carcinoma). Genomic instability, aneuploidy, and mutation burden were enriched in LoY tumors. In addition, in LoY tumors we more frequently found the gatekeeping tumor suppressor gene TP53 mutated in three cancer types (colon adenocarcinoma, head and neck squamous carcinoma, lung adenocarcinoma) and the oncogenes MET, CDK6, KRAS, and EGFR amplified in multiple cancer types. On the transcriptomic level, we observed MMP13, known to be involved in invasion, to be up-regulated in LoY of three adenocarcinomas and down-regulation of the tumor suppressor gene GPC5 in LoY of three cancer types. Furthermore, we found enrichment of a smoking-related mutation signature in LoY tumors of head and neck and lung cancer. Strikingly, we observed a correlation between cancer type-specific sex bias in incidence rates and frequencies of LoY, in line with the hypothesis that LoY increases cancer risk in males. Overall, LoY is a frequent phenomenon in cancer that is enriched in genomically unstable tumors. It correlates with genomic features beyond the Y chromosome and might contribute to higher incidence rates in males.

20.
BioData Min ; 16(1): 10, 2023 Mar 16.
Article in English | MEDLINE | ID: mdl-36927546

ABSTRACT

BACKGROUND: Owing to the rising levels of multi-resistant pathogens, antimicrobial peptides, an alternative strategy to classic antibiotics, have received more attention. A crucial part is thereby the costly identification and validation. With the ever-growing amount of annotated peptides, researchers leverage artificial intelligence to circumvent the cumbersome, wet-lab-based identification and automate the detection of promising candidates. However, the prediction of a peptide's function is not limited to antimicrobial efficiency. To date, multiple studies successfully classified additional properties, e.g., antiviral or cell-penetrating effects. In this light, ensemble classifiers are employed aiming to further improve the prediction. Although we recently presented a workflow to significantly diminish the initial encoding choice, an entirely unsupervised encoding selection, considering various machine learning models, is still lacking. RESULTS: We developed a workflow that automatically selects encodings and generates classifier ensembles by employing sophisticated pruning methods. We observed that the Pareto frontier pruning is a good method to create encoding ensembles for the datasets at hand. In addition, encodings combined with the Decision Tree classifier as the base model are often superior. However, our results also demonstrate that none of the ensemble building techniques is outstanding for all datasets. CONCLUSION: The workflow conducts multiple pruning methods to evaluate ensemble classifiers composed from a wide range of peptide encodings and base models. Consequently, researchers can use the workflow for unsupervised encoding selection and ensemble creation. Ultimately, the extensible workflow can be used as a plugin for the PEPTIDE REACToR, further establishing it as a versatile tool in the domain.
