Search | BVS Violência e Saúde

1.

PriPath: identifying dysregulated pathways from differential gene expression via grouping, scoring, and modeling with an embedded feature selection approach.

Yousef, Malik; Ozdemir, Fatma; Jaber, Amhar; Allmer, Jens; Bakir-Gungor, Burcu.

BMC Bioinformatics ; 24(1): 60, 2023 Feb 23.

Article in English | MEDLINE | ID: mdl-36823571

ABSTRACT

BACKGROUND: Cell homeostasis relies on the concerted actions of genes, and dysregulated genes can lead to diseases. In living organisms, genes or their products do not act alone but within networks. Subsets of these networks can be viewed as modules that provide specific functionality to an organism. The Kyoto Encyclopedia of Genes and Genomes (KEGG) systematically analyzes gene functions, proteins, and molecules and combines them into pathways. Measurements of gene expression (e.g., RNA-seq data) can be mapped to KEGG pathways to determine which modules are affected or dysregulated in the disease. However, genes acting in multiple pathways and other inherent issues complicate such analyses. Many current approaches employ only gene expression data and pay too little attention to the existing knowledge stored in KEGG pathways when detecting dysregulated pathways. New methods that consider more precompiled information are required for a more holistic association between gene expression and diseases. RESULTS: PriPath is a novel approach that transfers the generic process of grouping and scoring, followed by modeling, to the analysis of gene expression with KEGG pathways. In PriPath, KEGG pathways are utilized as the grouping function as part of a machine learning algorithm for selecting the most significant KEGG pathways. A machine learning model is trained to differentiate between diseases and controls using those groups. We have tested PriPath on 13 gene expression datasets of various cancers and other diseases. Our proposed approach successfully assigned biologically and clinically relevant KEGG terms to the samples based on the differentially expressed genes. We have comparatively evaluated the performance of PriPath against other tools, which are similar in their merit. For each dataset, we manually confirmed the top results of PriPath in the literature and found that most predictions can be supported by previous experimental research.
CONCLUSIONS: PriPath can thus aid in determining dysregulated pathways, which applies to medical diagnostics. In the future, we aim to advance this approach so that it can perform patient stratification based on gene expression and identify druggable targets. Thereby, we cover two aspects of precision medicine.

Subjects

Computational Biology, Neoplasms, Humans, Computational Biology/methods, Neoplasms/genetics, Genome, Algorithms, Gene Expression, Gene Expression Profiling
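
The grouping, scoring, and modeling process described in the PriPath abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy expression values, the pathway names, the nearest-centroid classifier, and the leave-one-out scoring are all assumptions chosen so the example runs with the standard library only. Each group (pathway) is scored by how well its genes alone separate the two classes, and the top-ranked groups supply the genes for the final model.

```python
from statistics import mean


def centroid(rows):
    """Column-wise mean of a list of equal-length numeric rows."""
    return [mean(col) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def loo_accuracy(samples, labels):
    """Leave-one-out accuracy of a nearest-centroid classifier (assumed scorer)."""
    hits = 0
    for i, (x, y) in enumerate(zip(samples, labels)):
        train = [(s, l) for j, (s, l) in enumerate(zip(samples, labels)) if j != i]
        cents = {
            cls: centroid([s for s, l in train if l == cls])
            for cls in {l for _, l in train}
        }
        pred = min(cents, key=lambda c: sq_dist(x, cents[c]))
        hits += pred == y
    return hits / len(samples)


def gsm(expr, labels, groups, top_k=1):
    """Group genes, score each group, then model on the top_k groups.

    expr:   dict mapping gene name -> list of expression values (one per sample)
    groups: dict mapping group name (e.g. a pathway) -> list of gene names
    """
    def submatrix(genes):
        cols = [expr[g] for g in genes if g in expr]
        return [list(row) for row in zip(*cols)]  # one row per sample

    # Grouping + scoring: evaluate every group on its own genes only.
    scores = {name: loo_accuracy(submatrix(genes), labels)
              for name, genes in groups.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)

    # Modeling: pool the genes of the best groups and evaluate a final model.
    selected = sorted({g for name in ranked[:top_k] for g in groups[name] if g in expr})
    final_acc = loo_accuracy(submatrix(selected), labels)
    return ranked, scores, selected, final_acc


# Hypothetical toy data: 3 controls (0) followed by 3 disease samples (1).
labels = [0, 0, 0, 1, 1, 1]
expr = {
    "A": [1.0, 1.2, 0.9, 3.0, 3.1, 2.9],
    "B": [0.5, 0.4, 0.6, 2.0, 2.2, 1.9],
    "C": [1.0, 2.0, 1.0, 2.0, 1.0, 2.0],
    "D": [5.0, 5.0, 5.1, 5.0, 5.1, 5.0],
}
groups = {"pathway_signal": ["A", "B"], "pathway_noise": ["C", "D"]}

if __name__ == "__main__":
    ranked, scores, selected, final_acc = gsm(expr, labels, groups)
    print(ranked, scores, selected, final_acc)
```

In practice the scorer would be replaced by whatever classifier and cross-validation scheme the analysis calls for; the sketch only shows how prior knowledge (group membership) constrains which features are evaluated together.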

2.

Predictive factors for degenerative lumbar spinal stenosis: a model obtained from a machine learning algorithm technique.

Abbas, Janan; Yousef, Malik; Peled, Natan; Hershkovitz, Israel; Hamoud, Kamal.

BMC Musculoskelet Disord ; 24(1): 218, 2023 Mar 23.

Article in English | MEDLINE | ID: mdl-36949452

ABSTRACT

BACKGROUND: Degenerative lumbar spinal stenosis (DLSS) is the most common spine disease in the elderly population. It is usually associated with degeneration of the lumbar spine joints and/or ligaments. Machine learning is an exclusive method for handling big data analysis; however, its development for spine pathology is rare. This study aims to detect the essential variables that predict the development of symptomatic DLSS using the random forest machine learning (ML) algorithm. METHODS: A retrospective study with two groups of individuals. The first included 165 individuals with symptomatic DLSS (sex ratio: 80 M/85 F), and the second included 180 individuals from the general population (sex ratio: 90 M/90 F) without lumbar spinal stenosis symptoms. Lumbar spine measurements, such as vertebral or spinal canal diameters from L1 to S1, were taken on computerized tomography (CT) images. Demographic and health data of all participants (e.g., body mass index and diabetes mellitus) were also recorded. RESULTS: The decision tree model of ML demonstrates that the anteroposterior diameter of the bony canal at the L5 (males) and L4 (females) levels has the greatest influence on symptomatic DLSS (scores of 1 and 0.938). In addition, a combination of these variables with other lumbar spine features is required for developing DLSS. CONCLUSIONS: Our results indicate that a combination of lumbar spine characteristics, such as bony canal and vertebral body dimensions, rather than the presence of a sole variable, is highly associated with symptomatic DLSS onset.

Subjects

Spinal Diseases, Spinal Stenosis, Male, Female, Humans, Aged, Spinal Stenosis/diagnosis, Retrospective Studies, Spinal Diseases/pathology, X-Ray Computed Tomography, Lumbar Vertebrae/diagnostic imaging, Lumbar Vertebrae/pathology, Algorithms

3.

Correlates of Hookah Smoking among Arab Adults in Israel Identified by a Machine Learning Algorithm.

Khatib, Mohammad; Sheikh Muhammad, Ahmad; Hadid, Salam; Ben Shlomo, Izhar; Yousef, Malik.

Isr Med Assoc J ; 24(4): 246-252, 2022 Apr.

Article in English | MEDLINE | ID: mdl-35415984

ABSTRACT

BACKGROUND: Hookah smoking is a common activity around the world and has recently become a trend among youth. Studies have indicated a relationship between hookah smoking and a high prevalence of chronic diseases, cancer, cardiovascular, and infectious diseases. In Israel, there has been a sharp increase in hookah smoking among the Arabs. Most studies have focused mainly on hookah smoking among young people. OBJECTIVES: To examine the association between hookah smoking and socioeconomic characteristics, health status and behaviors, and knowledge in the adult Arab population and to build a prediction model using machine learning methods. METHODS: This quantitative study is based on data from the Health and Environment Survey conducted by the Galilee Society in 2015-2016. The data were collected through face-to-face interviews with 2046 adults aged 18 years and older. RESULTS: Using machine learning, a prediction model was built based on eight features. Of the total study population, 13.0% smoked hookah. In the 18-34 age group, 19.5% smoked. Men, people with a lower level of health knowledge, heavy consumers of energy drinks and alcohol, and unemployed people were more likely to smoke hookah. Younger and more educated people were more likely to smoke hookah. CONCLUSIONS: Hookah smoking is a widespread behavior among adult Arabs in Israel. The model generated by our study is intended to help health organizations reach people at risk for smoking hookah and to suggest different approaches to eliminate this phenomenon.

Subjects

Arabs, Water Pipe Smoking, Adolescent, Adult, Algorithms, Humans, Israel/epidemiology, Machine Learning, Male, Water Pipe Smoking/epidemiology, Young Adult

4.

maTE: discovering expressed interactions between microRNAs and their targets.

Yousef, Malik; Abdallah, Loai; Allmer, Jens.

Bioinformatics ; 35(20): 4020-4028, 2019 Oct 15.

Article in English | MEDLINE | ID: mdl-30895309

ABSTRACT

MOTIVATION: Disease is often manifested via changes in transcript and protein abundance. MicroRNAs (miRNAs) are instrumental in regulating protein abundance and may measurably influence transcript levels. miRNAs often target more than one mRNA (for humans, the average is three), and mRNAs are often targeted by more than one miRNA (for the genes considered in this study, the average is also three). Therefore, it is difficult to determine the miRNAs that may cause the observed differential gene expression. We present a novel approach, maTE, which is based on machine learning, that integrates information about miRNA target genes with gene expression data. maTE depends on the availability of a sufficient amount of patient and control samples. The samples are used to train classifiers to accurately classify the samples on a per miRNA basis. Multiple high scoring miRNAs are used to build a final classifier to improve separation. RESULTS: The aim of the study is to find a set of miRNAs causing the regulation of their target genes that best explains the difference between groups (e.g. cancer versus control). maTE provides a list of significant groups of genes where each group is targeted by a specific miRNA. For the datasets used in this study, maTE generally achieves an accuracy well above 80%. Also, the results show that when the accuracy is much lower (e.g. ~50%), the set of miRNAs provided is likely not causative of the difference in expression. This new approach of integrating miRNA regulation with expression data yields powerful results and is independent of external labels and training data. Thereby, this approach allows new avenues for exploring miRNA regulation and may enable the development of miRNA-based biomarkers and drugs. AVAILABILITY AND IMPLEMENTATION: The KNIME workflow, implementing maTE, is available at Bioinformatics online. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

MicroRNAs/genetics, Gene Expression Profiling, Humans, Machine Learning, Neoplasms, Messenger RNA

5.

Application of Biological Domain Knowledge Based Feature Selection on Gene Expression Data.

Yousef, Malik; Kumar, Abhishek; Bakir-Gungor, Burcu.

Entropy (Basel) ; 23(1), 2020 Dec 22.

Article in English | MEDLINE | ID: mdl-33374969

ABSTRACT

In the last two decades, there have been massive advancements in high throughput technologies, which resulted in the exponential growth of public repositories of gene expression datasets for various phenotypes. It is possible to unravel biomarkers by comparing the gene expression levels under different conditions, such as disease vs. control, treated vs. not treated, drug A vs. drug B, etc. This problem refers to a well-studied problem in the machine learning domain, i.e., the feature selection problem. In biological data analysis, most of the computational feature selection methodologies were taken from other fields, without considering the nature of the biological data. Thus, integrative approaches that utilize the biological knowledge while performing feature selection are necessary for this kind of data. The main idea behind the integrative gene selection process is to generate a ranked list of genes considering both the statistical metrics that are applied to the gene expression data, and the biological background information which is provided as external datasets. One of the main goals of this review is to explore the existing methods that integrate different types of information in order to improve the identification of the biomolecular signatures of diseases and the discovery of new potential targets for treatment. These integrative approaches are expected to aid the prediction, diagnosis, and treatment of diseases, as well as to enlighten us on disease state dynamics, mechanisms of their onset and progression. The integration of various types of biological information will necessitate the development of novel techniques for integration and data analysis. Another aim of this review is to boost the bioinformatics community to develop new approaches for searching and determining significant groups/clusters of features based on one or more biological grouping functions.

6.

Sequence-based information-theoretic features for gene essentiality prediction.

Nigatu, Dawit; Sobetzko, Patrick; Yousef, Malik; Henkel, Werner.

BMC Bioinformatics ; 18(1): 473, 2017 Nov 09.

Article in English | MEDLINE | ID: mdl-29121868

ABSTRACT

BACKGROUND: Identification of essential genes is not only useful for our understanding of the minimal gene set required for cellular life but also aids the identification of novel drug targets in pathogens. In this work, we present a simple and effective gene essentiality prediction method using information-theoretic features that are derived exclusively from the gene sequences. RESULTS: We developed a Random Forest classifier and performed an extensive model performance evaluation among and within 15 selected bacteria. In intra-organism predictions, where training and testing sets are taken from the same organism, AUC (Area Under the Curve) scores ranging from 0.73 to 0.90, 0.84 on average, were obtained. Cross-organism predictions using 5-fold cross-validation, pairwise, leave-one-species-out, leave-one-taxon-out, and cross-taxon yielded average AUC scores of 0.88, 0.75, 0.80, 0.82, and 0.78, respectively. To further show the applicability of our method in other domains of life, we predicted the essential genes of the yeast Schizosaccharomyces pombe and obtained a similar accuracy (AUC 0.84). CONCLUSIONS: The proposed method enables a simple and reliable identification of essential genes without searching in databases for orthologs and demanding further experimental data such as network topology and gene-expression.

Subjects

Bacteria/genetics, Essential Genes, Theoretical Models, Area Under Curve, Base Sequence, Machine Learning, Markov Chains, ROC Curve
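
As an illustration of what a sequence-derived information-theoretic feature can look like, the sketch below computes the Shannon entropy of the k-mer distribution of a gene sequence. This is an assumed, generic example: the abstract does not list the exact features used, and the function names `kmer_entropy` and `entropy_features` are hypothetical.

```python
from collections import Counter
from math import log2


def kmer_entropy(seq, k=1):
    """Shannon entropy (in bits) of the overlapping k-mer distribution of seq."""
    seq = seq.upper()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        raise ValueError("sequence is shorter than k")
    counts = Counter(kmers)
    total = len(kmers)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def entropy_features(seq, ks=(1, 2, 3)):
    """Small feature vector: one entropy value per k-mer length in ks."""
    return [kmer_entropy(seq, k) for k in ks]


if __name__ == "__main__":
    # Hypothetical short sequence, used only to show the call.
    print(entropy_features("ATGGCGTACGTTAGC"))
```

Feature vectors of this kind need only the sequence itself, which matches the abstract's point that no ortholog search, network topology, or expression data is required.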

7.

MicroRNA categorization using sequence motifs and k-mers.

Yousef, Malik; Khalifa, Waleed; Acar, Ilhan Erkin; Allmer, Jens.

BMC Bioinformatics ; 18(1): 170, 2017 Mar 14.

Article in English | MEDLINE | ID: mdl-28292266

ABSTRACT

BACKGROUND: Post-transcriptional gene dysregulation can be a hallmark of diseases like cancer and microRNAs (miRNAs) play a key role in the modulation of translation efficiency. Known pre-miRNAs are listed in miRBase, and they have been discovered in a variety of organisms ranging from viruses and microbes to eukaryotic organisms. The computational detection of pre-miRNAs is of great interest, and such approaches usually employ machine learning to discriminate between miRNAs and other sequences. Many features have been proposed describing pre-miRNAs, and we have previously introduced the use of sequence motifs and k-mers as useful ones. There have been reports of xeno-miRNAs detected via next generation sequencing. However, they may be contaminations and to aid that important decision-making process, we aimed to establish a means to differentiate pre-miRNAs from different species. RESULTS: To achieve distinction into species, we used one species' pre-miRNAs as the positive and another species' pre-miRNAs as the negative training and test data for the establishment of machine learned models based on sequence motifs and k-mers as features. This approach resulted in higher accuracy values between distantly related species while species with closer relation produced lower accuracy values. CONCLUSIONS: We were able to differentiate among species with increasing success when the evolutionary distance increases. This conclusion is supported by previous reports of fast evolutionary changes in miRNAs since even in relatively closely related species a fairly good discrimination was possible.

Subjects

MicroRNAs/metabolism, Animals, Base Sequence, Fabaceae/classification, Fabaceae/genetics, High-Throughput Nucleotide Sequencing, Humans, MicroRNAs/chemistry, MicroRNAs/genetics, Phylogeny, RNA Precursors/genetics, RNA Precursors/metabolism
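
The k-mer features mentioned in the abstract above can be sketched as normalized counts of all overlapping substrings of length k. The function name `kmer_frequencies`, the RNA alphabet, and the example sequence are assumptions for illustration; the study's actual feature set (including its sequence motifs) is not reproduced here.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=2, alphabet="ACGU"):
    """Relative frequency of every possible k-mer over the given alphabet.

    DNA input is accepted: T is converted to U. Windows containing symbols
    outside the alphabet (e.g. N) are skipped.
    """
    seq = seq.upper().replace("T", "U")
    allowed = set(alphabet)
    windows = (seq[i:i + k] for i in range(len(seq) - k + 1))
    counts = Counter(w for w in windows if set(w) <= allowed)
    total = sum(counts.values())
    vocabulary = ("".join(p) for p in product(alphabet, repeat=k))
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in vocabulary}


if __name__ == "__main__":
    # Hypothetical hairpin-like fragment, used only to show the call.
    features = kmer_frequencies("ACGUACGU", k=2)
    print({kmer: round(freq, 3) for kmer, freq in features.items() if freq > 0})
```

Because the dictionary always contains all 4^k keys in the same order, vectors from different sequences line up column by column and can be passed directly to a classifier that separates the pre-miRNAs of one species from those of another.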

8.

Novel Antimicrobial Peptide Design Using Motif Match Score Representation.

Soylemez, Ummu Gulsum; Yousef, Malik; Bakir-Gungor, Burcu.

IEEE/ACM Trans Comput Biol Bioinform ; PP, 2024 Jun 12.

Article in English | MEDLINE | ID: mdl-38865233

ABSTRACT

Antimicrobial peptides (AMPs) have drawn the interest of researchers since they offer an alternative to traditional antibiotics in the fight against antibiotic resistance and exhibit additional pharmaceutically significant properties. Recently, computational approaches have attempted to reveal how antibacterial activity is determined from a machine learning perspective, aiming to find the biological cues or characteristics that control antimicrobial activity by incorporating motif match scores. This study is dedicated to the development of a machine learning framework aimed at devising novel antimicrobial peptide (AMP) sequences potentially effective against Gram-positive/Gram-negative bacteria. In order to design newly generated sequences classified as either AMP or non-AMP, various classification models were trained. These novel sequences underwent validation utilizing the "DBAASP: strain-specific antibacterial prediction based on machine learning approaches and data on AMP sequences" tool. The findings presented herein represent a significant stride in this computational research, streamlining the process of AMP creation or modification within wet lab environments.

9.

Delta-aminolevulinate-induced host-parasite porphyric disparity for selective photolysis of transgenic Leishmania in the phagolysosomes of mononuclear phagocytes: a potential novel platform for vaccine delivery.

Dutta, Sujoy; Chang, Celia; Kolli, Bala Krishna; Sassa, Shigeru; Yousef, Malik; Showe, Michael; Showe, Louise; Chang, Kwang-Poo.

Eukaryot Cell ; 11(4): 430-41, 2012 Apr.

Article in English | MEDLINE | ID: mdl-22307976

ABSTRACT

Leishmania double transfectants (DTs) expressing the 2nd and 3rd enzymes in the heme biosynthetic pathway were previously reported to show neogenesis of uroporphyrin I (URO) when induced with delta-aminolevulinate (ALA), the product of the 1st enzyme in the pathway. The ensuing accumulation of URO in DT promastigotes rendered them light excitable to produce reactive oxygen species (ROS), resulting in their cytolysis. Evidence is presented showing that the DTs retained wild-type infectivity to their host cells and that the intraphagolysosomal/parasitophorous vacuolar (PV) DTs remained ALA inducible for uroporphyrinogenesis/photolysis. Exposure of DT-infected cells to ALA was noted by fluorescence microscopy to result in host-parasite differential porphyrinogenesis: porphyrin fluorescence emerged first in the host cells and then in the intra-PV amastigotes. DT-infected and control cells differed qualitatively and quantitatively in their porphyrin species, consistent with the expected multi- and monoporphyrinogenic specificities of the host cells and the DTs, respectively. After ALA removal, the neogenic porphyrins were rapidly lost from the host cells but persisted as URO in the intra-PV DTs. These DTs were thus extremely light sensitive and were lysed selectively by illumination under nonstringent conditions in the relatively ROS-resistant phagolysosomes. Photolysis of the intra-PV DTs returned the distribution of major histocompatibility complex (MHC) class II molecules and the global gene expression profiles of host cells to their preinfection patterns and, when transfected with ovalbumin, released this antigen for copresentation with MHC class I molecules. These Leishmania mutants thus have considerable potential as a novel model of a universal vaccine carrier for photodynamic immunotherapy/immunoprophylaxis.

Subjects

Aminolevulinic Acid/pharmacology, Leishmania/genetics, Phagocytes/parasitology, Phagosomes/parasitology, Photosensitizing Agents/pharmacology, Porphyrins/biosynthesis, Vaccination/methods, Animals, Antigen Presentation, Protozoan Antigens/immunology, Cultured Cells, Dendritic Cells/metabolism, Dendritic Cells/parasitology, Dendritic Cells/radiation effects, Gene Expression Profiling, Histocompatibility Antigens Class I/metabolism, Leishmania/immunology, Leishmania/radiation effects, Peritoneal Macrophages/metabolism, Peritoneal Macrophages/parasitology, Peritoneal Macrophages/radiation effects, Mice, Inbred BALB C Mice, Oligonucleotide Array Sequence Analysis, Genetically Modified Organisms/immunology, Photolysis

10.

Deep learning in bioinformatics.

Yousef, Malik; Allmer, Jens.

Turk J Biol ; 47(6): 366-382, 2023.

Article in English | MEDLINE | ID: mdl-38681776

ABSTRACT

Deep learning is a powerful machine learning technique that can learn from large amounts of data using multiple layers of artificial neural networks. This paper reviews some applications of deep learning in bioinformatics, a field that deals with analyzing and interpreting biological data. We first introduce the basic concepts of deep learning and then survey the recent advances and challenges of applying deep learning to various bioinformatics problems, such as genome sequencing, gene expression analysis, protein structure prediction, drug discovery, and disease diagnosis. We also discuss future directions and opportunities for deep learning in bioinformatics. We aim to provide an overview of deep learning so that bioinformaticians applying deep learning models can consider all critical technical and ethical aspects. Thus, our target audience is biomedical informatics researchers who use deep learning models for inference. This review will inspire more bioinformatics researchers to adopt deep-learning methods for their research questions while considering fairness, potential biases, explainability, and accountability.

11.

miRGediNET: A comprehensive examination of common genes in miRNA-Target interactions and disease associations: Insights from a grouping-scoring-modeling approach.

Qumsiyeh, Emma; Salah, Zaidoun; Yousef, Malik.

Heliyon ; 9(12): e22666, 2023 Dec.

Article in English | MEDLINE | ID: mdl-38090011

ABSTRACT

In the broad and complex field of biological data analysis, researchers frequently gather information from a single source or database. Despite being a widespread practice, this has disadvantages. Relying exclusively on a single source can limit our comprehension as it may omit various perspectives that could be obtained by combining multiple knowledge bases. Acknowledging this shortcoming, we report on miRGediNET, a novel approach combining information from three biological databases. Our investigation focuses on microRNAs (miRNAs), small non-coding RNA molecules that regulate gene expression post-transcriptionally. We delve deeply into the knowledge of these miRNAs' interactions with genes and the possible effects these interactions may have on different diseases. The scientific community has long recognized a direct correlation between the progression of specific diseases and miRNAs, as well as the genes they target. By using miRGediNET, we go beyond simply acknowledging this relationship. Rather, we actively look for the critical genes that could act as links between the actions of miRNAs and the mechanisms underlying disease. Our methodology, which carefully identifies and investigates these important genes, is supported by a strategic framework that may open up new possibilities for comprehending diseases and creating treatments. We have developed a tool on the KNIME platform as a concrete application of our research. This tool serves as both a validation of our study and an invitation to the larger community to interact with, investigate, and build upon our findings. miRGediNET is publicly accessible on GitHub at https://github.com/malikyousef/miRGediNET, providing a collaborative environment for additional research and innovation for enthusiasts and fellow researchers.

12.

TextNetTopics Pro, a topic model-based text classification for short text by integration of semantic and document-topic distribution information.

Voskergian, Daniel; Bakir-Gungor, Burcu; Yousef, Malik.

Front Genet ; 14: 1243874, 2023.

Article in English | MEDLINE | ID: mdl-37867598

ABSTRACT

With the exponential growth in the daily publication of scientific articles, automatic classification and categorization can assist in assigning articles to a predefined category. Article titles are concise descriptions of the articles' content with valuable information that can be useful in document classification and categorization. However, shortness, data sparseness, limited word occurrences, and the inadequate contextual information of scientific document titles hinder the direct application of conventional text mining and machine learning algorithms on these short texts, making their classification a challenging task. This study firstly explores the performance of our earlier study, TextNetTopics on the short text. Secondly, here we propose an advanced version called TextNetTopics Pro, which is a novel short-text classification framework that utilizes a promising combination of lexical features organized in topics of words and topic distribution extracted by a topic model to alleviate the data-sparseness problem when classifying short texts. We evaluate our proposed approach using nine state-of-the-art short-text topic models on two publicly available datasets of scientific article titles as short-text documents. The first dataset is related to the Biomedical field, and the other one is related to Computer Science publications. Additionally, we comparatively evaluate the predictive performance of the models generated with and without using the abstracts. Finally, we demonstrate the robustness and effectiveness of the proposed approach in handling the imbalanced data, particularly in the classification of Drug-Induced Liver Injury articles as part of the CAMDA challenge. Taking advantage of the semantic information detected by topic models proved to be a reliable way to improve the overall performance of ML classifiers.

13.

GeNetOntology: identifying affected gene ontology terms via grouping, scoring, and modeling of gene expression data utilizing biological knowledge-based machine learning.

Ersoz, Nur Sebnem; Bakir-Gungor, Burcu; Yousef, Malik.

Front Genet ; 14: 1139082, 2023.

Article in English | MEDLINE | ID: mdl-37671046

ABSTRACT

Introduction: Identifying significant sets of genes that are up/downregulated under specific conditions is vital to understand disease development mechanisms at the molecular level. Along this line, in order to analyze transcriptomic data, several computational feature selection (i.e., gene selection) methods have been proposed. On the other hand, uncovering the core functions of the selected genes provides a deep understanding of diseases. In order to address this problem, biological domain knowledge-based feature selection methods have been proposed. Unlike computational gene selection approaches, these domain knowledge-based methods take the underlying biology into account and integrate knowledge from external biological resources. Gene Ontology (GO) is one such biological resource that provides ontology terms for defining the molecular function, cellular component, and biological process of the gene product. Methods: In this study, we developed a tool named GeNetOntology which performs GO-based feature selection for gene expression data analysis. In the proposed approach, the process of Grouping, Scoring, and Modeling (G-S-M) is used to identify significant GO terms. GO information has been used as the grouping information, which has been embedded into a machine learning (ML) algorithm to select informative ontology terms. The genes annotated with the selected ontology terms have been used in the training part to carry out the classification task of the ML model. The output is an important set of ontologies for the two-class classification task applied to gene expression data for a given phenotype. Results: Our approach has been tested on 11 different gene expression datasets, and the results showed that GeNetOntology successfully identified important disease-related ontology terms to be used in the classification model. 
Discussion: GeNetOntology will assist geneticists and scientists to identify a range of disease-related genes and ontologies in transcriptomic data analysis, and it will also help doctors design diagnosis platforms and improve patient treatment plans.

14.

Review of feature selection approaches based on grouping of features.

Kuzudisli, Cihan; Bakir-Gungor, Burcu; Bulut, Nurten; Qaqish, Bahjat; Yousef, Malik.

PeerJ ; 11: e15666, 2023.

Article in English | MEDLINE | ID: mdl-37483989

ABSTRACT

With the rapid development in technology, large amounts of high-dimensional data have been generated. This high dimensionality, including redundancy and irrelevancy, poses a great challenge in data analysis and decision making. Feature selection (FS) is an effective way to reduce dimensionality by eliminating redundant and irrelevant data. Most traditional FS approaches score and rank each feature individually and then perform FS either by eliminating lower-ranked features or by retaining highly ranked features. In this review, we discuss an emerging approach to FS that is based on initially grouping features, then scoring groups of features rather than scoring individual features. Despite the presence of reviews on clustering and FS algorithms, to the best of our knowledge, this is the first review focusing on FS techniques based on grouping. The typical idea behind FS through grouping is to generate groups of similar features with dissimilarity between groups, then select representative features from each cluster. Approaches under supervised, unsupervised, semi-supervised and integrative frameworks are explored. The comparison of experimental results indicates the effectiveness of sequential, optimization-based (i.e., fuzzy or evolutionary), hybrid and multi-method approaches. When it comes to biological data, the involvement of external biological sources can improve analysis results. We hope this work's findings can guide effective design of new FS approaches using feature grouping.

Subjects

Algorithms
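
The idea summarized in the review abstract above (form groups of similar features, then keep one representative per group) can be sketched as follows. The greedy correlation-threshold grouping and the choice of representative by correlation with the target are assumptions made for this example; they are one simple instance of the family of methods the review covers, not a method taken from it.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def group_features(features, threshold=0.9):
    """Greedy grouping: a feature joins the first group whose seed it correlates with."""
    groups = []
    for name, values in features.items():
        for group in groups:
            seed = group[0]
            if abs(pearson(values, features[seed])) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])  # no similar group found: start a new one
    return groups


def select_representatives(features, target, threshold=0.9):
    """Keep, per group, the feature most correlated (in absolute value) with the target."""
    groups = group_features(features, threshold)
    return [max(g, key=lambda n: abs(pearson(features[n], target))) for g in groups]


# Hypothetical toy data: f1 and f2 are near-duplicates, f3 is unrelated to both.
features = {
    "f1": [1, 2, 3, 4, 5],
    "f2": [1, 2, 3, 5, 6],
    "f3": [5, 1, 4, 2, 3],
}
target = [0, 0, 0, 1, 1]

if __name__ == "__main__":
    print(group_features(features))
    print(select_representatives(features, target))
```

The result depends on feature order because the grouping is greedy; clustering-based variants avoid that at higher computational cost.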

15.

microBiomeGSM: the identification of taxonomic biomarkers from metagenomic data using grouping, scoring and modeling (G-S-M) approach.

Bakir-Gungor, Burcu; Temiz, Mustafa; Jabeer, Amhar; Wu, Di; Yousef, Malik.

Front Microbiol ; 14: 1264941, 2023.

Article in English | MEDLINE | ID: mdl-38075911

ABSTRACT

Numerous biological environments have been characterized with the advent of metagenomic sequencing using next generation sequencing, which lays out the relative abundance values of microbial taxa. Modeling the human microbiome using machine learning models has the potential to identify microbial biomarkers and aid in the diagnosis of a variety of diseases such as inflammatory bowel disease, diabetes, colorectal cancer, and many others. The goal of this study is to develop an effective classification model for the analysis of metagenomic datasets associated with different diseases. In this way, we aim to identify taxonomic biomarkers associated with these diseases and facilitate disease diagnosis. The microBiomeGSM tool presented in this work incorporates pre-existing taxonomy information into a machine learning approach and attempts to solve the classification problem in disease-associated metagenomics datasets. Based on the G-S-M (Grouping-Scoring-Modeling) approach, species level information is used as features and classified by relating their taxonomic features at different levels, including genus, family, and order. Using four different disease-associated metagenomics datasets, the performance of microBiomeGSM is comparatively evaluated with other feature selection methods such as Fast Correlation Based Filter (FCBF), Select K Best (SKB), Extreme Gradient Boosting (XGB), Conditional Mutual Information Maximization (CMIM), Minimum Redundancy Maximum Relevance (MRMR) and Information Gain (IG), and with other classifiers such as AdaBoost, Decision Tree, LogitBoost and Random Forest. microBiomeGSM achieved the highest results, with an area under the curve (AUC) value of 0.98 at the order taxonomic level for the IBDMD dataset. Another significant output of microBiomeGSM is the list of taxonomic groups that are identified as important for the disease under study and the names of the species within these groups.
The association between the detected species and the disease under investigation is confirmed by previous studies in the literature. The microBiomeGSM tool and other supplementary files are publicly available at: https://github.com/malikyousef/microBiomeGSM.
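The grouping and scoring steps described in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy taxonomy, the species and genus names, the abundance values, and in particular the scoring rule (a standardized difference of class means), which stands in for whatever scoring microBiomeGSM actually applies.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical toy taxonomy: species -> lineage. All names are illustrative.
TAXONOMY = {
    "sp_a": {"genus": "G1", "family": "F1"},
    "sp_b": {"genus": "G1", "family": "F1"},
    "sp_c": {"genus": "G2", "family": "F1"},
    "sp_d": {"genus": "G3", "family": "F2"},
}

# Hypothetical relative abundances per sample; label 1 = disease, 0 = control.
SAMPLES = [
    {"sp_a": 0.40, "sp_b": 0.30, "sp_c": 0.20, "sp_d": 0.05},
    {"sp_a": 0.35, "sp_b": 0.35, "sp_c": 0.15, "sp_d": 0.10},
    {"sp_a": 0.10, "sp_b": 0.10, "sp_c": 0.20, "sp_d": 0.10},
    {"sp_a": 0.15, "sp_b": 0.05, "sp_c": 0.15, "sp_d": 0.05},
]
LABELS = [1, 1, 0, 0]


def group_species(taxonomy, level):
    """Grouping: collect species that share a taxon at the given level."""
    groups = defaultdict(list)
    for species, lineage in taxonomy.items():
        groups[lineage[level]].append(species)
    return dict(groups)


def score_group(samples, labels, members):
    """Scoring (stand-in rule): class-mean gap of the group's summed
    abundance, divided by its overall standard deviation."""
    totals = [sum(sample[m] for m in members) for sample in samples]
    spread = pstdev(totals)
    if spread == 0:
        return 0.0
    pos = [t for t, y in zip(totals, labels) if y == 1]
    neg = [t for t, y in zip(totals, labels) if y == 0]
    return abs(mean(pos) - mean(neg)) / spread


def rank_groups(samples, labels, taxonomy, level):
    """Return (group, score) pairs, best-scoring group first."""
    groups = group_species(taxonomy, level)
    scored = [(name, score_group(samples, labels, members))
              for name, members in groups.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, score in rank_groups(SAMPLES, LABELS, TAXONOMY, "genus"):
        print(name, round(score, 3))
```

Changing the `level` argument to `"family"` regroups the same species at a coarser taxonomic level, which mirrors the abstract's comparison across genus, family, and order.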

16.

Invention of 3Mint for feature grouping and scoring in multi-omics.

Unlu Yazici, Miray; Marron, J S; Bakir-Gungor, Burcu; Zou, Fei; Yousef, Malik.

Front Genet ; 14: 1093326, 2023.

Article in English | MEDLINE | ID: mdl-37007972

ABSTRACT

Advanced genomic and molecular profiling technologies have accelerated the elucidation of the regulatory mechanisms behind cancer development and progression, and of targeted therapies for patients. Along this line, intensive studies with immense amounts of biological information have boosted the discovery of molecular biomarkers. Cancer has been one of the leading causes of death around the world in recent years. Elucidation of genomic and epigenetic factors in Breast Cancer (BRCA) can provide a roadmap to uncover the disease mechanisms. Accordingly, unraveling the possible systematic connections between omics data types and their contribution to BRCA tumor progression is crucial. In this study, we have developed a novel machine learning (ML) based integrative approach for multi-omics data analysis. This integrative approach combines information from gene expression (mRNA), microRNA (miRNA), and methylation data. Due to the complexity of cancer, this integrated data is expected to improve the prediction, diagnosis, and treatment of disease through patterns only available from the three-way interactions between these three omics datasets. In addition, the proposed method bridges the interpretation gap between the disease mechanisms that drive onset and progression. Our fundamental contribution is the 3 Multi-omics integrative tool (3Mint). This tool aims to perform grouping and scoring of groups using biological knowledge. Another major goal is improved gene selection via detection of novel groups of cross-omics biomarkers. The performance of 3Mint is assessed using different metrics. Our computational performance evaluations showed that 3Mint classifies the BRCA molecular subtypes using fewer genes than the miRcorrNet tool, which uses miRNA and mRNA gene expression profiles, while achieving similar performance (95% accuracy). The incorporation of methylation data in 3Mint yields a much more focused analysis. The 3Mint tool and all other supplementary files are available at https://github.com/malikyousef/3Mint/.

17.

A toolbox of machine learning software to support microbiome analysis.

Marcos-Zambrano, Laura Judith; López-Molina, Víctor Manuel; Bakir-Gungor, Burcu; Frohme, Marcus; Karaduzovic-Hadziabdic, Kanita; Klammsteiner, Thomas; Ibrahimi, Eliana; Lahti, Leo; Loncar-Turukalo, Tatjana; Dhamo, Xhilda; Simeon, Andrea; Nechyporenko, Alina; Pio, Gianvito; Przymus, Piotr; Sampri, Alexia; Trajkovik, Vladimir; Lacruz-Pleguezuelos, Blanca; Aasmets, Oliver; Araujo, Ricardo; Anagnostopoulos, Ioannis; Aydemir, Önder; Berland, Magali; Calle, M Luz; Ceci, Michelangelo; Duman, Hatice; Gündogdu, Aycan; Havulinna, Aki S; Kaka Bra, Kardokh Hama Najib; Kalluci, Eglantina; Karav, Sercan; Lode, Daniel; Lopes, Marta B; May, Patrick; Nap, Bram; Nedyalkova, Miroslava; Paciência, Inês; Pasic, Lejla; Pujolassos, Meritxell; Shigdel, Rajesh; Susín, Antonio; Thiele, Ines; Truica, Ciprian-Octavian; Wilmes, Paul; Yilmaz, Ercument; Yousef, Malik; Claesson, Marcus Joakim; Truu, Jaak; Carrillo de Santa Pau, Enrique.

Front Microbiol ; 14: 1250806, 2023.

Article in English | MEDLINE | ID: mdl-38075858

ABSTRACT

The human microbiome has become an area of intense research due to its potential impact on human health. However, the analysis and interpretation of microbiome data have proven to be challenging due to its complexity and high dimensionality. Machine learning (ML) algorithms can process vast amounts of data to uncover informative patterns and relationships within the data, even with limited prior knowledge. Therefore, there has been rapid growth in the development of software specifically designed for the analysis and interpretation of microbiome data using ML techniques. These software tools incorporate a wide range of ML algorithms for clustering, classification, regression, or feature selection, to identify microbial patterns and relationships within the data and to generate predictive models. This rapid development, with its constant need for new developments and integration of new features, requires efforts to compile, catalog, and classify these tools in order to create infrastructures and services with easy, transparent, and trustworthy standards. Here we review the state of the art for ML tools applied in human microbiome studies, performed as part of the COST Action ML4Microbiome activities. This scoping review focuses on ML-based software and framework resources currently available for the analysis of microbiome data in humans. The aim is to help microbiologists and biomedical scientists go deeper into specialized resources that integrate ML techniques, and to facilitate future benchmarking to create standards for the analysis of microbiome data. The software resources are organized based on the type of analysis they were developed for and the ML techniques they implement. A description of each software tool with examples of usage is provided, including comments about pitfalls and shortcomings in the usage of ML-based software for microbiome data that need to be considered by developers and users. This review represents an extensive compilation to date, offering valuable insights and guidance for researchers interested in leveraging ML approaches for microbiome analysis.

18.

TextNetTopics: Text Classification Based Word Grouping as Topics and Topics' Scoring.

Yousef, Malik; Voskergian, Daniel.

Front Genet ; 13: 893378, 2022.

Article in English | MEDLINE | ID: mdl-35795215

ABSTRACT

Medical document classification is an active research problem and one of the most challenging within the text classification domain. Medical datasets often contain massive feature sets in which many features are irrelevant or redundant and add noise, thus reducing classification performance. Therefore, to improve the accuracy of a classification model, it is crucial to choose a set of features (terms) that best discriminate between the classes of medical documents. This study proposes TextNetTopics, a novel approach that applies feature selection by considering Bag-of-topics (BOT) rather than the traditional Bag-of-words (BOW) approach. Thus, our approach performs topic selection rather than word selection. TextNetTopics is based on the generic approach entitled G-S-M (Grouping, Scoring, and Modeling), developed by Yousef and his colleagues and used mainly on biological data. The proposed approach scores topics in order to select the top topics for training the classifier. This study applied TextNetTopics to textual data in response to the CAMDA challenge. TextNetTopics outperforms various feature selection approaches and also performs well when the model is applied to the validation data provided by CAMDA. Additionally, we have applied our algorithm to different textual datasets.
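The difference between a Bag-of-words and a Bag-of-topics representation can be shown with a small sketch. The topics here are hand-written word sets with hypothetical names; how TextNetTopics actually derives its topics is not stated in the abstract, so treat the `TOPICS` dictionary and the whitespace tokenizer as assumptions.

```python
from collections import Counter

# Hypothetical topics: each topic is a named group of words.
TOPICS = {
    "cardiac": {"heart", "artery", "cardiac"},
    "renal": {"kidney", "renal", "dialysis"},
}


def bag_of_words(text):
    """Traditional representation: one feature per distinct word."""
    return Counter(text.lower().split())


def bag_of_topics(text, topics):
    """Topic representation: one feature per topic, counting how many
    tokens of the document fall into that topic's word group."""
    counts = bag_of_words(text)
    return {name: sum(counts[word] for word in words)
            for name, words in topics.items()}


if __name__ == "__main__":
    doc = "Cardiac arrest and heart failure after dialysis"
    print(bag_of_words(doc))
    print(bag_of_topics(doc, TOPICS))
```

Feature selection then operates on the topic columns, so dropping one low-scoring topic removes a whole group of words at once.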

19.

miRModuleNet: Detecting miRNA-mRNA Regulatory Modules.

Yousef, Malik; Goy, Gokhan; Bakir-Gungor, Burcu.

Front Genet ; 13: 767455, 2022.

Article in English | MEDLINE | ID: mdl-35495139

ABSTRACT

Increasing evidence that microRNAs (miRNAs) play a key role in carcinogenesis has revealed the need to elucidate the mechanisms of miRNA regulation and the roles of miRNAs in gene-regulatory networks. A better understanding of the interactions between miRNAs and their mRNA targets will clarify the complex biological processes that occur during carcinogenesis. Increased efforts to reveal these interactions have led to the development of a variety of tools to detect and understand them. We have recently described a machine learning approach, miRcorrNet, based on grouping and scoring (ranking) groups of genes, where each group is associated with a miRNA and the group members are genes whose expression patterns are correlated with this specific miRNA. The miRcorrNet tool requires two types of omics data, miRNA and mRNA expression profiles, as input. In this study we describe miRModuleNet, which groups the mRNAs (genes) that are correlated with each miRNA to form a star shape, which we identify as a miRNA-mRNA regulatory module. A scoring procedure is then applied to each module to assess its contribution to classification. An important output of miRModuleNet is a ranked list of significant miRNA-mRNA regulatory modules. The modules reported by miRModuleNet were further validated on external datasets for their disease associations, and functional enrichment analysis was also performed. The application of miRModuleNet aids the identification of functional relationships between significant biomarkers and reveals essential pathways involved in cancer pathogenesis. The miRModuleNet tool and all other supplementary files are available at https://github.com/malikyousef/miRModuleNet/.
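A star-shaped module of the kind described above can be sketched as a miRNA at the center with its correlated genes as the points. This is only an illustration under stated assumptions: the expression values and identifiers are invented, Pearson correlation is used as the correlation measure, and the absolute-value cutoff of 0.8 is an arbitrary choice rather than a setting taken from miRModuleNet.

```python
from math import sqrt

# Hypothetical expression profiles across five samples.
MIRNA_EXPR = {"miR-hypo-1": [1.0, 2.0, 3.0, 4.0, 5.0]}
MRNA_EXPR = {
    "GENE_A": [5.0, 4.0, 3.0, 2.0, 1.0],   # strongly anti-correlated
    "GENE_B": [2.0, 4.0, 6.0, 8.0, 10.5],  # strongly correlated
    "GENE_C": [3.0, 1.0, 4.0, 1.0, 5.0],   # weakly correlated
}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return cov / (norm_x * norm_y)


def build_modules(mirna_expr, mrna_expr, threshold=0.8):
    """Group, for each miRNA, the genes whose |correlation| with it meets
    the threshold. Each entry is one star: miRNA -> list of genes."""
    modules = {}
    for mirna, mirna_values in mirna_expr.items():
        members = [gene for gene, gene_values in mrna_expr.items()
                   if abs(pearson(mirna_values, gene_values)) >= threshold]
        if members:
            modules[mirna] = members
    return modules


if __name__ == "__main__":
    print(build_modules(MIRNA_EXPR, MRNA_EXPR))
```

Each resulting module could then be scored by how well its genes separate the sample classes, which is the step the abstract describes next.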

20.

GediNET for discovering gene associations across diseases using knowledge based machine learning approach.

Qumsiyeh, Emma; Showe, Louise; Yousef, Malik.

Sci Rep ; 12(1): 19955, 2022 Nov 19.

Article in English | MEDLINE | ID: mdl-36402891

ABSTRACT

The most common approaches to discovering genes associated with specific diseases are based on machine learning and use a variety of feature selection techniques to identify significant genes that can serve as biomarkers for a given disease. More recently, the integration of prior knowledge-based approaches into this process has shown significant promise in the discovery of new biomarkers with potential translational applications. In this study, we developed a novel approach, GediNET, that integrates prior biological knowledge to gene Groups that are shown to be associated with a specific disease such as cancer. The novelty of GediNET is that it then also allows the discovery of significant associations between that specific disease and other diseases. The initial step in this process involves the identification of gene Groups. The Groups are then subjected to a Scoring component to identify the top performing classification Groups. The top-ranked gene Groups are then used to train a Machine Learning Model. The process of Grouping, Scoring and Modelling (G-S-M) is used by GediNET to identify other diseases that are similarly associated with this signature. GediNET identifies these relationships through Disease-Disease Association (DDA) based machine learning. DDA explores novel associations between diseases and identifies relationships which could be used to further improve approaches to diagnosis, prognosis, and treatment. The GediNET KNIME workflow can be downloaded from: https://github.com/malikyousef/GediNET.git or https://kni.me/w/3kH1SQV_mMUsMTS.
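The last two G-S-M stages in this abstract, keeping the top-ranked Groups and training a model on their genes, can be sketched as follows. All of it is illustrative: the group names, gene identifiers, group scores, and expression values are invented, and a nearest-centroid classifier is swapped in as a simple stand-in for the Machine Learning Model, which the abstract does not specify.

```python
# Hypothetical gene Groups (for example, genes linked to a disease) and
# hypothetical scores assumed to come from an earlier Scoring step.
GROUPS = {
    "group_X": ["g1", "g2"],
    "group_Y": ["g2", "g3"],
    "group_Z": ["g4"],
}
GROUP_SCORES = {"group_X": 0.91, "group_Y": 0.85, "group_Z": 0.55}

# Hypothetical expression samples with class labels.
SAMPLES = [
    {"g1": 5.0, "g2": 4.0, "g3": 1.0, "g4": 2.0},
    {"g1": 6.0, "g2": 5.0, "g3": 1.5, "g4": 9.0},
    {"g1": 1.0, "g2": 1.0, "g3": 4.0, "g4": 2.5},
    {"g1": 0.5, "g2": 2.0, "g3": 5.0, "g4": 8.0},
]
LABELS = ["tumor", "tumor", "normal", "normal"]


def select_top_groups(group_scores, k):
    """Keep the names of the k best-scoring groups."""
    ranked = sorted(group_scores, key=group_scores.get, reverse=True)
    return ranked[:k]


def pooled_features(groups, chosen):
    """Union of the genes in the chosen groups, in first-seen order."""
    features = []
    for name in chosen:
        for gene in groups[name]:
            if gene not in features:
                features.append(gene)
    return features


def train_centroids(samples, labels, features):
    """Modeling (stand-in): one mean vector per class over the features."""
    sums, counts = {}, {}
    for sample, label in zip(samples, labels):
        vector = sums.setdefault(label, [0.0] * len(features))
        for i, gene in enumerate(features):
            vector[i] += sample[gene]
        counts[label] = counts.get(label, 0) + 1
    return {label: [value / counts[label] for value in vector]
            for label, vector in sums.items()}


def predict(sample, centroids, features):
    """Assign the class whose centroid is nearest in squared distance."""
    def distance(centroid):
        return sum((sample[gene] - centroid[i]) ** 2
                   for i, gene in enumerate(features))
    return min(centroids, key=lambda label: distance(centroids[label]))


if __name__ == "__main__":
    chosen = select_top_groups(GROUP_SCORES, k=2)
    features = pooled_features(GROUPS, chosen)
    model = train_centroids(SAMPLES, LABELS, features)
    new_sample = {"g1": 5.2, "g2": 4.1, "g3": 0.9, "g4": 8.5}
    print(chosen, features, predict(new_sample, model, features))
```

Because `g4` belongs only to the low-scoring group, it is excluded from the model, so its noisy values do not influence the prediction.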

Subjects

Knowledge Bases, Machine Learning, Biomarkers, Proteomics
