Search | BVS CLAP/SMR-OPS/OMS

1.

PriPath: identifying dysregulated pathways from differential gene expression via grouping, scoring, and modeling with an embedded feature selection approach.

Yousef, Malik; Ozdemir, Fatma; Jaber, Amhar; Allmer, Jens; Bakir-Gungor, Burcu.

BMC Bioinformatics ; 24(1): 60, 2023 Feb 23.

Article in English | MEDLINE | ID: mdl-36823571

ABSTRACT

BACKGROUND: Cell homeostasis relies on the concerted actions of genes, and dysregulated genes can lead to diseases. In living organisms, genes or their products do not act alone but within networks. Subsets of these networks can be viewed as modules that provide specific functionality to an organism. The Kyoto encyclopedia of genes and genomes (KEGG) systematically analyzes gene functions, proteins, and molecules and combines them into pathways. Measurements of gene expression (e.g., RNA-seq data) can be mapped to KEGG pathways to determine which modules are affected or dysregulated in the disease. However, genes acting in multiple pathways and other inherent issues complicate such analyses. Many current approaches may only employ gene expression data and need to pay more attention to some of the existing knowledge stored in KEGG pathways for detecting dysregulated pathways. New methods that consider more precompiled information are required for a more holistic association between gene expression and diseases. RESULTS: PriPath is a novel approach that transfers the generic process of grouping and scoring, followed by modeling to analyze gene expression with KEGG pathways. In PriPath, KEGG pathways are utilized as the grouping function as part of a machine learning algorithm for selecting the most significant KEGG pathways. A machine learning model is trained to differentiate between diseases and controls using those groups. We have tested PriPath on 13 gene expression datasets of various cancers and other diseases. Our proposed approach successfully assigned biologically and clinically relevant KEGG terms to the samples based on the differentially expressed genes. We have comparatively evaluated the performance of PriPath against other tools, which are similar in their merit. For each dataset, we manually confirmed the top results of PriPath in the literature and found that most predictions can be supported by previous experimental research. 
CONCLUSIONS: PriPath can thus aid in determining dysregulated pathways, which applies to medical diagnostics. In the future, we aim to advance this approach so that it can perform patient stratification based on gene expression and identify druggable targets. Thereby, we cover two aspects of precision medicine.
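The grouping and scoring steps described in this abstract can be sketched in a few lines. This is only an illustration under stated assumptions, not PriPath's implementation: the pathway names, gene names, and expression values are hypothetical, and the score used here (mean absolute difference between class means) is a simple stand-in for whatever scoring the tool performs internally.

```python
from statistics import mean

def score_group(expression, labels, genes):
    """Score one gene group: mean absolute difference between the
    disease (label 1) and control (label 0) class means of its genes.
    Assumed scoring rule, chosen only for illustration."""
    diffs = []
    for gene in genes:
        case = [row[gene] for row, y in zip(expression, labels) if y == 1]
        ctrl = [row[gene] for row, y in zip(expression, labels) if y == 0]
        diffs.append(abs(mean(case) - mean(ctrl)))
    return mean(diffs)

def rank_groups(expression, labels, groups):
    """Grouping + scoring: score every group and sort best-first."""
    scores = {name: score_group(expression, labels, genes)
              for name, genes in groups.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical toy data: four samples, three genes.
expression = [
    {"g1": 5.0, "g2": 4.0, "g3": 1.0},
    {"g1": 6.0, "g2": 5.0, "g3": 1.0},
    {"g1": 1.0, "g2": 1.0, "g3": 1.0},
    {"g1": 2.0, "g2": 2.0, "g3": 1.0},
]
labels = [1, 1, 0, 0]  # 1 = disease, 0 = control
# Hypothetical grouping function: pathway name -> member genes.
groups = {"pathway_A": ["g1", "g2"], "pathway_B": ["g3"]}

ranked = rank_groups(expression, labels, groups)
print(ranked)
```

In a full grouping-scoring-modeling workflow, the genes of the top-ranked groups would then be passed to a classifier for the modeling step.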

Subject(s)

Computational Biology , Neoplasms , Humans , Computational Biology/methods , Neoplasms/genetics , Genome , Algorithms , Gene Expression , Gene Expression Profiling

2.

Predictive factors for degenerative lumbar spinal stenosis: a model obtained from a machine learning algorithm technique.

Abbas, Janan; Yousef, Malik; Peled, Natan; Hershkovitz, Israel; Hamoud, Kamal.

BMC Musculoskelet Disord ; 24(1): 218, 2023 Mar 23.

Article in English | MEDLINE | ID: mdl-36949452

ABSTRACT

BACKGROUND: Degenerative lumbar spinal stenosis (DLSS) is the most common spine disease in the elderly population. It is usually associated with degeneration of the lumbar spine joints and/or ligaments. Machine learning is well suited to big data analysis; however, it has rarely been applied to spine pathology. This study aims to detect the essential variables that predict the development of symptomatic DLSS using the random forest machine learning (ML) algorithm. METHODS: A retrospective study with two groups of individuals. The first included 165 individuals with symptomatic DLSS (sex ratio: 80 M/85 F), and the second included 180 individuals from the general population (sex ratio: 90 M/90 F) without lumbar spinal stenosis symptoms. Lumbar spine measurements such as vertebral or spinal canal diameters from L1 to S1 were taken on computed tomography (CT) images. Demographic and health data of all the participants (e.g., body mass index and diabetes mellitus) were also recorded. RESULTS: The decision tree model of ML demonstrates that the anteroposterior diameter of the bony canal at the L5 (males) and L4 (females) levels has the greatest influence on symptomatic DLSS (scores of 1 and 0.938). In addition, a combination of these variables with other lumbar spine features is required for DLSS to develop. CONCLUSIONS: Our results indicate that a combination of lumbar spine characteristics such as bony canal and vertebral body dimensions, rather than the presence of a sole variable, is highly associated with symptomatic DLSS onset.

Subject(s)

Spinal Diseases , Spinal Stenosis , Male , Female , Humans , Aged , Spinal Stenosis/diagnosis , Retrospective Studies , Spinal Diseases/pathology , Tomography, X-Ray Computed , Lumbar Vertebrae/diagnostic imaging , Lumbar Vertebrae/pathology , Algorithms

3.

Correlates of Hookah Smoking among Arab Adults in Israel Identified by a Machine Learning Algorithm.

Khatib, Mohammad; Sheikh Muhammad, Ahmad; Hadid, Salam; Ben Shlomo, Izhar; Yousef, Malik.

Isr Med Assoc J ; 24(4): 246-252, 2022 Apr.

Article in English | MEDLINE | ID: mdl-35415984

ABSTRACT

BACKGROUND: Hookah smoking is a common activity around the world and has recently become a trend among youth. Studies have indicated a relationship between hookah smoking and a high prevalence of chronic diseases, cancer, and cardiovascular and infectious diseases. In Israel, there has been a sharp increase in hookah smoking among Arabs. Most studies have focused mainly on hookah smoking among young people. OBJECTIVES: To examine the association between hookah smoking and socioeconomic characteristics, health status and behaviors, and knowledge in the adult Arab population and to build a prediction model using machine learning methods. METHODS: This quantitative study is based on data from the Health and Environment Survey conducted by the Galilee Society in 2015-2016. The data were collected through face-to-face interviews with 2046 adults aged 18 years and older. RESULTS: Using machine learning, a prediction model was built based on eight features. Of the total study population, 13.0% smoked hookah. In the 18-34 age group, 19.5% smoked. Men, people with a lower level of health knowledge, heavy consumers of energy drinks and alcohol, and unemployed people were more likely to smoke hookah. Younger and more educated people were more likely to smoke hookah. CONCLUSIONS: Hookah smoking is a widespread behavior among adult Arabs in Israel. The model generated by our study is intended to help health organizations reach people at risk for smoking hookah and to suggest different approaches to eliminate this phenomenon.

Subject(s)

Arabs , Water Pipe Smoking , Adolescent , Adult , Algorithms , Humans , Israel/epidemiology , Machine Learning , Male , Water Pipe Smoking/epidemiology , Young Adult

4.

maTE: discovering expressed interactions between microRNAs and their targets.

Yousef, Malik; Abdallah, Loai; Allmer, Jens.

Bioinformatics ; 35(20): 4020-4028, 2019 Oct 15.

Article in English | MEDLINE | ID: mdl-30895309

ABSTRACT

MOTIVATION: Disease is often manifested via changes in transcript and protein abundance. MicroRNAs (miRNAs) are instrumental in regulating protein abundance and may measurably influence transcript levels. miRNAs often target more than one mRNA (for humans, the average is three), and mRNAs are often targeted by more than one miRNA (for the genes considered in this study, the average is also three). Therefore, it is difficult to determine the miRNAs that may cause the observed differential gene expression. We present a novel approach, maTE, which is based on machine learning, that integrates information about miRNA target genes with gene expression data. maTE depends on the availability of a sufficient amount of patient and control samples. The samples are used to train classifiers to accurately classify the samples on a per miRNA basis. Multiple high scoring miRNAs are used to build a final classifier to improve separation. RESULTS: The aim of the study is to find a set of miRNAs causing the regulation of their target genes that best explains the difference between groups (e.g. cancer versus control). maTE provides a list of significant groups of genes where each group is targeted by a specific miRNA. For the datasets used in this study, maTE generally achieves an accuracy well above 80%. Also, the results show that when the accuracy is much lower (e.g. ~50%), the set of miRNAs provided is likely not causative of the difference in expression. This new approach of integrating miRNA regulation with expression data yields powerful results and is independent of external labels and training data. Thereby, this approach allows new avenues for exploring miRNA regulation and may enable the development of miRNA-based biomarkers and drugs. AVAILABILITY AND IMPLEMENTATION: The KNIME workflow, implementing maTE, is available at Bioinformatics online. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

MicroRNAs/genetics , Gene Expression Profiling , Humans , Machine Learning , Neoplasms , RNA, Messenger

5.

Application of Biological Domain Knowledge Based Feature Selection on Gene Expression Data.

Yousef, Malik; Kumar, Abhishek; Bakir-Gungor, Burcu.

Entropy (Basel) ; 23(1), 2020 Dec 22.

Article in English | MEDLINE | ID: mdl-33374969

ABSTRACT

In the last two decades, there have been massive advancements in high throughput technologies, which resulted in the exponential growth of public repositories of gene expression datasets for various phenotypes. It is possible to unravel biomarkers by comparing the gene expression levels under different conditions, such as disease vs. control, treated vs. not treated, drug A vs. drug B, etc. This problem refers to a well-studied problem in the machine learning domain, i.e., the feature selection problem. In biological data analysis, most of the computational feature selection methodologies were taken from other fields, without considering the nature of the biological data. Thus, integrative approaches that utilize the biological knowledge while performing feature selection are necessary for this kind of data. The main idea behind the integrative gene selection process is to generate a ranked list of genes considering both the statistical metrics that are applied to the gene expression data, and the biological background information which is provided as external datasets. One of the main goals of this review is to explore the existing methods that integrate different types of information in order to improve the identification of the biomolecular signatures of diseases and the discovery of new potential targets for treatment. These integrative approaches are expected to aid the prediction, diagnosis, and treatment of diseases, as well as to enlighten us on disease state dynamics, mechanisms of their onset and progression. The integration of various types of biological information will necessitate the development of novel techniques for integration and data analysis. Another aim of this review is to boost the bioinformatics community to develop new approaches for searching and determining significant groups/clusters of features based on one or more biological grouping functions.

6.

MicroRNA categorization using sequence motifs and k-mers.

Yousef, Malik; Khalifa, Waleed; Acar, Ilhan Erkin; Allmer, Jens.

BMC Bioinformatics ; 18(1): 170, 2017 Mar 14.

Article in English | MEDLINE | ID: mdl-28292266

ABSTRACT

BACKGROUND: Post-transcriptional gene dysregulation can be a hallmark of diseases like cancer and microRNAs (miRNAs) play a key role in the modulation of translation efficiency. Known pre-miRNAs are listed in miRBase, and they have been discovered in a variety of organisms ranging from viruses and microbes to eukaryotic organisms. The computational detection of pre-miRNAs is of great interest, and such approaches usually employ machine learning to discriminate between miRNAs and other sequences. Many features have been proposed describing pre-miRNAs, and we have previously introduced the use of sequence motifs and k-mers as useful ones. There have been reports of xeno-miRNAs detected via next generation sequencing. However, they may be contaminations and to aid that important decision-making process, we aimed to establish a means to differentiate pre-miRNAs from different species. RESULTS: To achieve distinction into species, we used one species' pre-miRNAs as the positive and another species' pre-miRNAs as the negative training and test data for the establishment of machine learned models based on sequence motifs and k-mers as features. This approach resulted in higher accuracy values between distantly related species while species with closer relation produced lower accuracy values. CONCLUSIONS: We were able to differentiate among species with increasing success when the evolutionary distance increases. This conclusion is supported by previous reports of fast evolutionary changes in miRNAs since even in relatively closely related species a fairly good discrimination was possible.
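As a rough illustration of the k-mer features mentioned in this abstract, the sketch below turns a sequence into a fixed-length vector of k-mer frequencies. The function name, the default k, and the example sequence are assumptions made for illustration; the abstract does not specify how the authors computed their features.

```python
from collections import Counter
from itertools import product

def kmer_frequencies(sequence, k=2, alphabet="ACGU"):
    """Return the relative frequency of every possible k-mer over the
    alphabet, counted with a sliding window of width k."""
    # Normalize case and treat DNA-style T as RNA U.
    sequence = sequence.upper().replace("T", "U")
    windows = max(len(sequence) - k + 1, 0)
    counts = Counter(sequence[i:i + k] for i in range(windows))
    all_kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    # Windows containing symbols outside the alphabet count toward the
    # total but do not appear as features.
    return {kmer: (counts[kmer] / windows if windows else 0.0)
            for kmer in all_kmers}

# Hypothetical short sequence, not a real pre-miRNA.
features = kmer_frequencies("AUGCAUG", k=2)
print(len(features), features["AU"])
```

Vectors of this kind, one per sequence, can serve as the input to any standard classifier trained with one species as the positive class and another as the negative class.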

Subject(s)

MicroRNAs/metabolism , Animals , Base Sequence , Fabaceae/classification , Fabaceae/genetics , High-Throughput Nucleotide Sequencing , Humans , MicroRNAs/chemistry , MicroRNAs/genetics , Phylogeny , RNA Precursors/genetics , RNA Precursors/metabolism

7.

Sequence-based information-theoretic features for gene essentiality prediction.

Nigatu, Dawit; Sobetzko, Patrick; Yousef, Malik; Henkel, Werner.

BMC Bioinformatics ; 18(1): 473, 2017 Nov 09.

Article in English | MEDLINE | ID: mdl-29121868

ABSTRACT

BACKGROUND: Identification of essential genes is not only useful for our understanding of the minimal gene set required for cellular life but also aids the identification of novel drug targets in pathogens. In this work, we present a simple and effective gene essentiality prediction method using information-theoretic features that are derived exclusively from the gene sequences. RESULTS: We developed a Random Forest classifier and performed an extensive model performance evaluation among and within 15 selected bacteria. In intra-organism predictions, where training and testing sets are taken from the same organism, AUC (Area Under the Curve) scores ranging from 0.73 to 0.90, 0.84 on average, were obtained. Cross-organism predictions using 5-fold cross-validation, pairwise, leave-one-species-out, leave-one-taxon-out, and cross-taxon yielded average AUC scores of 0.88, 0.75, 0.80, 0.82, and 0.78, respectively. To further show the applicability of our method in other domains of life, we predicted the essential genes of the yeast Schizosaccharomyces pombe and obtained a similar accuracy (AUC 0.84). CONCLUSIONS: The proposed method enables a simple and reliable identification of essential genes without searching in databases for orthologs and demanding further experimental data such as network topology and gene-expression.
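The abstract does not list the individual features, so the following is only a guess at the general flavor: Shannon entropy of the nucleotide composition is one common information-theoretic quantity that can be computed from a gene sequence alone. The function name and example sequences are hypothetical.

```python
from collections import Counter
from math import log2

def shannon_entropy(sequence):
    """Shannon entropy (in bits per symbol) of the symbol composition
    of a sequence: H = -sum(p * log2(p)) over observed symbols."""
    counts = Counter(sequence.upper())
    n = len(sequence)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# Uniform use of four nucleotides gives the maximum of 2 bits.
uniform = shannon_entropy("ACGT")
# A homopolymer carries no compositional information.
homopolymer = shannon_entropy("AAAA")
print(uniform, homopolymer)
```

A value like this, computed per gene, would be one column in the feature table given to a classifier such as the Random Forest named in the abstract.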

Subject(s)

Bacteria/genetics , Genes, Essential , Models, Theoretical , Area Under Curve , Base Sequence , Machine Learning , Markov Chains , ROC Curve

8.

Novel Antimicrobial Peptide Design Using Motif Match Score Representation.

Soylemez, Ummu Gulsum; Yousef, Malik; Bakir-Gungor, Burcu.

IEEE/ACM Trans Comput Biol Bioinform ; PP, 2024 Jun 12.

Article in English | MEDLINE | ID: mdl-38865233

ABSTRACT

Antimicrobial peptides (AMPs) have drawn the interest of researchers since they offer an alternative to traditional antibiotics in the fight against antibiotic resistance and exhibit additional pharmaceutically significant properties. Recently, computational approaches have attempted to reveal how antibacterial activity is determined from a machine learning perspective, and they aim to find the biological cues or characteristics that control antimicrobial activity by incorporating motif match scores. This study is dedicated to the development of a machine learning framework aimed at devising novel antimicrobial peptide (AMP) sequences potentially effective against Gram-positive/Gram-negative bacteria. In order to design newly generated sequences classified as either AMP or non-AMP, various classification models were trained. These novel sequences underwent validation utilizing the "DBAASP: strain-specific antibacterial prediction based on machine learning approaches and data on AMP sequences" tool. The findings presented herein represent a significant stride in this computational research, streamlining the process of AMP creation or modification within wet lab environments.

9.

CCPred: Global and population-specific colorectal cancer prediction and metagenomic biomarker identification at different molecular levels using machine learning techniques.

Bakir-Gungor, Burcu; Temiz, Mustafa; Inal, Yasin; Cicekyurt, Emre; Yousef, Malik.

Comput Biol Med ; 182: 109098, 2024 Sep 17.

Article in English | MEDLINE | ID: mdl-39293338

ABSTRACT

Colorectal cancer (CRC) ranks as the third most common cancer globally and the second leading cause of cancer-related deaths. Recent research highlights the pivotal role of the gut microbiota in CRC development and progression. Understanding the complex interplay between disease development and metagenomic data is essential for CRC diagnosis and treatment. Current computational models employ machine learning to identify metagenomic biomarkers associated with CRC, yet there is a need to improve their accuracy through a holistic biological knowledge perspective. This study aims to evaluate CRC-associated metagenomic data at the species, enzyme, and pathway levels by conducting global and population-specific analyses. These analyses utilize relative abundance values from human gut microbiome sequencing data, and robust classification models are built for disease prediction and biomarker identification. For global CRC prediction and biomarker identification, the features that are identified by SelectKBest (SKB), Information Gain (IG), and Extreme Gradient Boosting (XGBoost) methods are combined. Population-based analysis includes within-population, leave-one-dataset-out (LODO), and cross-population approaches. Four classification algorithms are employed for CRC classification. Random Forest achieved an AUC of 0.83 for species data, 0.78 for enzyme data, and 0.76 for pathway data globally. On the global scale, potential taxonomic biomarkers include Ruthenibacterium lactatiformans; enzyme biomarkers include RNA 2',3'-cyclic 3'-phosphodiesterase; and pathway biomarkers include the pyruvate fermentation to acetone pathway. This study underscores the potential of machine learning models trained on metagenomic data for improved disease prediction and biomarker discovery. The proposed model and associated files are available at https://github.com/TemizMus/CCPRED.
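The leave-one-dataset-out (LODO) scheme named in this abstract can be illustrated with a small generator that holds out one cohort at a time. This is a minimal sketch, not code from the CCPred repository; the cohort names, feature values, and tuple layout are hypothetical.

```python
def leave_one_dataset_out(samples):
    """Yield (held_out_id, train, test) splits, where each dataset in
    turn becomes the test set and all others form the training set.
    Each sample is a tuple (dataset_id, features, label)."""
    dataset_ids = sorted({dataset_id for dataset_id, _, _ in samples})
    for held_out in dataset_ids:
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test

# Hypothetical samples from three cohorts (relative abundances, label).
samples = [
    ("cohort_A", [0.10, 0.90], 1),
    ("cohort_A", [0.70, 0.30], 0),
    ("cohort_B", [0.20, 0.80], 1),
    ("cohort_B", [0.60, 0.40], 0),
    ("cohort_C", [0.15, 0.85], 1),
]

folds = list(leave_one_dataset_out(samples))
for held_out, train, test in folds:
    print(held_out, len(train), len(test))
```

Because no sample from the held-out cohort is seen during training, the resulting scores estimate how well a model transfers to a population it was not fitted on.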

10.

Delta-aminolevulinate-induced host-parasite porphyric disparity for selective photolysis of transgenic Leishmania in the phagolysosomes of mononuclear phagocytes: a potential novel platform for vaccine delivery.

Dutta, Sujoy; Chang, Celia; Kolli, Bala Krishna; Sassa, Shigeru; Yousef, Malik; Showe, Michael; Showe, Louise; Chang, Kwang-Poo.

Eukaryot Cell ; 11(4): 430-41, 2012 Apr.

Article in English | MEDLINE | ID: mdl-22307976

ABSTRACT

Leishmania double transfectants (DTs) expressing the 2nd and 3rd enzymes in the heme biosynthetic pathway were previously reported to show neogenesis of uroporphyrin I (URO) when induced with delta-aminolevulinate (ALA), the product of the 1st enzyme in the pathway. The ensuing accumulation of URO in DT promastigotes rendered them light excitable to produce reactive oxygen species (ROS), resulting in their cytolysis. Evidence is presented showing that the DTs retained wild-type infectivity to their host cells and that the intraphagolysosomal/parasitophorous vacuolar (PV) DTs remained ALA inducible for uroporphyrinogenesis/photolysis. Exposure of DT-infected cells to ALA was noted by fluorescence microscopy to result in host-parasite differential porphyrinogenesis: porphyrin fluorescence emerged first in the host cells and then in the intra-PV amastigotes. DT-infected and control cells differed qualitatively and quantitatively in their porphyrin species, consistent with the expected multi- and monoporphyrinogenic specificities of the host cells and the DTs, respectively. After ALA removal, the neogenic porphyrins were rapidly lost from the host cells but persisted as URO in the intra-PV DTs. These DTs were thus extremely light sensitive and were lysed selectively by illumination under nonstringent conditions in the relatively ROS-resistant phagolysosomes. Photolysis of the intra-PV DTs returned the distribution of major histocompatibility complex (MHC) class II molecules and the global gene expression profiles of host cells to their preinfection patterns and, when transfected with ovalbumin, released this antigen for copresentation with MHC class I molecules. These Leishmania mutants thus have considerable potential as a novel model of a universal vaccine carrier for photodynamic immunotherapy/immunoprophylaxis.

Subject(s)

Aminolevulinic Acid/pharmacology , Leishmania/genetics , Phagocytes/parasitology , Phagosomes/parasitology , Photosensitizing Agents/pharmacology , Porphyrins/biosynthesis , Vaccination/methods , Animals , Antigen Presentation , Antigens, Protozoan/immunology , Cells, Cultured , Dendritic Cells/metabolism , Dendritic Cells/parasitology , Dendritic Cells/radiation effects , Gene Expression Profiling , Histocompatibility Antigens Class I/metabolism , Leishmania/immunology , Leishmania/radiation effects , Macrophages, Peritoneal/metabolism , Macrophages, Peritoneal/parasitology , Macrophages, Peritoneal/radiation effects , Mice , Mice, Inbred BALB C , Oligonucleotide Array Sequence Analysis , Organisms, Genetically Modified/immunology , Photolysis

11.

Deep learning in bioinformatics.

Yousef, Malik; Allmer, Jens.

Turk J Biol ; 47(6): 366-382, 2023.

Article in English | MEDLINE | ID: mdl-38681776

ABSTRACT

Deep learning is a powerful machine learning technique that can learn from large amounts of data using multiple layers of artificial neural networks. This paper reviews some applications of deep learning in bioinformatics, a field that deals with analyzing and interpreting biological data. We first introduce the basic concepts of deep learning and then survey the recent advances and challenges of applying deep learning to various bioinformatics problems, such as genome sequencing, gene expression analysis, protein structure prediction, drug discovery, and disease diagnosis. We also discuss future directions and opportunities for deep learning in bioinformatics. We aim to provide an overview of deep learning so that bioinformaticians applying deep learning models can consider all critical technical and ethical aspects. Thus, our target audience is biomedical informatics researchers who use deep learning models for inference. This review will inspire more bioinformatics researchers to adopt deep-learning methods for their research questions while considering fairness, potential biases, explainability, and accountability.

12.

miRGediNET: A comprehensive examination of common genes in miRNA-Target interactions and disease associations: Insights from a grouping-scoring-modeling approach.

Qumsiyeh, Emma; Salah, Zaidoun; Yousef, Malik.

Heliyon ; 9(12): e22666, 2023 Dec.

Article in English | MEDLINE | ID: mdl-38090011

ABSTRACT

In the broad and complex field of biological data analysis, researchers frequently gather information from a single source or database. Despite being a widespread practice, this has disadvantages. Relying exclusively on a single source can limit our comprehension as it may omit various perspectives that could be obtained by combining multiple knowledge bases. Acknowledging this shortcoming, we report on miRGediNET, a novel approach combining information from three biological databases. Our investigation focuses on microRNAs (miRNAs), small non-coding RNA molecules that regulate gene expression post-transcriptionally. We delve deeply into the knowledge of these miRNAs' interactions with genes and the possible effects these interactions may have on different diseases. The scientific community has long recognized a direct correlation between the progression of specific diseases and miRNAs, as well as the genes they target. By using miRGediNET, we go beyond simply acknowledging this relationship. Rather, we actively look for the critical genes that could act as links between the actions of miRNAs and the mechanisms underlying disease. Our methodology, which carefully identifies and investigates these important genes, is supported by a strategic framework that may open up new possibilities for comprehending diseases and creating treatments. We have developed a tool on the KNIME platform as a concrete application of our research. This tool serves as both a validation of our study and an invitation to the larger community to interact with, investigate, and build upon our findings. miRGediNET is publicly accessible on GitHub at https://github.com/malikyousef/miRGediNET, providing a collaborative environment for additional research and innovation for enthusiasts and fellow researchers.

13.

GeNetOntology: identifying affected gene ontology terms via grouping, scoring, and modeling of gene expression data utilizing biological knowledge-based machine learning.

Ersoz, Nur Sebnem; Bakir-Gungor, Burcu; Yousef, Malik.

Front Genet ; 14: 1139082, 2023.

Article in English | MEDLINE | ID: mdl-37671046

ABSTRACT

Introduction: Identifying significant sets of genes that are up/downregulated under specific conditions is vital to understand disease development mechanisms at the molecular level. Along this line, in order to analyze transcriptomic data, several computational feature selection (i.e., gene selection) methods have been proposed. On the other hand, uncovering the core functions of the selected genes provides a deep understanding of diseases. In order to address this problem, biological domain knowledge-based feature selection methods have been proposed. Unlike computational gene selection approaches, these domain knowledge-based methods take the underlying biology into account and integrate knowledge from external biological resources. Gene Ontology (GO) is one such biological resource that provides ontology terms for defining the molecular function, cellular component, and biological process of the gene product. Methods: In this study, we developed a tool named GeNetOntology which performs GO-based feature selection for gene expression data analysis. In the proposed approach, the process of Grouping, Scoring, and Modeling (G-S-M) is used to identify significant GO terms. GO information has been used as the grouping information, which has been embedded into a machine learning (ML) algorithm to select informative ontology terms. The genes annotated with the selected ontology terms have been used in the training part to carry out the classification task of the ML model. The output is an important set of ontologies for the two-class classification task applied to gene expression data for a given phenotype. Results: Our approach has been tested on 11 different gene expression datasets, and the results showed that GeNetOntology successfully identified important disease-related ontology terms to be used in the classification model. 
Discussion: GeNetOntology will assist geneticists and scientists to identify a range of disease-related genes and ontologies in transcriptomic data analysis, and it will also help doctors design diagnosis platforms and improve patient treatment plans.

14.

TextNetTopics Pro, a topic model-based text classification for short text by integration of semantic and document-topic distribution information.

Voskergian, Daniel; Bakir-Gungor, Burcu; Yousef, Malik.

Front Genet ; 14: 1243874, 2023.

Article in English | MEDLINE | ID: mdl-37867598

ABSTRACT

With the exponential growth in the daily publication of scientific articles, automatic classification and categorization can assist in assigning articles to a predefined category. Article titles are concise descriptions of the articles' content with valuable information that can be useful in document classification and categorization. However, shortness, data sparseness, limited word occurrences, and the inadequate contextual information of scientific document titles hinder the direct application of conventional text mining and machine learning algorithms on these short texts, making their classification a challenging task. This study first explores the performance of our earlier approach, TextNetTopics, on short text. Second, we propose an advanced version called TextNetTopics Pro, a novel short-text classification framework that combines lexical features organized in topics of words with the topic distribution extracted by a topic model to alleviate the data-sparseness problem when classifying short texts. We evaluate our proposed approach using nine state-of-the-art short-text topic models on two publicly available datasets of scientific article titles as short-text documents. The first dataset is related to the biomedical field, and the other one is related to computer science publications. Additionally, we comparatively evaluate the predictive performance of the models generated with and without using the abstracts. Finally, we demonstrate the robustness and effectiveness of the proposed approach in handling imbalanced data, particularly in the classification of Drug-Induced Liver Injury articles as part of the CAMDA challenge. Taking advantage of the semantic information detected by topic models proved to be a reliable way to improve the overall performance of ML classifiers.

15.

microBiomeGSM: the identification of taxonomic biomarkers from metagenomic data using grouping, scoring and modeling (G-S-M) approach.

Bakir-Gungor, Burcu; Temiz, Mustafa; Jabeer, Amhar; Wu, Di; Yousef, Malik.

Front Microbiol ; 14: 1264941, 2023.

Artículo en Inglés | MEDLINE | ID: mdl-38075911

RESUMEN

The advent of metagenomic sequencing based on next-generation sequencing, which yields the relative abundance of microbial taxa, has made it possible to characterize numerous biological environments. Modeling the human microbiome with machine learning has the potential to identify microbial biomarkers and aid in the diagnosis of a variety of diseases, such as inflammatory bowel disease, diabetes, and colorectal cancer. The goal of this study is to develop an effective classification model for the analysis of metagenomic datasets associated with different diseases, and thereby to identify taxonomic biomarkers associated with these diseases and facilitate disease diagnosis. The microBiomeGSM tool presented in this work incorporates pre-existing taxonomy information into a machine learning approach to address the classification problem in disease-associated metagenomics datasets. Based on the G-S-M (Grouping-Scoring-Modeling) approach, species-level information is used as features, and these features are grouped by their taxonomy at higher levels, including genus, family, and order. Using four different disease-associated metagenomics datasets, the performance of microBiomeGSM is compared with that of other feature selection methods, namely Fast Correlation Based Filter (FCBF), Select K Best (SKB), Extreme Gradient Boosting (XGB), Conditional Mutual Information Maximization (CMIM), Minimum Redundancy Maximum Relevance (MRMR), and Information Gain (IG), and with other classifiers, namely AdaBoost, Decision Tree, LogitBoost, and Random Forest. microBiomeGSM achieved its best result, an area under the curve (AUC) of 0.98, at the order taxonomic level for the IBDMD dataset. Another significant output of microBiomeGSM is the list of taxonomic groups identified as important for the disease under study, together with the names of the species within these groups. The association between the detected species and the disease under investigation is confirmed by previous studies in the literature. The microBiomeGSM tool and other supplementary files are publicly available at: https://github.com/malikyousef/microBiomeGSM.
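The Grouping-Scoring-Modeling idea described in this abstract can be illustrated with a minimal sketch. It is not the microBiomeGSM implementation: the taxonomy mapping, the toy abundance values, the function names, and the use of a leave-one-out nearest-centroid classifier as the group score are all assumptions chosen to keep the example self-contained.

```python
from collections import defaultdict
from statistics import mean

def group_features(taxonomy):
    """Grouping: map each higher-level taxon (e.g. a genus) to its species."""
    groups = defaultdict(list)
    for species, taxon in taxonomy.items():
        groups[taxon].append(species)
    return dict(groups)

def nearest_centroid_predict(train, train_labels, x, features):
    """Predict the label whose class centroid is closest to x on the given features."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_labels)):
        rows = [s for s, y in zip(train, train_labels) if y == label]
        dist = sum((x[f] - mean(r[f] for r in rows)) ** 2 for f in features)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def score_group(samples, labels, features):
    """Scoring: leave-one-out accuracy using only one group's features."""
    hits = 0
    for i, (x, y) in enumerate(zip(samples, labels)):
        train = samples[:i] + samples[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        hits += nearest_centroid_predict(train, train_labels, x, features) == y
    return hits / len(samples)

def rank_groups(samples, labels, groups):
    """Rank groups by score; a final model would be trained on the top groups."""
    scores = {name: score_group(samples, labels, feats) for name, feats in groups.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical toy data: relative abundances of four species in six samples.
taxonomy = {"sp_a": "genus_1", "sp_b": "genus_1", "sp_c": "genus_2", "sp_d": "genus_2"}
samples = [
    {"sp_a": 0.90, "sp_b": 0.8, "sp_c": 0.5, "sp_d": 0.1},
    {"sp_a": 0.80, "sp_b": 0.9, "sp_c": 0.1, "sp_d": 0.5},
    {"sp_a": 0.85, "sp_b": 0.7, "sp_c": 0.3, "sp_d": 0.3},
    {"sp_a": 0.10, "sp_b": 0.2, "sp_c": 0.5, "sp_d": 0.1},
    {"sp_a": 0.20, "sp_b": 0.1, "sp_c": 0.1, "sp_d": 0.5},
    {"sp_a": 0.15, "sp_b": 0.3, "sp_c": 0.3, "sp_d": 0.3},
]
labels = ["disease"] * 3 + ["control"] * 3

ranking = rank_groups(samples, labels, group_features(taxonomy))
print(ranking)
```

In this toy setup only the species of the hypothetical `genus_1` separate the two classes, so that group is ranked first. The modeling step would then fit a classifier on the features of the top-ranked groups.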

16.

Review of feature selection approaches based on grouping of features.

Kuzudisli, Cihan; Bakir-Gungor, Burcu; Bulut, Nurten; Qaqish, Bahjat; Yousef, Malik.

PeerJ ; 11: e15666, 2023.

Article in English | MEDLINE | ID: mdl-37483989

ABSTRACT

With the rapid development of technology, large amounts of high-dimensional data have been generated. This high dimensionality, together with redundancy and irrelevancy, poses a great challenge in data analysis and decision making. Feature selection (FS) is an effective way to reduce dimensionality by eliminating redundant and irrelevant data. Most traditional FS approaches score and rank each feature individually and then perform FS either by eliminating lower-ranked features or by retaining highly ranked features. In this review, we discuss an emerging approach to FS that first groups features and then scores groups of features rather than individual features. Despite the presence of reviews on clustering and FS algorithms, to the best of our knowledge, this is the first review focusing on FS techniques based on grouping. The typical idea behind FS through grouping is to generate groups of similar features, with dissimilarity between groups, and then to select representative features from each group. Approaches under supervised, unsupervised, semi-supervised, and integrative frameworks are explored. The comparison of experimental results indicates the effectiveness of sequential, optimization-based (i.e., fuzzy or evolutionary), hybrid, and multi-method approaches. For biological data, the involvement of external biological sources can improve analysis results. We hope this work's findings can guide the effective design of new FS approaches using feature grouping.
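The grouping idea summarized in this abstract (form groups of similar features, then keep one representative per group) can be sketched as follows. This is one simple illustrative variant, not a method taken from the review: the greedy correlation-based grouping, the 0.9 threshold, the function names, and the toy data are assumptions.

```python
from statistics import mean, pstdev

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (pstdev(x) * pstdev(y))

def group_by_correlation(columns, threshold=0.9):
    """Greedy grouping: a feature joins the first group whose seed feature
    it correlates with (in absolute value) at or above the threshold."""
    groups = []  # each group is a list of feature names; the first is its seed
    for name, values in columns.items():
        for group in groups:
            if abs(pearson(values, columns[group[0]])) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

def pick_representatives(columns, target, groups):
    """From each group, keep the feature most correlated with the target."""
    return [max(g, key=lambda f: abs(pearson(columns[f], target))) for g in groups]

# Hypothetical toy data: f2 is a noisy copy of f1, f3 is unrelated to both.
columns = {
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [1.1, 2.1, 2.9, 4.2, 5.0],
    "f3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
target = [0, 0, 1, 1, 1]

groups = group_by_correlation(columns)
representatives = pick_representatives(columns, target, groups)
print(groups, representatives)
```

Here the redundant pair `f1` and `f2` falls into one group and contributes a single feature, which is the dimensionality reduction that grouping-based FS aims for. Clustering algorithms or external knowledge sources could replace the greedy grouping step.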

Subject(s)

Algorithms

17.

Invention of 3Mint for feature grouping and scoring in multi-omics.

Unlu Yazici, Miray; Marron, J S; Bakir-Gungor, Burcu; Zou, Fei; Yousef, Malik.

Front Genet ; 14: 1093326, 2023.

Article in English | MEDLINE | ID: mdl-37007972

ABSTRACT

Advanced genomic and molecular profiling technologies have accelerated the elucidation of the regulatory mechanisms behind cancer development and progression, and of targeted therapies for patients. Along this line, intense studies with immense amounts of biological information have boosted the discovery of molecular biomarkers. Cancer has been one of the leading causes of death around the world in recent years. Elucidation of genomic and epigenetic factors in Breast Cancer (BRCA) can provide a roadmap to uncover the disease mechanisms. Accordingly, unraveling the possible systematic connections between omics data types and their contribution to BRCA tumor progression is crucial. In this study, we have developed a novel machine learning (ML) based integrative approach for multi-omics data analysis. This integrative approach combines information from gene expression (mRNA), microRNA (miRNA), and methylation data. Due to the complexity of cancer, this integrated data is expected to improve the prediction, diagnosis, and treatment of disease through patterns only available from the 3-way interactions between these 3 omics datasets. In addition, the proposed method bridges the interpretation gap between the disease mechanisms that drive onset and progression. Our fundamental contribution is the 3 Multi-omics integrative tool (3Mint). This tool aims to perform grouping and scoring of groups using biological knowledge. Another major goal is improved gene selection via detection of novel groups of cross-omics biomarkers. The performance of 3Mint is assessed using different metrics. Our computational performance evaluations showed that 3Mint classifies the BRCA molecular subtypes using fewer genes than the miRcorrNet tool, which uses miRNA and mRNA gene expression profiles, while achieving similar performance metrics (95% accuracy). The incorporation of methylation data in 3Mint yields a much more focused analysis. The 3Mint tool and all other supplementary files are available at https://github.com/malikyousef/3Mint/.

18.

A toolbox of machine learning software to support microbiome analysis.

Marcos-Zambrano, Laura Judith; López-Molina, Víctor Manuel; Bakir-Gungor, Burcu; Frohme, Marcus; Karaduzovic-Hadziabdic, Kanita; Klammsteiner, Thomas; Ibrahimi, Eliana; Lahti, Leo; Loncar-Turukalo, Tatjana; Dhamo, Xhilda; Simeon, Andrea; Nechyporenko, Alina; Pio, Gianvito; Przymus, Piotr; Sampri, Alexia; Trajkovik, Vladimir; Lacruz-Pleguezuelos, Blanca; Aasmets, Oliver; Araujo, Ricardo; Anagnostopoulos, Ioannis; Aydemir, Önder; Berland, Magali; Calle, M Luz; Ceci, Michelangelo; Duman, Hatice; Gündogdu, Aycan; Havulinna, Aki S; Kaka Bra, Kardokh Hama Najib; Kalluci, Eglantina; Karav, Sercan; Lode, Daniel; Lopes, Marta B; May, Patrick; Nap, Bram; Nedyalkova, Miroslava; Paciência, Inês; Pasic, Lejla; Pujolassos, Meritxell; Shigdel, Rajesh; Susín, Antonio; Thiele, Ines; Truica, Ciprian-Octavian; Wilmes, Paul; Yilmaz, Ercument; Yousef, Malik; Claesson, Marcus Joakim; Truu, Jaak; Carrillo de Santa Pau, Enrique.

Front Microbiol ; 14: 1250806, 2023.

Article in English | MEDLINE | ID: mdl-38075858

ABSTRACT

The human microbiome has become an area of intense research due to its potential impact on human health. However, the analysis and interpretation of microbiome data have proven challenging due to the data's complexity and high dimensionality. Machine learning (ML) algorithms can process vast amounts of data to uncover informative patterns and relationships within the data, even with limited prior knowledge. Therefore, there has been rapid growth in the development of software specifically designed for the analysis and interpretation of microbiome data using ML techniques. These software tools incorporate a wide range of ML algorithms for clustering, classification, regression, or feature selection to identify microbial patterns and relationships within the data and to generate predictive models. This rapid development, with a constant need for new developments and the integration of new features, requires efforts to compile, catalog, and classify these tools in order to create infrastructures and services with easy, transparent, and trustworthy standards. Here we review the state of the art for ML tools applied in human microbiome studies, performed as part of the COST Action ML4Microbiome activities. This scoping review focuses on ML-based software and framework resources currently available for the analysis of microbiome data in humans. The aim is to help microbiologists and biomedical scientists go deeper into specialized resources that integrate ML techniques and to facilitate future benchmarking to create standards for the analysis of microbiome data. The software resources are organized by the type of analysis they were developed for and the ML techniques they implement. A description of each software tool, with examples of usage, is provided, including comments about pitfalls and gaps in the usage of ML-based software for microbiome data that need to be considered by developers and users. This review represents an extensive compilation to date, offering valuable insights and guidance for researchers interested in leveraging ML approaches for microbiome analysis.

19.

TextNetTopics: Text Classification Based Word Grouping as Topics and Topics' Scoring.

Yousef, Malik; Voskergian, Daniel.

Front Genet ; 13: 893378, 2022.

Article in English | MEDLINE | ID: mdl-35795215

ABSTRACT

Medical document classification is an active research problem and one of the most challenging within the text classification domain. Medical datasets often contain massive feature sets in which many features are irrelevant or redundant and add noise, thus reducing classification performance. Therefore, to improve the accuracy of a classification model, it is crucial to choose a set of features (terms) that best discriminates between the classes of medical documents. This study proposes TextNetTopics, a novel approach that applies feature selection by considering Bag-of-topics (BOT) rather than the traditional Bag-of-words (BOW). Thus, our approach performs topic selection rather than word selection. TextNetTopics is based on the generic approach entitled G-S-M (Grouping, Scoring, and Modeling), developed by Yousef and his colleagues and used mainly on biological data. The proposed approach scores topics in order to select the top topics for training the classifier. This study applied TextNetTopics to textual data in response to the CAMDA challenge. TextNetTopics outperforms various feature selection approaches and performs well when the model is applied to the validation data provided by CAMDA. Additionally, we have applied our algorithm to different textual datasets.

20.

Computational Methods for Predicting Mature microRNAs.

Yousef, Malik; Parveen, Alisha; Kumar, Abhishek.

Methods Mol Biol ; 2257: 175-185, 2022.

Article in English | MEDLINE | ID: mdl-34432279

ABSTRACT

Tiny single-stranded noncoding RNAs of 19-27 nucleotides, known as microRNAs (miRNAs), have emerged as key gene regulators in the last two decades. miRNAs serve as one of the hallmarks in regulatory pathways, with critical roles in human diseases. Ever since the discovery of miRNAs, researchers have focused on how mature miRNAs are produced from precursor miRNAs (pre-miRNAs). Experimental methods face notorious challenges in terms of experimental design, since they are time-consuming and not cost-effective. Hence, different computational methods have been employed for the identification of miRNA sequences; however, most of those labeled as miRNA predictors are in fact pre-miRNA predictors and provide no information about the putative miRNA location within the pre-miRNA. This chapter provides an update on the current state of the art in this area, covering various methods and 15 software suites used for the prediction of mature miRNAs.

Subject(s)

Computational Biology, Humans, MicroRNAs/genetics, RNA Precursors, Software
