ABSTRACT
Dairy products are an important source of protein and other nutrients in the Mediterranean diet. In Mediterranean countries, the most common sources of milk for dairy products are cows, goats, sheep, and buffalo. Andalusia is traditionally the largest producer of goat milk in Spain. Kefir is a fermented product made with bacteria and yeasts and has health benefits beyond its nutritional properties. Little is known about the molecular mechanisms and metabolites that bring about these benefits. In this work, a combination of analytical techniques (GC-FID, UHPLC-MS-QToF, GC-QqQ-MS, and GC-ToF-MS) detected 105 metabolites in kefir produced from goat milk subjected to two different thermal treatments (raw and pasteurized) and fermented for four time periods (12, 24, 36, and 48 h, with 0 h as the control). Of these, 27 metabolites differed between kefir produced with raw and pasteurized milk. These changes may be caused by the effect of pasteurization on the microbial population of the starting milk. Some interesting molecules were identified, such as shikimic acid, dehydroabietic acid, GABA, and tyramine, which could be related to antibacterial properties, strengthening of the immune system, and arterial pressure. Moreover, the feasibility of using the NIRS technique to monitor fermentation and classify samples was evaluated; 90% of samples were correctly classified according to their fermentation time. This study represents the most comprehensive metabolomic analysis of goat milk kefir so far, revealing the intricate changes in metabolites during fermentation and the impact of milk treatment.
Subjects
Fermentation, Goats, Kefir, Metabolomics, Milk, Animals, Kefir/microbiology, Metabolomics/methods, Milk/metabolism, Milk/chemistry, Milk/microbiology, Hot Temperature, High Pressure Liquid Chromatography
ABSTRACT
Molecular subtyping is essential to infer tumor aggressiveness and predict prognosis. In practice, tumor profiling requires in-depth knowledge of bioinformatics tools involved in the processing and analysis of the generated data. Additionally, data incompatibility (e.g., microarray versus RNA sequencing data) and technical and uncharacterized biological variance between training and test data can pose challenges in classifying individual samples. In this article, we provide a roadmap for implementing bioinformatics frameworks for molecular profiling of human cancers in a clinical diagnostic setting. We describe a framework for integrating several methods for quality control, normalization, batch correction, classification and reporting, and develop a use case of the framework in breast cancer.
Subjects
Breast Neoplasms, Gene Expression Profiling, Humans, Female, Gene Expression Profiling/methods, Breast Neoplasms/diagnosis, Breast Neoplasms/genetics, RNA, Computational Biology/methods, Neoplastic Gene Expression Regulation
ABSTRACT
To effectively classify tree species within datasets characterized by limited samples, we introduced a novel approach named DenseNetBL, founded upon the fusion of the DenseNet architecture and a pivotal bottleneck layer. This bottleneck layer, encompassing a compact convolutional component, played a central role in our methodology. The evaluation of DenseNetBL was conducted under varying conditions, encompassing small-sample tree species data, extensive remote sensing datasets, and state-of-the-art classifiers. Furthermore, a quantitative assessment was executed to extract tree species areas. This was achieved by quantifying pixel areas within manually delineated tree species maps and classifier-generated counterparts. The findings of our study indicated that, in scenarios devoid of pre-trained weights, DenseNetBL consistently outperformed its DenseNet counterpart with equivalent layer numbers. In the realm of small-sample situations, both the Swin Transformer and Vision Transformer exhibited inferior performance when juxtaposed with DenseNet and DenseNetBL. Remarkably, among the shallow architectures, DenseNet33BL showcased superior aptitude for small-sample tree species classification, culminating in the most commendable results (Overall Accuracy (OA) = 0.901, Kappa = 0.892). Conversely, the Vision Transformer yielded the least favorable classification outcomes (OA = 0.767, Kappa = 0.708). The amalgamation of DenseNet33BL and simple linear iterative clustering emerged as the optimal strategy for attaining robust tree species area extraction results across two prototypical forests. In contrast, DenseNet121 exhibited suboptimal performance in the same forests, attaining the least satisfactory tree species area extraction results. These comprehensive findings underscore the efficacy of our DenseNetBL approach in addressing the challenges associated with small-sample tree species classification and accurate tree species area extraction.
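The two scores reported above, Overall Accuracy (OA) and Kappa, can both be derived from a confusion matrix. The following is a minimal sketch of that calculation, assuming Kappa refers to Cohen's kappa; the function name and the 3-class matrix are hypothetical illustrations and are not data from the study.

```python
def overall_accuracy_and_kappa(confusion):
    """Return (OA, Cohen's kappa) for a square confusion matrix.

    Rows are reference classes, columns are predicted classes.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix is empty")
    # Observed agreement: fraction of samples on the diagonal.
    observed = sum(confusion[i][i] for i in range(n_classes)) / total
    # Chance agreement from the row and column marginals.
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(n_classes)) for j in range(n_classes)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (total * total)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa


# Hypothetical confusion matrix for three tree species (illustration only).
confusion = [
    [45, 3, 2],
    [4, 40, 6],
    [1, 5, 44],
]
oa, kappa = overall_accuracy_and_kappa(confusion)
print(f"OA = {oa:.3f}, Kappa = {kappa:.3f}")
```

Kappa discounts the agreement expected by chance, which is why it is lower than OA for the same classifier.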
ABSTRACT
A large body of evidence shows that biomarkers are discriminant features related to disease development. Identifying disease biomarkers has therefore become a basic problem in the analysis of complex diseases in medicine, with applications such as disease staging, diagnosis, and treatment. Network-based research has become one of the most popular approaches. Several network-based algorithms have been proposed to identify biomarkers; however, networks of genes or molecules ignore the similarities and associations among samples. It is essential to further understand how to construct and optimize the networks so that the identified biomarkers are more accurate. On this basis, more effective strategies can be developed to improve the performance of biomarker identification. In this study, a multi-objective evolutionary algorithm based on sample similarity networks is proposed for disease biomarker identification. Specifically, we design sample similarity networks to extract structural information among samples, which is used to calculate the influence of each sample on each class. In addition, based on the networks and the group of biomarkers chosen in each iteration, we divide samples into classes according to their importance for each class. Then, during the population iterations of the evolutionary algorithm, we apply an elite guidance strategy and a fusion selection strategy to select the biomarkers that make sample classification more accurate. Experimental results on five gene expression datasets suggest that the proposed algorithm is superior to some state-of-the-art disease biomarker identification methods.
Subjects
Algorithms, Biomarkers
ABSTRACT
The ever-growing number of methods for the generation of synthetic bulk and single cell RNA-seq data have multiple and diverse applications. They are often aimed at benchmarking bioinformatics algorithms for purposes such as sample classification, differential expression analysis, correlation and network studies and the optimization of data integration and normalization techniques. Here, we propose a general framework to compare synthetically generated RNA-seq data and select a data-generating tool that is suitable for a set of specific study goals. As there are multiple methods for synthetic RNA-seq data generation, researchers can use the proposed framework to make an informed choice of an RNA-seq data simulation algorithm and software that are best suited for their specific scientific questions of interest.
Subjects
Algorithms, Software, RNA-Seq, RNA Sequence Analysis/methods, Computer Simulation
ABSTRACT
Soybean is sensitive to low temperatures during the growing season. Breeding cold-tolerant cultivars is urgently needed to reduce the resulting production losses. Cold tolerance is a complex quantitative trait controlled by multiple genes, environmental factors, and their interactions. In this study, we propose an advanced systems biology framework of feature engineering for the discovery of cold tolerance genes (CTgenes) from integrated omics and non-omics (OnO) data in soybean. An integrative pipeline is introduced for feature selection and feature extraction from different layers of the integrated OnO data, using data ensemble methods and non-parametric random forest prioritization to minimize uncertainties and false positives and so improve the accuracy of the results. In total, 44, 143, and 45 CTgenes were identified in short-, mid-, and long-term cold treatment, respectively, from the corresponding gene pool. These CTgenes outperformed the remaining genes, random genes, and candidate genes identified by other approaches in an independent RNA-seq database. Furthermore, we applied pathway enrichment and crosstalk network analyses to uncover relevant physiological pathways, with the discovery of underlying cold tolerance in hormone- and defense-related modules. Our CTgenes were validated using 55 SNP genotype data from 56 soybean samples in cold tolerance experiments. This suggests that the CTgenes identified by the proposed systematic framework can effectively distinguish cold-resistant from cold-sensitive lines. This is an important advance in understanding the soybean cold-stress response. The proposed pipelines provide a robust and efficient alternative for biomarker discovery, module discovery, and sample classification underlying a particular trait in plants.
ABSTRACT
Non-targeted analysis (NTA) using high-resolution mass spectrometry has enabled the detection and identification of unknown and unexpected compounds of interest in a wide range of sample matrices. Despite these benefits of NTA methods, standardized procedures do not yet exist for assessing performance, limiting stakeholders' abilities to suitably interpret and utilize NTA results. Herein, we first summarize existing performance assessment metrics for targeted analyses to provide context and clarify terminology that may be shared between targeted and NTA methods (e.g., terms such as accuracy, precision, sensitivity, and selectivity). We then discuss promising approaches for assessing NTA method performance, listing strengths and key caveats for each approach, and highlighting areas in need of further development. To structure the discussion, we define three types of NTA study objectives: sample classification, chemical identification, and chemical quantitation. Qualitative study performance (i.e., focusing on sample classification and/or chemical identification) can be assessed using the traditional confusion matrix, with some challenges and limitations. Quantitative study performance can be assessed using estimation procedures developed for targeted methods with consideration for additional sources of uncontrolled experimental error. This article is intended to stimulate discussion and further efforts to develop and improve procedures for assessing NTA method performance. Ultimately, improved performance assessments will enable accurate communication and effective utilization of NTA results by stakeholders.
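As an illustration of the confusion-matrix assessment mentioned above for qualitative studies, the sketch below derives the usual classification metrics from the four cells of a binary confusion matrix. It is a minimal example under stated assumptions: the terms are used in their classification sense (which may differ from their meaning in targeted analytical chemistry), and the function name and counts are hypothetical.

```python
def binary_confusion_metrics(tp, fp, fn, tn):
    """Compute classification metrics from binary confusion-matrix counts.

    tp/fn: truly present chemicals reported as detected / missed.
    fp/tn: truly absent chemicals reported as detected / not detected.
    A metric is None when its denominator is zero.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "sensitivity": ratio(tp, tp + fn),   # true positive rate (recall)
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "precision": ratio(tp, tp + fp),     # positive predictive value
    }


# Hypothetical spiked-sample experiment (illustration only).
metrics = binary_confusion_metrics(tp=18, fp=2, fn=2, tn=78)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

One known limitation is that true negatives are hard to enumerate in a non-targeted study, so specificity and accuracy depend on how the set of "absent" chemicals is defined.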
Subjects
Mass Spectrometry, Mass Spectrometry/methods
ABSTRACT
The diversity within microbiome communities that drive biogeochemical processes influences many different phenotypes. Analyses of these communities and their diversity by numerous microbiome projects have revealed the important role of metagenomics in understanding the complex relationship between microbes and their environments. This relationship can be understood in the context of the microbiome composition of specific known environments. These compositions can then be used as templates for predicting the status of similar environments. Machine learning has been applied as a key component of this predictive task, and several analysis tools that use machine learning methods for metagenomic analysis have already been published. Despite these models, the performance of deep neural networks remains under-researched. Given the nature of metagenomic data, deep neural networks could substantially improve prediction accuracy in metagenomic analysis applications. To meet this need, we present a deep learning based tool that uses a deep neural network for phenotypic prediction of unknown metagenomic samples. (1) First, our tool takes as input taxonomic profiles from 16S or WGS sequencing data. (2) Second, given the samples, our tool builds a deep neural network model by computing multi-level classification. (3) Lastly, given the model, our tool classifies an unknown sample from its unlabeled taxonomic profile. In benchmark experiments, we found that an analysis method using a deep neural network, such as our tool, shows promising results in increasing prediction accuracy on several samples compared with other machine learning models.
ABSTRACT
PURPOSE: In this work, an algorithm named mRBioM was developed for the identification of potential mRNA biomarkers (PmBs) from complete transcriptomic RNA profiles of gastric adenocarcinoma (GA). METHODS: mRBioM initially extracts differentially expressed (DE) RNAs (mRNAs, miRNAs, and lncRNAs). Next, mRBioM calculates the total information amount of each DE mRNA based on the coexpression network, including three types of RNAs and the protein-protein interaction network encoded by DE mRNAs. Finally, PmBs were identified according to the variation trend of total information amount of all DE mRNAs. Four PmB-based classifiers without learning and with learning were designed to discriminate the sample types to confirm the reliability of PmBs identified by mRBioM. PmB-based survival analysis was performed. Finally, three other cancer datasets were used to confirm the generalization ability of mRBioM. RESULTS: mRBioM identified 55 PmBs (41 upregulated and 14 downregulated) related to GA. The list included thirteen PmBs that have been verified as biomarkers or potential therapeutic targets of gastric cancer, and some PmBs were newly identified. Most PmBs were primarily enriched in the pathways closely related to the occurrence and development of gastric cancer. Cancer-related factors without learning achieved sensitivity, specificity, and accuracy of 0.90, 1, and 0.90, respectively, in the classification of the GA and control samples. Average accuracy, sensitivity, and specificity of the three classifiers with machine learning ranged within 0.94-0.98, 0.94-0.97, and 0.97-1, respectively. The prognostic risk score model constructed by 4 PmBs was able to correctly and significantly (∗∗∗ p < 0.001) classify 269 GA patients into the high-risk (n = 134) and low-risk (n = 135) groups. 
Classification performance equivalent to that for GA was achieved on the complete transcriptomic RNA profiles of colon adenocarcinoma, lung adenocarcinoma, and hepatocellular carcinoma using PmBs identified by mRBioM. CONCLUSIONS: GA-related PmBs have high specificity and sensitivity and strong prognostic risk prediction ability. mRBioM also generalizes well. These PmBs may have good application prospects for early diagnosis of GA and may help to elucidate the mechanism governing the occurrence and development of GA. Additionally, mRBioM is expected to be applicable to the identification of other cancer-related biomarkers.
ABSTRACT
BACKGROUND: The increasing availability of omics data collected from patients affected by severe pathologies, such as cancer, is fostering the development of data science methods for their analysis. INTRODUCTION: The combination of data integration and machine learning approaches can provide new powerful instruments to tackle the complexity of cancer development and deliver effective diagnostic and prognostic strategies. METHODS: We explore the possibility of exploiting the topological properties of sample-specific metabolic networks as features in a supervised classification task. Such networks are obtained by projecting transcriptomic data from RNA-seq experiments on genome-wide metabolic models to define weighted networks modeling the overall metabolic activity of a given sample. RESULTS: We show the classification results on a labeled breast cancer dataset from the TCGA database, including 210 samples (cancer vs. normal). In particular, we investigate how the performance is affected by a threshold-based pruning of the networks by comparing Artificial Neural Networks, Support Vector Machines and Random Forests. Interestingly, the best classification performance is achieved within a small threshold range for all methods, suggesting that it might represent an effective choice to recover useful information while filtering out noise from data. Overall, the best accuracy is achieved with SVMs, which exhibit performances similar to those obtained when gene expression profiles are used as features. CONCLUSION: These findings demonstrate that the topological properties of sample-specific metabolic networks are effective in classifying cancer and normal samples, suggesting that useful information can be extracted from a relatively limited number of features.
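The threshold-based pruning and topological feature extraction described above can be sketched as follows. This is a minimal illustration under stated assumptions: the abstract does not specify which topological properties were used, so node degree and node strength (sum of incident edge weights) stand in here, and the node names, weights, and threshold are hypothetical.

```python
def prune_network(edges, threshold):
    """Keep only the edges whose weight is at least the threshold."""
    return {pair: weight for pair, weight in edges.items() if weight >= threshold}


def topological_features(edges, nodes):
    """Return a flat feature vector [degree, strength] for each node, in node order."""
    degree = {node: 0 for node in nodes}
    strength = {node: 0.0 for node in nodes}
    for (u, v), weight in edges.items():
        for node in (u, v):
            degree[node] += 1
            strength[node] += weight
    features = []
    for node in nodes:
        features.extend([degree[node], strength[node]])
    return features


# Hypothetical sample-specific weighted network (undirected edges).
nodes = ["glc", "g6p", "pyr", "lac"]
sample_network = {
    ("glc", "g6p"): 0.9,
    ("g6p", "pyr"): 0.6,
    ("glc", "pyr"): 0.2,
    ("pyr", "lac"): 0.05,
}
pruned = prune_network(sample_network, threshold=0.5)
feature_vector = topological_features(pruned, nodes)
print(feature_vector)
```

Each sample yields one such vector, which can then be passed to any supervised classifier; varying `threshold` reproduces the kind of sweep the abstract reports.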
ABSTRACT
Migration of silver nanoparticles (AgNPs) from food containers (FCs) has been assessed for the first time using a screening method previously validated. Migration was evaluated using water and 3% acetic acid as food simulants (FSs), from 20 to 70 °C at contact times of 2 h and 10 days. Total and migrated Ag were determined by inductively coupled plasma-mass spectrometry (ICP-MS) in the FCs and FSs, respectively. Then, the screening method was validated, and probability of detection (POD) curves were constructed in both FSs to characterize the response to AgNPs. The results provided by the present screening method showed no release of AgNPs. The FSs in contact with FCs were spiked at levels above, inside and below the unreliability region, with a reliability rate (RLR) of 0.90. Asymmetric flow field flow fractionation coupled to inductively coupled plasma mass-spectrometry (AF4-ICP-MS) was used for confirmative analyses.
Subjects
Food Packaging, Mass Spectrometry/methods, Metal Nanoparticles/chemistry, Silver/analysis, Silver/chemistry, Acetic Acid/chemistry, Field Flow Fractionation, Particle Size, Reproducibility of Results, Time Factors, Water/chemistry
ABSTRACT
BACKGROUND: People with neurodegenerative disorders show diverse clinical syndromes, genetic heterogeneity, and distinct brain pathological changes, but studies report overlap between these features. DNA methylation (DNAm) provides a way to explore this overlap and heterogeneity as it is determined by the combined effects of genetic variation and the environment. In this study, we aim to identify shared blood DNAm differences between controls and people with Alzheimer's disease, amyotrophic lateral sclerosis, and Parkinson's disease. RESULTS: We use a mixed-linear model method (MOMENT) that accounts for the effect of (un)known confounders, to test for the association of each DNAm site with each disorder. While only three probes are found to be genome-wide significant in each MOMENT association analysis of amyotrophic lateral sclerosis and Parkinson's disease (and none with Alzheimer's disease), a fixed-effects meta-analysis of the three disorders results in 12 genome-wide significant differentially methylated positions. Predicted immune cell-type proportions are disrupted across all neurodegenerative disorders. Protein inflammatory markers are correlated with profile sum-scores derived from disease-associated immune cell-type proportions in a healthy aging cohort. In contrast, they are not correlated with MOMENT DNAm-derived profile sum-scores, calculated using effect sizes of the 12 differentially methylated positions as weights. CONCLUSIONS: We identify shared differentially methylated positions in whole blood between neurodegenerative disorders that point to shared pathogenic mechanisms. These shared differentially methylated positions may reflect causes or consequences of disease, but they are unlikely to reflect cell-type proportion differences.
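To make the fixed-effects meta-analysis step concrete, the sketch below pools per-disorder effect sizes for a single DNAm site with inverse-variance weights. This is the textbook fixed-effects estimator and is assumed here, since the abstract does not state the exact implementation; the effect sizes and standard errors are hypothetical.

```python
import math


def fixed_effects_meta(betas, standard_errors):
    """Inverse-variance fixed-effects meta-analysis for one site.

    Returns (pooled effect, pooled standard error, z statistic, two-sided p-value).
    """
    if len(betas) != len(standard_errors) or not betas:
        raise ValueError("need one standard error per effect size")
    weights = [1.0 / (se * se) for se in standard_errors]
    weight_sum = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, betas)) / weight_sum
    pooled_se = math.sqrt(1.0 / weight_sum)
    z = pooled / pooled_se
    # Two-sided p-value under a standard normal null distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled, pooled_se, z, p_value


# Hypothetical effects of one probe in three disorder-specific analyses.
betas = [0.02, 0.03, 0.01]
standard_errors = [0.01, 0.01, 0.01]
pooled, pooled_se, z, p_value = fixed_effects_meta(betas, standard_errors)
print(f"beta = {pooled:.4f}, SE = {pooled_se:.4f}, z = {z:.2f}, p = {p_value:.2e}")
```

Pooling shrinks the standard error, which is how a site that is not significant in any single analysis can reach significance in the meta-analysis.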
Subjects
DNA Methylation, Genetic Epigenesis, Genome-Wide Association Study, Neurodegenerative Diseases/etiology, Alleles, Biomarkers, Blood Cells/metabolism, Case-Control Studies, Disease Susceptibility, Gene Expression Profiling, Genetic Loci, Genetic Predisposition to Disease, Humans, Neurodegenerative Diseases/metabolism
ABSTRACT
BACKGROUND: Diverse microbiome communities drive biogeochemical processes and the evolution of animals in their ecosystems. Many microbiome projects have demonstrated the power of metagenomics for understanding the structures and factors that influence the function of microbiomes in their environments. To characterize the effects of microbiome composition on human health, disease, and even ecosystems, one must first understand the relationship between microbes and their environment in different samples. Running machine learning models on metagenomic sequencing data is encouraged for this purpose, but building an appropriate machine learning model for every kind of metagenomic dataset is not an easy task. RESULTS: We introduce MegaR, an R Shiny package and web application, for building an unbiased machine learning model effortlessly with interactive visual analysis. MegaR uses taxonomic profiles from either whole metagenome sequencing or 16S rRNA sequencing data to develop machine learning models and classify samples into two or more categories. It provides various options for model fine-tuning throughout the analysis pipeline, such as data processing, multiple machine learning techniques, model validation, and unknown sample prediction, which can be used to achieve the highest possible prediction accuracy for a given dataset while maintaining a user-friendly experience. CONCLUSIONS: Metagenomic sample classification and phenotype prediction are important, particularly when applied as a diagnostic method for identifying and predicting microbe-related human diseases. MegaR provides various interactive visualizations that let users build an accurate machine learning model without difficulty. Unknown sample prediction with a properly trained MegaR model will help researchers identify sample properties with a fast turnaround time.
Subjects
Machine Learning, Metagenome, Metagenomics, Humans, Phenotype, 16S Ribosomal RNA/genetics
ABSTRACT
BACKGROUND: In recent years, to investigate challenging bioinformatics problems, the utilization of multiple genomic and proteomic sources has become immensely popular among researchers. One such issue is feature or gene selection and identifying relevant and non-redundant marker genes from high dimensional gene expression data sets. In that context, designing an efficient feature selection algorithm exploiting knowledge from multiple potential biological resources may be an effective way to understand the spectrum of cancer or other diseases with applications in specific epidemiology for a particular population. RESULTS: In the current article, we design the feature selection and marker gene detection as a multi-view multi-objective clustering problem. Regarding that, we propose an Unsupervised Multi-View Multi-Objective clustering-based gene selection approach called UMVMO-select. Three important resources of biological data (gene ontology, protein interaction data, protein sequence) along with gene expression values are collectively utilized to design two different views. UMVMO-select aims to reduce gene space without/minimally compromising the sample classification efficiency and determines relevant and non-redundant gene markers from three cancer gene expression benchmark data sets. CONCLUSION: A thorough comparative analysis has been performed with five clustering and nine existing feature selection methods with respect to several internal and external validity metrics. Obtained results reveal the supremacy of the proposed method. Reported results are also validated through a proper biological significance test and heatmap plotting.
Subjects
Algorithms, Genetic Markers/genetics, Cluster Analysis, Genetic Databases, Gene Ontology, Humans, Neoplasms/genetics, Neoplasms/pathology, Protein Interaction Maps
ABSTRACT
Neural networks based on memristive devices have made great progress recently. However, the nonlinearity and asymmetry of memristive synapses seriously limit classification accuracy. Moreover, in many cases an insufficient number of training samples also reduces the classification accuracy of neural networks because of overfitting. In this work, dropout neuronal units are developed based on stochastic volatile memristive devices of Ag/Ta2O5:Ag/Pt. The memristive neural network using the dropout neuronal units effectively solves the overfitting problem and mitigates the negative effects of the nonideality of memristive synapses, eventually achieving a classification accuracy comparable to the theoretical limit. According to TEM observation and kinetic Monte Carlo simulation, the stochastic and volatile switching behavior of the Ag/Ta2O5:Ag/Pt device is attributed to the stochastic rupture of the Ag filament under high electrical stress in the Ta2O5 layer.
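For readers unfamiliar with dropout, the sketch below shows the standard software form of the technique (inverted dropout), in which each unit is randomly silenced during training. It is only a software analogue under stated assumptions: in the work above the randomness comes from the stochastic memristive devices themselves, and the function name, drop probability, and activations here are hypothetical.

```python
import random


def dropout(activations, drop_prob, rng):
    """Inverted dropout: zero each unit with probability drop_prob, rescale survivors.

    Rescaling by 1 / (1 - drop_prob) keeps the expected activation unchanged,
    so no correction is needed at inference time.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(0)  # seeded only to make the illustration repeatable
hidden = [0.5, 1.0, 1.5, 2.0]
dropped = dropout(hidden, drop_prob=0.5, rng=rng)
print(dropped)
```

Because a different random subset of units is silenced on each training step, the network cannot rely on any single unit, which is what counteracts overfitting when training samples are scarce.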
ABSTRACT
BACKGROUND: In computational biology, analyzing complex data helps to extract relevant biological information. Sample classification of gene expression data is one such popular bio-data analysis technique. However, the presence of a large number of irrelevant or redundant genes in expression data makes sample classification algorithms work inefficiently. Feature selection is a dimensionality reduction technique that helps to maximize the effectiveness of any sample classification algorithm. Recent advances in biotechnology have enriched biological data with multiple modalities or views. Different 'omics' resources capture various equally important biological properties of entities. However, most existing feature selection methodologies consider only one of the multiple biological resources. Consequently, some crucial aspects of the available biological knowledge that could further improve feature selection efficiency may be ignored. RESULTS: In the present work, we propose a Consensus Multi-View Multi-objective Clustering-based feature selection algorithm called CMVMC. Three controlled genomic and proteomic resources, namely gene expression, Gene Ontology (GO), and the protein-protein interaction network (PPIN), are used to build two independent views. The concept of multi-objective consensus clustering is applied within the proposed gene selection method to satisfy both views. The Multiple Tissues and Yeast gene expression data sets, from two different organisms (Homo sapiens and Saccharomyces cerevisiae, respectively), are chosen for the experiments. As the end product of CMVMC, a reduced set of relevant and non-redundant genes is found for each chosen data set. These genes are then used for effective sample classification.
CONCLUSIONS: The experimental study on the chosen data sets shows that the proposed feature selection method improves sample classification accuracy and significantly reduces the gene space. For the Multiple Tissues data set, CMVMC reduces the number of genes (features) from 5565 to 41, with 92.73% sample classification accuracy. For the Yeast data set, the number of genes is reduced from 2884 to 10, with 95.84% sample classification accuracy. Two internal cluster validity indices, Silhouette and Davies-Bouldin (DB), and one external validity index, Classification Accuracy (CA), are chosen for the comparative study. The reported results are further validated through a well-known biological significance test and a visualization tool.
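Of the internal validity indices named above, the Silhouette index is the simplest to compute by hand. The sketch below implements its standard definition with Euclidean distance; the distance measure, the convention of scoring singleton clusters as 0, and the toy points are assumptions for illustration and are not taken from the study.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette over all points, using Euclidean distance.

    For each point, a is the mean distance to its own cluster and b is the
    smallest mean distance to any other cluster; its score is (b - a) / max(a, b).
    """
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, (point, label) in enumerate(zip(points, labels)):
        own = [math.dist(point, q) for j, (q, l) in enumerate(zip(points, labels))
               if l == label and j != i]
        if not own:
            scores.append(0.0)  # convention for a cluster with a single member
            continue
        a = sum(own) / len(own)
        b = min(
            sum(math.dist(point, q) for q, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in clusters if other != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical samples in a reduced two-gene space (illustration only).
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
score = silhouette_score(points, labels)
print(f"silhouette = {score:.3f}")
```

Values close to 1 indicate compact, well-separated clusters, while values near 0 or below suggest overlapping or misassigned samples.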
Subjects
Algorithms, Cluster Analysis, Computational Biology/methods, Gene Expression Profiling/methods, Proteomics/methods, Humans
ABSTRACT
Cancer proteomics has become a powerful technique for characterizing the protein markers that drive malignant transformation, tracing proteome variation triggered by therapeutics, and discovering novel targets and drugs for the treatment of oncologic diseases. To facilitate cancer diagnosis and prognosis and to accelerate drug target discovery, a variety of methods for tumor marker identification and sample classification have been developed and successfully applied in cancer proteomic studies. This review describes the most recent advances in these approaches together with their current applications in cancer-related studies. First, a number of popular feature selection methods are reviewed, with an objective evaluation of their advantages and disadvantages. Second, these methods are grouped into three major classes based on their underlying algorithms. Finally, a variety of sample separation algorithms are discussed. This review provides a comprehensive overview of advances in tumor marker identification and in the separation of patients, samples, and tissues, which could guide research in cancer proteomics.
ABSTRACT
Electroanalysis of myoglobin (Mb) in 10 plasma samples of healthy donors (HDs) and 14 plasma samples of patients with acute myocardial infarction (AMI) was carried out with screen-printed electrodes modified first with multi-walled carbon nanotubes (MWCNT) and then with a molecularly imprinted polymer film (MIP), viz., myoglobin-imprinted electropolymerized poly(o-phenylenediamine). The differential pulse voltammetry (DPV) parameters, such as a maximum amplitude of reduction peak current (A, nA), a reduction peak area (S, nA × V), and a peak potential (P, V), were measured for the MWCNT/MIP-sensors after their incubation with non-diluted plasma. The relevance of the multi-parameter electrochemical data for accurate discrimination between HDs and patients with AMI was assessed on the basis of electrochemical threshold values (this requires the reference standard method (RAMP® immunoassay)) or alternatively on the basis of the computational cluster assay (this does not require any reference standard method). The multi-parameter electrochemical analysis of biosamples combined with computational cluster assay was found to provide better accuracy in classification of plasma samples to the groups of HDs or AMI patients.
Subjects
Biosensing Techniques, Myocardial Infarction/blood, Myoglobin/blood, Carbon Nanotubes/chemistry, Electrochemical Techniques, Humans, Molecular Imprinting, Myoglobin/isolation & purification, Phenylenediamines/chemistry
ABSTRACT
BACKGROUND: Classification of biological samples from gene expression data is a basic building block for solving several problems in bioinformatics, such as diagnosing cancer and other diseases and planning appropriate treatment. One major challenge in sample classification is handling high-dimensional and redundant gene expression data. Gene (feature) selection plays a major role in reducing the complexity of handling such high-dimensional data. RESULTS: This paper explores the use of biological knowledge from the Gene Ontology database to select a subset of genes that can then be used to cluster samples. The proposed feature selection technique is unsupervised, as it does not use any class label information during gene selection. Finally, a multi-objective clustering approach is applied to cluster the available samples in the reduced gene space. CONCLUSIONS: The reported results show that using biological knowledge in gene selection not only greatly reduces the dimensionality of the feature space but also improves the accuracy of sample classification. The reduced gene space is validated using strong biological significance tests. To demonstrate the superiority of the proposed gene-selection-based sample clustering technique, a thorough comparative analysis with state-of-the-art techniques has also been performed.
Subjects
Computational Biology/methods, Genes, Algorithms, Cluster Analysis, Gene Expression Profiling, Gene Ontology, Humans, Molecular Sequence Annotation, Organ Specificity/genetics, Saccharomyces cerevisiae/genetics
ABSTRACT
Microarray technology allows simultaneous measurement of the expression levels of thousands of genes within a biological tissue sample. The fundamental power of microarrays lies in the ability to conduct parallel surveys of gene expression. The classification of tissue samples based on gene expression data is an important problem in the medical diagnosis of diseases such as cancer. In gene expression data, the number of genes is usually very high compared to the number of data samples. The difficulty is therefore that the data are high-dimensional while the sample size is small. This research work addresses the problem by classifying the resulting dataset using existing algorithms such as Support Vector Machine (SVM), K-nearest neighbor (KNN), and Interval Valued Classification (IVC), as well as the improved Interval Value based Particle Swarm Optimization (IVPSO) algorithm. The results show that the IVPSO algorithm outperformed the other algorithms under several performance evaluation measures.
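As a point of reference for the baseline classifiers listed above, the sketch below implements K-nearest neighbor (KNN) classification with a majority vote. It is a minimal illustration, assuming Euclidean distance and k = 3; the two-gene expression values and class labels are hypothetical, and the IVC and IVPSO methods are not reproduced here.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training samples."""
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the number of training samples")
    # Indices of training samples sorted by Euclidean distance to the query.
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical expression levels of two genes for six tissue samples.
train_x = [(1.0, 1.2), (0.9, 1.0), (1.1, 0.8), (3.0, 3.2), (3.1, 2.9), (2.9, 3.0)]
train_y = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]

prediction = knn_predict(train_x, train_y, query=(2.8, 3.1), k=3)
print(prediction)
```

With thousands of genes and only a few samples, distances of this kind become less informative, which is one reason feature reduction is commonly applied before classifiers such as KNN.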