1.
Gigascience ; 11, 2022 03 12.
Article En | MEDLINE | ID: mdl-35277963

BACKGROUND: Feature selection is a relevant step in the analysis of single-cell RNA sequencing datasets. Most of the current feature selection methods are based on general univariate descriptors of the data such as the dispersion or the percentage of zeros. Despite the use of correction methods, the generality of these feature selection methods biases the genes selected towards highly expressed genes, instead of the genes defining the cell populations of the dataset. RESULTS: Triku is a feature selection method that favors genes defining the main cell populations. It does so by selecting genes expressed by groups of cells that are close in the k-nearest neighbor graph. The expression of these genes is higher than the expected expression if the k-cells were chosen at random. Triku efficiently recovers cell populations present in artificial and biological benchmarking datasets, based on adjusted Rand index, normalized mutual information, supervised classification, and silhouette coefficient measurements. Additionally, gene sets selected by triku are more likely to be related to relevant Gene Ontology terms and contain fewer ribosomal and mitochondrial genes. CONCLUSION: Triku is developed in Python 3 and is available at https://github.com/alexmascension/triku.


Algorithms , Benchmarking , Cluster Analysis
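
The idea in the abstract above (genes whose expression is concentrated in neighboring cells stand out against a random-neighbor baseline) can be illustrated with a small sketch. This is not triku's implementation: the function names `knn_indices` and `neighborhood_scores`, the variance-ratio statistic, and the toy data are all assumptions made for illustration only.

```python
import numpy as np

def knn_indices(embedding, k):
    """For each cell (row), return the indices of its k nearest cells (Euclidean)."""
    sq = (embedding ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * embedding @ embedding.T
    np.fill_diagonal(d2, np.inf)  # a cell is not its own neighbor
    return np.argsort(d2, axis=1)[:, :k]

def neighborhood_scores(expr, neighbors, n_random=50, seed=0):
    """Hypothetical score: variance of neighborhood-summed expression,
    divided by the same variance when the k cells are drawn at random."""
    rng = np.random.default_rng(seed)
    n_cells, k = neighbors.shape
    # expr[neighbors] has shape (cells, k, genes); sum over the k neighbors
    observed = expr[neighbors].sum(axis=1).var(axis=0)
    null = np.zeros(expr.shape[1])
    for _ in range(n_random):
        random_nb = rng.integers(0, n_cells, size=(n_cells, k))
        null += expr[random_nb].sum(axis=1).var(axis=0)
    null /= n_random
    return observed / (null + 1e-12)

# Toy data (assumed): two well-separated populations of 20 cells each.
rng = np.random.default_rng(1)
labels = np.repeat([0, 1], 20)
embedding = rng.normal(scale=0.5, size=(40, 2)) + 10.0 * labels[:, None]
marker = (labels == 0).astype(float)            # expressed only in population 0
scattered = (np.arange(40) % 2).astype(float)   # same mean, unrelated to populations
expr = np.column_stack([marker, scattered])

scores = neighborhood_scores(expr, knn_indices(embedding, k=5))
print(scores)  # the marker gene should score well above the scattered gene
```

Both toy genes have the same mean expression, so a purely univariate descriptor such as the percentage of zeros could not separate them; only the neighborhood structure does.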
2.
Data Brief ; 18: 840-845, 2018 Jun.
Article En | MEDLINE | ID: mdl-29900248

Classifying software defects according to any defined taxonomy is not straightforward. To support automating the classification of software defects, two sets of defect reports were collected from public issue tracking systems from two different real domains. Due to the lack of a domain expert, the collected defects were categorized by a set of annotators of unknown reliability according to their impact from IBM's orthogonal defect classification taxonomy. Both datasets are prepared to solve the defect classification problem by means of techniques of the learning from crowds paradigm (Hernández-González et al. [1]). Two versions of both datasets are publicly shared. In the first version, the raw data is given: the text description of defects together with the category assigned by each annotator. In the second version, the text of each defect has been transformed to a descriptive vector using text-mining techniques.

3.
Stat Methods Med Res ; 27(4): 1056-1066, 2018 04.
Article En | MEDLINE | ID: mdl-27242336

Machine learning techniques have been previously used to assist clinicians to select embryos for human-assisted reproduction. This work aims to show how an appropriate modeling of the problem can contribute to improve machine learning techniques for embryo selection. In this study, a dataset of 330 consecutive cycles (and associated embryos) carried out by the Unit of Assisted Reproduction of the Hospital Donostia (Spain) throughout 18 months has been analyzed. The problem of the embryo selection has been modeled by a novel weakly supervised paradigm, learning from label proportions, which considers all the available data, including embryos whose fate cannot be certainly established. Furthermore, all the collected features, describing cycles and embryos, have been considered in a multi-variate data analysis. Our integral solution has been successfully tested. Experimental results show that the proposed technique consistently outperforms an equivalent approach based on standard supervised classification. Embryos in this study were selected for transference according to the criteria of the Spanish Association for Reproduction Biology Studies. Obtained classification models outperform these criteria, specifically reordering medium-quality embryos.


Embryo Implantation , Machine Learning , Bayes Theorem , Databases, Factual , Humans , Reproductive Techniques, Assisted , Spain
4.
IEEE Trans Neural Netw Learn Syst ; 27(12): 2602-2614, 2016 12.
Article En | MEDLINE | ID: mdl-26625427

In recent years, the performance of semisupervised learning (SSL) has been theoretically investigated. However, most of this theoretical development has focused on binary classification problems. In this paper, we take it a step further by extending the work of Castelli and Cover to the multiclass paradigm. In particular, we consider the key problem in SSL of classifying an unseen instance x into one of K different classes, using a training data set sampled from a mixture density distribution and composed of l labeled records and u unlabeled examples. Even under the assumption of identifiability of the mixture and having infinite unlabeled examples, labeled records are needed to determine the K decision regions. Therefore, in this paper, we first investigate the minimum number of labeled examples needed to accomplish that task. Then, we propose an optimal multiclass learning algorithm, which is a generalization of the optimal procedure proposed in the literature for binary problems. Finally, we make use of this generalization to study the probability of error when the binary class constraint is relaxed.

5.
Comput Methods Programs Biomed ; 112(3): 367-97, 2013 Dec.
Article En | MEDLINE | ID: mdl-24079964

BACKGROUND: Biclustering, one of the emerging techniques for analyzing DNA microarray data, is the search for subsets of genes and conditions that are coherently expressed. These subgroups provide clues about the main biological processes. Until now, different approaches to this problem have been proposed. Most of them use the mean squared residue as the quality measure, but it cannot detect relevant and interesting patterns such as shifting or scaling patterns. Furthermore, recent papers show that there exist new coherence patterns involved in different kinds of cancer and tumors, such as inverse relationships between genes, which cannot be captured. RESULTS: The proposed measure, called Spearman's biclustering measure (SBM), estimates the quality of a bicluster based on the non-linear correlation among genes and conditions simultaneously. The search for biclusters is performed using an evolutionary technique called estimation of distribution algorithms, which uses the SBM measure as the fitness function. This approach has been examined from different points of view using artificial and real microarrays. The assessment process has involved the use of quality indexes, a set of reference bicluster patterns including new patterns, and a set of statistical tests. The performance has also been examined using real microarrays and compared with different algorithmic approaches such as Bimax, CC, OPSM, Plaid and xMotifs. CONCLUSIONS: SBM shows several advantages, such as the ability to recognize more complex coherence patterns (shifting, scaling and inversion) and the capability to selectively marginalize genes and conditions depending on the statistical significance.


Gene Expression , Algorithms , Cluster Analysis
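
A rank correlation is unaffected by shifting and scaling, and its absolute value is unaffected by inversion, which is why a Spearman-based measure can capture the patterns named in the abstract above. The sketch below illustrates that property only; it is not the published SBM definition. The helper names `rank_rows` and `mean_abs_spearman`, the restriction to gene (row) pairs, and the toy matrices are assumptions.

```python
import numpy as np

def rank_rows(M):
    """Rank the values within each row (0 = smallest); assumes no ties."""
    return np.argsort(np.argsort(M, axis=1), axis=1).astype(float)

def mean_abs_spearman(M):
    """Mean absolute Spearman correlation over all pairs of rows (genes)."""
    R = np.corrcoef(rank_rows(M))           # Pearson on ranks equals Spearman without ties
    iu = np.triu_indices(M.shape[0], k=1)   # count each pair of rows once
    return np.abs(R[iu]).mean()

base = np.array([1.0, 4.0, 2.0, 8.0, 5.0])
coherent = np.vstack([
    base,         # reference gene
    base + 3.0,   # shifting pattern
    2.0 * base,   # scaling pattern
    -base,        # inverse relationship
])
noise = np.random.default_rng(0).normal(size=(4, 5))

coherent_score = mean_abs_spearman(coherent)
noise_score = mean_abs_spearman(noise)
print(coherent_score, noise_score)
```

The coherent toy bicluster reaches the maximum score even though its rows differ by a shift, a scale factor and a sign flip, whereas a residue computed on raw values would penalize the scaled and inverted rows.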
6.
BMC Cancer ; 12: 43, 2012 Jan 26.
Article En | MEDLINE | ID: mdl-22280244

BACKGROUND: Malignancies arising in the large bowel cause the second largest number of deaths from cancer in the Western World. Despite the progress made during the last decades, colorectal cancer remains one of the most frequent and deadly neoplasias in the western countries. METHODS: A genomic study of human colorectal cancer has been carried out on a total of 31 tumoral samples, corresponding to different stages of the disease, and 33 non-tumoral samples. The study was carried out by hybridisation of the tumour samples against a reference pool of non-tumoral samples using Agilent Human 1A 60-mer oligo microarrays. The results obtained were validated by qRT-PCR. In the subsequent bioinformatics analysis, gene networks were built by means of Bayesian classifiers, variable selection and bootstrap resampling. The consensus among all the induced models produced a hierarchy of dependences and, thus, of variables. RESULTS: After an exhaustive process of pre-processing to ensure data quality--missing value imputation, probe quality, data smoothing and intraclass variability filtering--the final dataset comprised a total of 8,104 probes. Next, a supervised classification approach and data analysis were carried out to obtain the most relevant genes. Two of them are directly involved in cancer progression and in particular in colorectal cancer. Finally, a supervised classifier was induced to classify new unseen samples. CONCLUSIONS: We have developed a tentative model for the diagnosis of colorectal cancer based on a biomarker panel. Our results indicate that the gene profile described herein can discriminate between non-cancerous and cancerous samples with 94.45% accuracy using different supervised classifiers (AUC values in the range of 0.955 to 0.997).


Colorectal Neoplasms/diagnosis , Genetic Markers , Analysis of Variance , Bayes Theorem , Colorectal Neoplasms/genetics , Disease Progression , Gene Expression Profiling , Genetic Variation , Humans , Real-Time Polymerase Chain Reaction
7.
Article En | MEDLINE | ID: mdl-21393653

Progress is continuously being made in the quest for stable biomarkers linked to complex diseases. Mass spectrometers are one of the devices for tackling this problem. The data profiles they produce are noisy and unstable. In these profiles, biomarkers are detected as signal regions (peaks), where control and disease samples behave differently. Mass spectrometry (MS) data generally contain a limited number of samples described by a high number of features. In this work, we present a novel class of evolutionary algorithms, estimation of distribution algorithms (EDA), as an efficient peak selector in this MS domain. There is a trade-off between the reliability of the detected biomarkers and the low number of samples for analysis. For this reason, we introduce a consensus approach, built upon the classical EDA scheme, that improves stability and robustness of the final set of relevant peaks. An entire data workflow is designed to yield unbiased results. Four publicly available MS data sets (two MALDI-TOF and another two SELDI-TOF) are analyzed. The results are compared to the original works, and a new plot (peak frequential plot) for graphically inspecting the relevant peaks is introduced. A complete online supplementary page, which can be found at http://www.sc.ehu.es/ccwbayes/members/ruben/ms, includes extended info and results, in addition to Matlab scripts and references.


Algorithms , Biomarkers/chemistry , Computational Biology/methods , Databases, Factual , Spectrometry, Mass, Matrix-Assisted Laser Desorption-Ionization/methods , Biomarkers/analysis , Carcinoma, Hepatocellular/chemistry , Humans , Liver Neoplasms/chemistry , Stochastic Processes
8.
Methods Mol Biol ; 593: 25-48, 2010.
Article En | MEDLINE | ID: mdl-19957143

The increase in the number and complexity of biological databases has raised the need for modern and powerful data analysis tools and techniques. In order to fulfill these requirements, the machine learning discipline has become an everyday tool in bio-laboratories. The use of machine learning techniques has been extended to a wide spectrum of bioinformatics applications. It is broadly used to investigate the underlying mechanisms and interactions between biological molecules in many diseases, and it is an essential tool in any biomarker discovery process. In this chapter, we provide a basic taxonomy of machine learning algorithms, and the characteristics of main data preprocessing, supervised classification, and clustering techniques are shown. Feature selection, classifier evaluation, and two supervised classification topics that have a deep impact on current bioinformatics are presented. We make the interested reader aware of a set of popular web resources, open source software tools, and benchmarking data repositories that are frequently used by the machine learning community.


Artificial Intelligence , Computational Biology/instrumentation , Computational Biology/methods , Cluster Analysis , Databases, Factual , Electronic Data Processing , Software
9.
PLoS One ; 4(7): e6309, 2009 Jul 20.
Article En | MEDLINE | ID: mdl-19617918

Differences in gene expression patterns have been documented not only in Multiple Sclerosis patients versus healthy controls but also in the relapse of the disease. Recently a new gene expression modulator has been identified: the microRNA or miRNA. The aim of this work is to analyze the possible role of miRNAs in multiple sclerosis, focusing on the relapse stage. We have analyzed the expression patterns of 364 miRNAs in PBMC obtained from multiple sclerosis patients in relapse status, in remission status and healthy controls. The expression patterns of the miRNAs with significantly different expression were validated in an independent set of samples. In order to determine the effect of the miRNAs, the expression of some of their predicted target genes was studied by qPCR. Gene interaction networks were constructed in order to obtain a co-expression and multivariate view of the experimental data. The data analysis and later validation reveal that two miRNAs (hsa-miR-18b and hsa-miR-599) may be relevant at the time of relapse and that another miRNA (hsa-miR-96) may be involved in remission. The genes targeted by hsa-miR-96 are involved in immunological pathways such as interleukin signaling and in other pathways such as Wnt signaling. This work highlights the importance of miRNA expression in the molecular mechanisms implicated in the disease. Moreover, the proposed involvement of these small molecules in multiple sclerosis opens up a new therapeutic approach to explore and highlight some candidate biomarker targets in MS.


MicroRNAs/genetics , Monocytes/metabolism , Multiple Sclerosis/blood , Case-Control Studies , Humans , Multiple Sclerosis/genetics , Polymerase Chain Reaction , Recurrence
10.
IEEE Trans Inf Technol Biomed ; 13(3): 341-50, 2009 May.
Article En | MEDLINE | ID: mdl-19423430

Microarray-based global gene expression profiling, with the use of sophisticated statistical algorithms, is providing new insights into the pathogenesis of autoimmune diseases. We have applied a novel statistical technique for gene selection based on machine learning approaches to analyze microarray expression data gathered from patients with systemic lupus erythematosus (SLE) and primary antiphospholipid syndrome (PAPS), two autoimmune diseases of unknown genetic origin that share many common features. The methodology included a combination of three data discretization policies, a consensus gene selection method, and a multivariate correlation measurement. A set of 150 genes was found to discriminate SLE and PAPS patients from healthy individuals. Statistical validations demonstrate the relevance of this gene set from a univariate and multivariate perspective. Moreover, functional characterization of these genes identified an interferon-regulated gene signature, consistent with previous reports. It also revealed the existence of other regulatory pathways, including those regulated by PTEN, TNF, and BCL-2, which are altered in SLE and PAPS. Remarkably, a significant number of these genes carry E2F binding motifs in their promoters, projecting a role for E2F in the regulation of autoimmunity.


Antiphospholipid Syndrome/genetics , Artificial Intelligence , Gene Expression Profiling/methods , Lupus Erythematosus, Systemic/genetics , Oligonucleotide Array Sequence Analysis/methods , Analysis of Variance , Bayes Theorem , Cluster Analysis , Female , Gene Expression Regulation , Humans , Logistic Models , Models, Genetic , Reproducibility of Results , Reverse Transcriptase Polymerase Chain Reaction
11.
PLoS One ; 3(11): e3750, 2008.
Article En | MEDLINE | ID: mdl-19015733

Limb-girdle muscular dystrophy type 2A (LGMD2A) is a recessive genetic disorder caused by mutations in calpain 3 (CAPN3). Calpain 3 plays different roles in muscular cells, but little is known about its functions or in vivo substrates. The aim of this study was to identify the genes showing an altered expression in LGMD2A patients and the possible pathways they are implicated in. Ten muscle samples from LGMD2A patients in whom the molecular diagnosis was ascertained were investigated using array technology to analyze gene expression profiling as compared to ten normal muscle samples. Upregulated genes were mostly those related to extracellular matrix (different collagens), cell adhesion (fibronectin), muscle development (myosins and melusin) and signal transduction. It is therefore suggested that different proteins located or participating in the costameric region are implicated in processes regulated by calpain 3 during skeletal muscle development. Genes participating in the ubiquitin proteasome degradation pathway were found to be deregulated in LGMD2A patients, suggesting that regulation of this pathway may be under the control of calpain 3 activity. As frizzled-related protein (FRZB) is upregulated in LGMD2A muscle samples, it could be hypothesized that beta-catenin regulation is also altered at the Wnt signaling pathway, leading to an incorrect myogenesis. Conversely, expression of most transcription factor genes was downregulated (MYC, FOS and EGR1). Finally, the upregulation of IL-32 and immunoglobulin genes may induce the eosinophil chemoattraction explaining the inflammatory findings observed in presymptomatic stages. The obtained results try to shed some light on identification of novel therapeutic targets for limb-girdle muscular dystrophies.


Gene Expression Profiling , Gene Expression Regulation , Muscular Dystrophies, Limb-Girdle/genetics , Muscular Dystrophies, Limb-Girdle/metabolism , Adult , Aged , Aged, 80 and over , Calpain/genetics , Female , Glycoproteins/metabolism , Humans , Interleukins/metabolism , Intracellular Signaling Peptides and Proteins , Male , Middle Aged , Muscle Proteins/genetics , Muscles/metabolism , Signal Transduction , Wnt Proteins/metabolism , beta Catenin/metabolism
12.
BioData Min ; 1(1): 6, 2008 Sep 11.
Article En | MEDLINE | ID: mdl-18822112

Evolutionary search algorithms have become an essential asset in the algorithmic toolbox for solving high-dimensional optimization problems across a broad range of bioinformatics problems. Genetic algorithms, the most well-known and representative evolutionary search technique, have been the subject of the major part of such applications. Estimation of distribution algorithms (EDAs) offer a novel evolutionary paradigm that constitutes a natural and attractive alternative to genetic algorithms. They make use of a probabilistic model, learnt from the promising solutions, to guide the search process. In this paper, we set out a basic taxonomy of EDA techniques, underlining the nature and complexity of the probabilistic model of each EDA variant. We review a set of innovative works that make use of EDA techniques to solve challenging bioinformatics problems, emphasizing the EDA paradigm's potential for further research in this domain.
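
The loop described above (sample candidates from a probabilistic model, keep the promising ones, re-estimate the model from them) can be sketched with the simplest EDA variant, a univariate model with independent bits, often called UMDA. The OneMax objective (count of ones), the function name `umda_onemax`, the parameter values and the probability clipping are assumptions chosen for illustration, not taken from the reviewed paper.

```python
import numpy as np

def umda_onemax(n_bits=20, pop_size=100, n_select=50, n_gen=50, seed=0):
    """Univariate EDA maximizing the number of ones in a bit string."""
    rng = np.random.default_rng(seed)
    p = np.full(n_bits, 0.5)   # probabilistic model: P(bit_i = 1), bits independent
    best = None
    for _ in range(n_gen):
        # 1. Sample a population from the current model
        pop = (rng.random((pop_size, n_bits)) < p).astype(int)
        fitness = pop.sum(axis=1)
        # 2. Select the promising solutions (truncation selection)
        order = np.argsort(fitness)[::-1]
        selected = pop[order[:n_select]]
        if best is None or fitness[order[0]] > best.sum():
            best = pop[order[0]].copy()
        # 3. Re-estimate the model from the selected solutions;
        #    clipping keeps some diversity (an assumed safeguard)
        p = np.clip(selected.mean(axis=0), 0.05, 0.95)
    return best, p

best, p = umda_onemax()
print(best.sum(), np.round(p, 2))
```

Unlike a genetic algorithm, there is no crossover or mutation operator here: all variation comes from sampling the learnt model. More complex EDA variants replace the vector `p` with a model that captures dependencies between variables.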

13.
Comput Methods Programs Biomed ; 91(2): 110-21, 2008 Aug.
Article En | MEDLINE | ID: mdl-18433926

The main purpose of a gene interaction network is to map the relationships of the genes that are out of sight when a genomic study is tackled. DNA microarrays allow the expression of thousands of genes to be measured at the same time. These data constitute the numeric seed for the induction of the gene networks. In this paper, we propose a new approach to build gene networks by means of Bayesian classifiers, variable selection and bootstrap resampling. The interactions induced by the Bayesian classifiers are based both on the expression levels and on the phenotype information of the supervised variable. Feature selection and bootstrap resampling add reliability and robustness to the overall process, removing the false positive findings. The consensus among all the induced models produces a hierarchy of dependences and, thus, of variables. Biologists can define the depth level of the model hierarchy so the set of interactions and genes involved can vary from a sparse to a dense set. Experimental results show how these networks perform well on classification tasks. The biological validation matches previous biological findings and opens new hypotheses for future studies.


Algorithms , Gene Expression Profiling/methods , Genes/physiology , Models, Biological , Oligonucleotide Array Sequence Analysis/methods , Pattern Recognition, Automated/methods , Proteome/metabolism , Signal Transduction/physiology , Bayes Theorem , Computer Simulation
14.
Bioinformatics ; 23(19): 2507-17, 2007 Oct 01.
Article En | MEDLINE | ID: mdl-17720704

Feature selection techniques have become an apparent need in many bioinformatics applications. In addition to the large pool of techniques that have already been developed in the machine learning and data mining fields, specific applications in bioinformatics have led to a wealth of newly proposed techniques. In this article, we make the interested reader aware of the possibilities of feature selection, providing a basic taxonomy of feature selection techniques, and discussing their use, variety and potential in a number of both common as well as upcoming bioinformatics applications.


Algorithms , Artificial Intelligence , Computational Biology/methods , Gene Expression Profiling/methods , Models, Biological , Pattern Recognition, Automated/methods , Sequence Analysis/methods , Computer Simulation
15.
Brief Bioinform ; 7(1): 86-112, 2006 Mar.
Article En | MEDLINE | ID: mdl-16761367

This article reviews machine learning methods for bioinformatics. It presents modelling methods, such as supervised classification, clustering and probabilistic graphical models for knowledge discovery, as well as deterministic and stochastic heuristics for optimization. Applications in genomics, proteomics, systems biology, evolution and text mining are also shown.


Artificial Intelligence , Computational Biology , Models, Theoretical , Genomics , Proteomics
16.
J Biomed Inform ; 38(5): 376-88, 2005 Oct.
Article En | MEDLINE | ID: mdl-15967731

The transjugular intrahepatic portosystemic shunt (TIPS) is a treatment for cirrhotic patients with portal hypertension. A subgroup of patients dies in the first 6 months and another subgroup lives a long period of time. To date, no risk factors have been identified in order to determine how long a patient will survive. An empirical study for predicting the survival rate within the first 6 months after TIPS placement is conducted using a clinical database with 107 cases and 77 variables. Applications of Bayesian classification models, based on Bayesian networks, to medical problems have become popular in recent years. Feature subset selection is useful due to the heterogeneity of the medical databases, where not all the variables are required to perform the classification. In this paper, filter and wrapper approaches based on the feature subset selection are adapted to induce Bayesian classifiers (naive Bayes, selective naive Bayes, semi naive Bayes, tree augmented naive Bayes, and k-dependence Bayesian classifier) and are applied to distinguish between the two subgroups of cirrhotic patients. The estimated accuracies obtained tally with the results of previous studies. Moreover, the medical significance of the subset of variables selected by the classifiers along with the comprehensibility of Bayesian models is greatly appreciated by physicians.


Expert Systems , Fibrosis/mortality , Fibrosis/surgery , Outcome Assessment, Health Care/methods , Portasystemic Shunt, Transjugular Intrahepatic/mortality , Risk Assessment/methods , Survival Analysis , Bayes Theorem , Comorbidity , Decision Support Systems, Clinical , Diagnosis, Computer-Assisted/methods , Humans , Incidence , Pattern Recognition, Automated/methods , Prognosis , ROC Curve , Retrospective Studies , Risk Factors , Survival Rate , Treatment Outcome , United States/epidemiology
17.
Artif Intell Med ; 31(2): 91-103, 2004 Jun.
Article En | MEDLINE | ID: mdl-15219288

DNA microarray experiments, generating thousands of gene expression measurements, are used to collect information from tissue and cell samples regarding gene expression differences that could be useful for disease diagnosis, distinction of the specific tumor type, etc. One important application of gene expression microarray data is the classification of samples into known categories. As DNA microarray technology measures the gene expression en masse, this has resulted in data with the number of features (genes) far exceeding the number of samples. As the predictive accuracy of supervised classifiers that try to discriminate between the classes of the problem decays with the existence of irrelevant and redundant features, a dimensionality reduction process is essential. We propose the application of a gene selection process, which also enables the biology researcher to focus on promising gene candidates that actively contribute to classification in these large scale microarrays. Two basic approaches for feature selection appear in machine learning and pattern recognition literature: the filter and wrapper techniques. Filter procedures are used in most of the works in the area of DNA microarrays. In this work, a comparison between a group of different filter metrics and a wrapper sequential search procedure is carried out. The comparison is performed in two well-known DNA microarray datasets by the use of four classic supervised classifiers. The study is carried out over the original-continuous and three-intervals discretized gene expression data. While two well-known filter metrics are proposed for continuous data, four classic filter measures are used over discretized data. The same wrapper approach is used for both continuous and discretized data.
The application of filter and wrapper gene selection procedures leads to considerably better accuracy results in comparison to the non-gene selection approach, coupled with interesting and notable dimensionality reductions. Although the wrapper approach is mostly more accurate than the filter metrics, this improvement comes at a considerable computational cost. We note that most of the genes selected by the proposed filter and wrapper procedures in discrete and continuous microarray data appear in the lists of relevant-informative genes detected by previous studies over these datasets. The aim of this work is to make contributions in the field of the gene selection task in DNA microarray datasets. By an extensive comparison with more popular filter techniques, we would like to make contributions in the expansion and study of the wrapper approach in this type of domain.


Artificial Intelligence , Gene Expression Profiling , Oligonucleotide Array Sequence Analysis/methods , Selection, Genetic , Databases, Genetic , Humans
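
The filter and wrapper approaches contrasted in the abstract above differ in what they evaluate: a filter scores each gene on its own, while a wrapper scores candidate gene subsets by the accuracy of a classifier trained on them. The sketch below is a minimal illustration under assumptions: the signal-to-noise filter metric, the nearest-centroid classifier with leave-one-out accuracy, the function names and the synthetic data are all hypothetical choices, not the metrics, classifiers or datasets used in the paper.

```python
import numpy as np

def filter_rank(X, y):
    """Filter: rank genes by a per-gene signal-to-noise score; no classifier involved."""
    a, b = X[y == 0], X[y == 1]
    score = np.abs(a.mean(axis=0) - b.mean(axis=0)) / (a.std(axis=0) + b.std(axis=0) + 1e-12)
    return np.argsort(score)[::-1]

def loo_accuracy(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier (assumed classifier)."""
    hits = 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        c0 = X[mask & (y == 0)].mean(axis=0)
        c1 = X[mask & (y == 1)].mean(axis=0)
        pred = int(np.linalg.norm(X[i] - c1) < np.linalg.norm(X[i] - c0))
        hits += int(pred == y[i])
    return hits / len(y)

def wrapper_forward(X, y, max_genes=3):
    """Wrapper: sequential forward search guided by classifier accuracy."""
    selected, best_acc = [], 0.0
    while len(selected) < max_genes:
        candidates = [g for g in range(X.shape[1]) if g not in selected]
        accs = [loo_accuracy(X[:, selected + [g]], y) for g in candidates]
        j = int(np.argmax(accs))
        if accs[j] <= best_acc:   # stop when no candidate improves accuracy
            break
        selected.append(candidates[j])
        best_acc = accs[j]
    return selected, best_acc

# Synthetic data (assumed): 30 samples, 50 noise genes, gene 7 made informative.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 15)
X = rng.normal(size=(30, 50))
X[:, 7] += 6.0 * y

top_filter = filter_rank(X, y)[:3]
selected, acc = wrapper_forward(X, y)
print(top_filter, selected, acc)
```

The wrapper calls the classifier once per candidate gene at every step, which illustrates why its accuracy gains come with a much higher computational cost than a single-pass filter ranking.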
...