ABSTRACT
MOTIVATION: Collaborative analysis of massive imaging datasets is essential to enable scientific discoveries. RESULTS: We developed Cytomine to foster active and distributed collaboration of multidisciplinary teams for large-scale image-based studies. It uses web development methodologies and machine learning to readily organize, explore, share and analyze (semantically and quantitatively) multi-gigapixel imaging data over the internet. We illustrate how it has been used in several biomedical applications. AVAILABILITY AND IMPLEMENTATION: Cytomine (http://www.cytomine.be/) is freely available under an open-source license from http://github.com/cytomine/. A documentation wiki (http://doc.cytomine.be) and a demo server (http://demo.cytomine.be) are also available. CONTACT: info@cytomine.be. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subject(s)
Computer-Assisted Image Interpretation, Statistics as Topic, Internet, Software
ABSTRACT
BACKGROUND: Research in epistasis or gene-gene interaction detection for human complex traits has grown over the last few years. It has been marked by promising methodological developments, improved translation efforts of statistical epistasis to biological epistasis and attempts to integrate different omics information sources into the epistasis screening to enhance power. The quest for gene-gene interactions poses severe multiple-testing problems. In this context, the maxT algorithm is one technique to control the false-positive rate. However, the memory needed by this algorithm rises linearly with the number of hypothesis tests. Gene-gene interaction studies require memory proportional to the squared number of SNPs; a genome-wide epistasis search would therefore require terabytes of memory. Hence, cache problems are likely to occur, increasing the computation time. In this work we present a new version of maxT that requires an amount of memory independent of the number of genetic effects to be investigated. This algorithm was implemented in C++ in our epistasis screening software MBMDR-3.0.3. We evaluate the new implementation in terms of memory efficiency and speed using simulated data, and illustrate the software on real-life data for Crohn's disease. RESULTS: In the case of a binary (affected/unaffected) trait, the parallel workflow of MBMDR-3.0.3 analyzes all gene-gene interactions in a dataset of 100,000 SNPs typed on 1000 individuals within 4 days and 9 hours, using 999 permutations of the trait to assess statistical significance, on a cluster of 10 blades, each containing four Quad-Core AMD Opteron(tm) 2352 processors (2.1 GHz). In the case of a continuous trait, a similar run takes 9 days. Our program found 14 SNP-SNP interactions with a multiple-testing corrected p-value below 0.05 on real-life Crohn's disease (CD) data.
CONCLUSIONS: Our software is the first implementation of the MB-MDR methodology able to solve large-scale SNP-SNP interaction problems within a few days, without using much memory, while adequately controlling the type I error rates. A new implementation to reach genome-wide epistasis screening is under construction. In the context of Crohn's disease, MBMDR-3.0.3 could identify epistasis involving regions that are well known in the field and that can be explained from a biological point of view. This demonstrates the power of our software to find relevant phenotype-genotype higher-order associations.
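The memory argument above can be illustrated with a minimal sketch. This is not the MBMDR-3.0.3 implementation (which uses a step-down maxT and MB-MDR test statistics); it is a single-step maxT variant in Python, written so that only one maximum per permutation is stored:

```python
import numpy as np

def maxt_adjusted_pvalues(data, labels, test_stat, n_perm=999, rng=None):
    """Single-step maxT permutation correction (illustrative sketch only):
    keeping just the maximum statistic of each permutation makes memory
    use independent of the number of hypotheses tested.

    data      : predictor matrix, shape (n_individuals, n_tests)
    labels    : trait values to permute, shape (n_individuals,)
    test_stat : function(data, labels) -> statistics, shape (n_tests,)
    """
    rng = np.random.default_rng(rng)
    observed = test_stat(data, labels)
    # One number per permutation instead of an n_perm x n_tests matrix.
    perm_max = np.array([test_stat(data, rng.permutation(labels)).max()
                         for _ in range(n_perm)])
    # Adjusted p-value: fraction of permutation maxima >= the observed
    # statistic (+1 counts the unpermuted labelling itself).
    return (1 + (perm_max[None, :] >= observed[:, None]).sum(axis=1)) / (n_perm + 1)
```

Because each permutation contributes a single scalar, the permutation loop can be parallelized and the memory footprint stays constant regardless of how many SNP pairs are screened.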
Subject(s)
Algorithms, Genetic Epistasis, Software, Crohn Disease/genetics, Genetic Association Studies, Humans, Genetic Models, Single Nucleotide Polymorphism
ABSTRACT
MOTIVATION: Univariate statistical tests are widely used for biomarker discovery in bioinformatics. These procedures are simple and fast, and their output is easily interpretable by biologists, but they can only identify variables that provide a significant amount of information in isolation from the other variables. As biological processes are expected to involve complex interactions between variables, univariate methods thus potentially miss some informative biomarkers. Variable relevance scores provided by machine learning techniques, however, are potentially able to highlight multivariate interacting effects, but unlike the p-values returned by univariate tests, these relevance scores are usually not statistically interpretable. This lack of interpretability hampers the determination of a relevance threshold for extracting a feature subset from the rankings and also prevents the wide adoption of these methods by practitioners. RESULTS: We evaluated several existing and novel procedures that extract relevant features from rankings derived from machine learning approaches. These procedures replace the relevance scores with measures that can be interpreted in a statistical way, such as p-values, false discovery rates, or family-wise error rates, for which it is easier to determine a significance level. Experiments were performed on several artificial problems as well as on real microarray datasets. Although the methods differ in terms of computing times and the tradeoff they achieve between false positives and false negatives, some of them greatly help in the extraction of truly relevant biomarkers and should thus be of great practical interest for biologists and physicians. As a side conclusion, our experiments also clearly highlight that using model performance as a criterion for feature selection is often counter-productive.
AVAILABILITY AND IMPLEMENTATION: Python source codes of all tested methods, as well as the MATLAB scripts used for data simulation, can be found in the Supplementary Material.
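The general principle behind such procedures can be sketched as follows. This is a generic permutation scheme, not the exact set of methods evaluated in the paper; the relevance scorer is pluggable, so a random forest importance could replace the simple correlation stand-in used here:

```python
import numpy as np

def relevance_pvalues(score_fn, X, y, n_perm=200, rng=None):
    """Turn machine-learning relevance scores into per-feature p-values
    by permuting the outcome (a sketch of one procedure in this family).
    Shuffling y destroys any real X-y association, so the recomputed
    scores form a null distribution. score_fn(X, y) must return one
    relevance score per feature.
    """
    rng = np.random.default_rng(rng)
    observed = score_fn(X, y)
    null = np.array([score_fn(X, rng.permutation(y)) for _ in range(n_perm)])
    # p-value: fraction of null scores at least as large as the observed one.
    return (1 + (null >= observed).sum(axis=0)) / (n_perm + 1)

def abs_corr_score(X, y):
    # Stand-in relevance score: absolute correlation with the outcome.
    # A multivariate importance (e.g. from a tree ensemble) would be
    # substituted here to capture interacting effects.
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    return np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
```

The resulting p-values can then be thresholded directly, or fed into standard false discovery rate corrections, instead of picking an arbitrary cutoff on raw relevance scores.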
Subject(s)
Artificial Intelligence, Biomarkers/analysis, Computational Biology/methods, Statistical Data Interpretation, Transcriptome
ABSTRACT
OBJECTIVE: Knee osteoarthritis (OA) is a heterogeneous, complex joint pathology of unknown aetiology. Biomarkers have been widely used to investigate OA, but currently available biomarkers lack specificity and sensitivity. Therefore, novel biomarkers are needed to better understand the pathophysiological processes of OA initiation and progression. METHODS: A surface-enhanced laser desorption/ionisation time-of-flight mass spectrometry (SELDI-TOF-MS) proteomic technique was used to analyse protein expression levels in 284 serum samples from patients with knee OA classified according to Kellgren and Lawrence (K&L) score (0-4). OA serum samples were also compared to serum samples provided by healthy individuals (negative control subjects; NC; n=36) and rheumatoid arthritis (RA) patients (n=25). Proteins that gave similar signals in all K&L groups of OA patients were ignored, whereas proteins with increased or decreased levels of expression were selected for further studies. RESULTS: Two proteins were found to be expressed at higher levels in sera of OA patients at all four K&L scores compared to NC and RA, and were identified as the V65 vitronectin fragment and the C3f peptide. Of the two remaining proteins, one showed increased expression (an unknown protein at m/z 3762) and the other (identified as connective tissue-activating peptide III, CTAP-III) was decreased in K&L score >2 subsets compared to the NC, RA and K&L score 0 or 1 subsets. CONCLUSION: The authors detected four unexpected biomarkers (V65 vitronectin fragment, C3f peptide, CTAP-III and the m/z 3762 protein) that could be relevant to the pathophysiological process of OA, as they show significant correlations with parameters reflecting local inflammation and bone remodelling, as well as with decreased cartilage turnover.
Subject(s)
Blood Proteins/analysis, Knee Osteoarthritis/blood, Adult, Aged, Aged 80 and over, Rheumatoid Arthritis/blood, Rheumatoid Arthritis/diagnosis, Biomarkers/analysis, Biomarkers/blood, Case-Control Studies, Differential Diagnosis, Female, Humans, Male, Middle Aged, Knee Osteoarthritis/diagnosis, Knee Osteoarthritis/metabolism, Peptide Fragments/analysis, Peptide Fragments/blood, Proteomics/methods, Reproducibility of Results, Matrix-Assisted Laser Desorption-Ionization Mass Spectrometry/methods, Synovial Fluid/chemistry
ABSTRACT
BACKGROUND: A large variety of proteins involved in inflammation, coagulation, lipid oxidation and lipid metabolism have been associated with high-density lipoprotein (HDL), and it is anticipated that changes in the HDL proteome have implications for the multiple functions of HDL. Here, SELDI-TOF mass spectrometry (MS) was used to study the dynamic changes of HDL protein composition in a human experimental low-dose endotoxemia model. Ten healthy men with low HDL cholesterol (0.7 ± 0.1 mmol/L) and 10 men with high HDL cholesterol levels (1.9 ± 0.4 mmol/L) were challenged with endotoxin (LPS) intravenously (1 ng/kg bodyweight). We previously showed that subjects with low HDL cholesterol are more susceptible to an inflammatory challenge. The current study tested the hypothesis that this discrepancy may be related to differences in the HDL proteome. RESULTS: Plasma drawn at seven time points over a 24-hour period after LPS challenge was used for direct capture of HDL using antibodies against apolipoprotein A-I, followed by SELDI-TOF MS profiling. Upon LPS administration, profound changes in 21 markers (adjusted p-value < 0.05) were observed in the proteome in both study groups. These changes were observed 1 hour after LPS infusion and sustained up to 24 hours, but unexpectedly did not differ between the two study groups. Hierarchical clustering of the protein spectra at all time points of all individuals revealed three distinct clusters, which were largely independent of baseline HDL cholesterol levels but correlated with paraoxonase 1 activity. The acute phase protein serum amyloid A-1/2 (SAA-1/2) was clearly upregulated after LPS infusion in both groups and comprised both native and N-terminally truncated variants that were identified by two-dimensional gel electrophoresis and mass spectrometry. Individuals of one of the clusters were distinguished by a lower SAA-1/2 response after LPS challenge and a delayed time response of the truncated variants.
CONCLUSIONS: This study shows that the semi-quantitative differences in the HDL proteome as assessed by SELDI-TOF MS cannot explain why subjects with low HDL cholesterol are more susceptible to a challenge with LPS than those with high HDL cholesterol. Instead the results indicate that hierarchical clustering could be useful to predict HDL functionality in acute phase responses towards LPS.
ABSTRACT
Asthma is a complex inflammatory disease of the airways. A network of reciprocal interactions between inflammatory cells, peptide mediators, extracellular matrix components, and proteases is thought to be involved in the establishment and maintenance of asthma-related airway inflammation and remodeling. To date, new protein mediators displaying significant activity in the pathophysiology of asthma remain to be unveiled. The main objective of this study was to uncover potential target proteins by using surface-enhanced laser desorption/ionization time-of-flight mass spectrometry (SELDI-TOF-MS) on lung samples from mouse models of allergen-induced airway inflammation and remodeling. In this model, we pointed out several protein or peptide peaks that were preferentially expressed in diseased mice compared to controls. We report the identification of five different proteins: found in inflammatory zone 1 or RELM alpha (FIZZ-1), calcyclin (S100A6), Clara cell secretory protein 10 (CC10), ubiquitin, and histone H4.
Subject(s)
Asthma/metabolism, Biomarkers/metabolism, Bronchi/metabolism, Inflammation/metabolism, Animals, Asthma/physiopathology, Bronchi/chemistry, Bronchi/physiopathology, Cell Cycle Proteins/metabolism, Histones/metabolism, Inflammation/physiopathology, Intercellular Signaling Peptides and Proteins/metabolism, Mass Spectrometry, Mice, Protein Array Analysis, S100 Calcium-Binding Protein A6, S100 Proteins/metabolism, Ubiquitin/metabolism, Uteroglobin/metabolism
ABSTRACT
BACKGROUND: With the improvements in biosensors and high-throughput image acquisition technologies, life science laboratories are able to perform an increasing number of experiments that involve generating large numbers of images at different imaging modalities and scales. This stresses the need for computer vision methods that automate image classification tasks. RESULTS: We illustrate the potential of our image classification method in cell biology by evaluating it on four datasets of images related to protein distributions or subcellular localizations, and red blood cell shapes. Classification accuracy is good without any specific pre-processing or incorporation of domain knowledge. The method is implemented in Java and available upon request for evaluation and research purposes. CONCLUSION: Our method is directly applicable to any image classification problem. We foresee the use of this automatic approach as a baseline and first attempt on various biological image classification problems.
Subject(s)
Database Management Systems, Computer-Assisted Image Interpretation/methods, Computer-Assisted Image Processing/methods, Automated Pattern Recognition, Software, Algorithms, Erythrocytes, HeLa Cells, Humans, Information Storage and Retrieval, Observer Variation, Retinal Detachment, Technology
ABSTRACT
Crohn's disease and ulcerative colitis, known as inflammatory bowel diseases (IBD), are chronic immuno-inflammatory pathologies of the gastrointestinal tract. These diseases are multifactorial, polygenic and of unknown etiology. Clinical presentation is non-specific and diagnosis is based on clinical, endoscopic, radiological and histological criteria. Novel markers are needed to improve early diagnosis and classification of these pathologies. We performed a study with 120 serum samples collected from patients classified into 4 groups (30 Crohn's disease, 30 ulcerative colitis, 30 inflammatory controls and 30 healthy controls) according to accredited criteria. We compared serum protein profiles obtained with a surface-enhanced laser desorption/ionization time-of-flight mass spectrometer (SELDI-TOF-MS). Data analysis with univariate procedures and a multivariate statistical method based on multiple decision tree algorithms allowed us to select some potential biomarkers. Four of them were identified by mass spectrometry and antibody-based methods. Multivariate analysis generated models that could classify samples with good sensitivity and specificity (minimum 80%), discriminating groups of patients. This analysis was used as a tool to classify peaks according to differences in level on spectra across the four categories of patients. Four biomarkers showing important diagnostic value were purified and identified (PF4, MRP8, FIBA and Hpalpha2), and two of these, PF4 and Hpalpha2, were detected in sera by classical methods. SELDI-TOF-MS technology and the multiple decision tree method enabled the analysis of protein biomarker patterns and the selection of potential individual biomarkers. Their downstream identification may prove helpful for IBD classification and for understanding its etiology.
Subject(s)
Biomarkers/analysis, Inflammatory Bowel Diseases/diagnosis, Proteomics/methods, ATP-Binding Cassette Transporters/analysis, Humans, Inflammatory Bowel Diseases/physiopathology, Molecular Diagnostic Techniques, Osteopontin/analysis, Platelet Factor 4/analysis, Sensitivity and Specificity, Matrix-Assisted Laser Desorption-Ionization Mass Spectrometry/methods
ABSTRACT
Fine operating rules for security control and an automatic system for their online discovery were developed to adapt to the development of smart grids. The automatic system uses the real-time system state to determine critical flowgates, and a continuation power flow-based security analysis is then used to compute the initial transfer capability of critical flowgates. Next, the system applies Monte Carlo simulations of expected short-term operating-condition changes, followed by feature selection and a linear least-squares fitting of the fine operating rules. The proposed system was validated both on an academic test system and on a provincial power system in China. The results indicate that the derived rules are accurate and well interpretable, and are suitable for real-time power system security control. The use of high-performance computing systems enables these fine operating rules to be refreshed online every 15 minutes.
ABSTRACT
Networks are ubiquitous in biology, and computational approaches for their inference have been widely investigated. In particular, supervised machine learning methods can be used to complete a partially known network by integrating various measurements. Two main supervised frameworks have been proposed: the local approach, which trains a separate model for each network node, and the global approach, which trains a single model over pairs of nodes. Here, we systematically investigate, theoretically and empirically, the exploitation of tree-based ensemble methods in the context of these two approaches for biological network inference. We first formalize the problem of network inference as a classification of pairs, unifying in the process homogeneous and bipartite graphs and discussing two main sampling schemes. We then present the global and the local approaches, extending the latter for the prediction of interactions between two unseen network nodes, and discuss their specializations to tree-based ensemble methods, highlighting their interpretability and drawing links with clustering techniques. Extensive computational experiments on various biological networks show that these methods are competitive with existing approaches.
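The global approach described above can be sketched, under the simplifying assumption of a homogeneous undirected graph, as a plain pair-dataset construction:

```python
import numpy as np

def pair_dataset(node_features, adjacency, pairs):
    """'Global approach' data construction (simplified sketch): each
    training example is a node pair, its feature vector the concatenation
    of the two node feature vectors, and its label the known edge status.
    A single supervised model, e.g. a tree-based ensemble, can then be
    trained on (X_pairs, y_pairs) and used to score unseen pairs.
    """
    X_pairs = np.array([np.concatenate((node_features[i], node_features[j]))
                        for i, j in pairs])
    y_pairs = np.array([adjacency[i][j] for i, j in pairs])
    return X_pairs, y_pairs
```

For undirected graphs, each pair is usually included in both orders (or the pair features symmetrized) so that the learned scorer does not depend on node ordering; bipartite graphs use the two distinct node feature spaces instead.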
Subject(s)
Computational Biology/methods, Machine Learning, Systems Biology/methods, Algorithms, Artificial Intelligence, Computer Simulation, Signal Transduction/genetics
ABSTRACT
BACKGROUND: The purpose of the maxT algorithm is to provide a significance test procedure that controls the family-wise error rate (FWER) during simultaneous hypothesis testing. However, the computing time and memory requirements of this procedure are proportional to the number of investigated hypotheses. The memory issue was solved in 2013 by Van Lishout's implementation of maxT, which makes memory usage independent of the size of the dataset. This algorithm is implemented in MBMDR-3.0.3, software that can effectively identify genetic interactions for a variety of SNP-SNP based epistasis models. That implementation, however, turned out to be less suitable for genome-wide interaction analysis studies, due to the prohibitive computational burden. RESULTS: In this work we introduce gammaMAXT, a novel implementation of the maxT algorithm for multiple testing correction. The algorithm was implemented in the software MBMDR-4.2.2, as part of the MB-MDR framework to screen for SNP-SNP, SNP-environment or SNP-SNP-environment interactions at a genome-wide level. We show that, in the absence of interaction effects, test statistics produced by the MB-MDR methodology follow a mixture distribution with a point mass at zero and a shifted gamma distribution for the top 10% of the strictly positive values. We show that the gammaMAXT algorithm has power comparable to maxT and maintains FWER, but requires less computational resources and time. We analyzed a dataset composed of 10^6 SNPs and 1000 individuals within one day on a 256-core computer cluster; the same analysis would take about 10^4 times longer with MBMDR-3.0.3. CONCLUSIONS: These results are promising for future GWAIs. Moreover, the proposed gammaMAXT algorithm offers a general significance assessment and multiple testing approach, applicable to any context that requires performing hundreds of thousands of tests. It offers new perspectives for fast and efficient permutation-based significance assessment in large-scale (integrated) omics studies.
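The distributional idea can be sketched as follows; the method-of-moments fit below is a simple stand-in and may differ from the actual MBMDR-4.2.2 fitting procedure:

```python
import numpy as np

def fit_shifted_gamma_tail(null_stats, tail_frac=0.10):
    """Sketch of the distributional idea behind gammaMAXT: model the top
    10% of the strictly positive null statistics with a shifted gamma
    distribution. Fitting is done here by the method of moments, which
    is only an illustrative stand-in for the actual fitting procedure.

    Returns (shift, shape, scale). Extreme observed statistics can then
    be assigned analytic tail p-values from this parametric fit instead
    of being compared against an exhaustively stored permutation null.
    """
    x = np.asarray(null_stats, dtype=float)
    x = x[x > 0]                                   # drop the point mass at zero
    shift = np.quantile(x, 1.0 - tail_frac)        # tail starts at this quantile
    tail = x[x >= shift] - shift
    m, v = tail.mean(), tail.var()
    return shift, m * m / v, v / m                 # shape k = m^2/v, scale = v/m
```

Replacing the stored permutation null with three fitted parameters is what decouples the significance assessment from the number of permutations that would otherwise have to be kept in memory.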
ABSTRACT
Zebrafish is increasingly used to assess the biological properties of chemical substances and is thus becoming a specific tool for toxicological and pharmacological studies. The effects of chemical substances on embryo survival and development are generally evaluated manually through microscopic observation by an expert and documented by several typical photographs. Here, we present a methodology to automatically classify brightfield images of wild-type zebrafish embryos according to their defects by using an image analysis approach based on supervised machine learning. We show that, compared to manual classification, automatic classification results in 90 to 100% agreement with consensus voting of biological experts for nine out of eleven considered defects in 3-day-old zebrafish larvae. Automating the analysis and classification of zebrafish embryo pictures reduces the workload and time required of the biological expert and increases the reproducibility and objectivity of this classification.
Subject(s)
Zebrafish/physiology, Amiodarone/pharmacology, Animals, Automation, Non-Mammalian Embryo/drug effects, Non-Mammalian Embryo/physiology, Computer-Assisted Image Processing, Larva/drug effects, Larva/physiology, Machine Learning, Phenotype, Propranolol/pharmacology, Zebrafish/growth & development
ABSTRACT
Teleost fish such as zebrafish (Danio rerio) are increasingly used for physiological, genetic and developmental studies. Our understanding of the physiological consequences of altered gravity in an entire organism is still incomplete. We used altered gravity and drug treatment experiments to evaluate their effects specifically on bone formation and, more generally, on whole-genome gene expression. By combining morphometric tools with an objective scoring system for the state of development of each element in the head skeleton and specific gene expression analysis, we confirmed and characterized in detail the decrease or increase in bone formation caused by a 5-day treatment (from 5 dpf to 10 dpf) with parathyroid hormone (PTH) or vitamin D3 (VitD3), respectively. Microarray transcriptome analysis after 24 hours of treatment reveals a general effect on physiology upon VitD3 treatment, while PTH causes more specifically developmental effects. Exposure to hypergravity (3g from 5 dpf to 9 dpf) results in a significantly larger head and a significant increase in bone formation for a subset of the cranial bones. Gene expression analysis after 24 hours at 3g revealed differential expression of genes involved in the development and function of the skeletal, muscular, nervous, endocrine and cardiovascular systems. Finally, we propose a novel type of experimental approach, the "Reduced Gravity Paradigm", in which developing larvae are kept at 3g hypergravity for the first 5 days before being returned to 1g for one additional day. Five days of exposure to 3g during these early stages also caused increased bone formation, while gene expression analysis revealed a central network of regulatory genes (hes5, sox10, lgals3bp, egr1, edn1, fos, fosb, klf2, gadd45ba and socs3a) whose expression was consistently affected by the transition from hyper- to normal gravity.
Subject(s)
Gravitation, Hormones/physiology, Zebrafish/anatomy & histology, Zebrafish/physiology, Animals, Embryonic Development, Zebrafish/embryology
ABSTRACT
The primary goal of genome-wide association studies (GWAS) is to discover variants that could lead, in isolation or in combination, to a particular trait or disease. Standard approaches to GWAS, however, are usually based on univariate hypothesis tests and therefore can account neither for correlations due to linkage disequilibrium nor for combinations of several markers. To discover and leverage such potential multivariate interactions, we propose in this work an extension of the Random Forest algorithm tailored for structured GWAS data. In terms of risk prediction, we show empirically on several GWAS datasets that the proposed T-Trees method significantly outperforms both the original Random Forest algorithm and standard linear models, thereby suggesting the actual existence of multivariate non-linear effects due to the combinations of several SNPs. We also demonstrate that variable importances as derived from our method can help identify relevant loci. Finally, we highlight the strong impact that quality control procedures may have, both in terms of predictive power and loci identification. Variable importance results and T-Trees source code are available at www.montefiore.ulg.ac.be/~botta/ttrees/ and github.com/0asa/TTree-source, respectively.
Subject(s)
Single Nucleotide Polymorphism/genetics, Algorithms, Genetic Loci/genetics, Genome-Wide Association Study/methods, Humans, Linear Models, Linkage Disequilibrium/genetics, Genetic Models, Risk, Software
ABSTRACT
Disordered regions, i.e., regions of proteins that do not adopt a stable three-dimensional structure, have been shown to play various and critical roles in many biological processes. Predicting and understanding their formation is therefore a key sub-problem of protein structure and function inference. A wide range of machine learning approaches have been developed to automatically predict disordered regions of proteins. One key factor in the success of these methods is the way in which protein information is encoded into features. Recently, we proposed a systematic methodology to study the relevance of various feature encodings in the context of disulfide connectivity pattern prediction. In the present paper, we adapt this methodology to the problem of predicting disordered regions and assess it on proteins from the 10th CASP competition, as well as on a very large subset of proteins extracted from the PDB. Our results, obtained with ensembles of extremely randomized trees, highlight a novel feature function encoding the proximity of residues according to their accessibility to the solvent, which plays the second most important role in the prediction of disordered regions, just after evolutionary information. Furthermore, even though our approach treats each residue independently, our results are very competitive in terms of accuracy with respect to the state of the art. A web-application is available at http://m24.giga.ulg.ac.be:81/x3Disorder.
Subject(s)
Molecular Models, Protein Conformation, Protein Sequence Analysis/methods, Algorithms, Computer Simulation, Software
ABSTRACT
Disulfide bridges strongly constrain the native structure of many proteins and predicting their formation is therefore a key sub-problem of protein structure and function inference. Most recently proposed approaches for this prediction problem adopt the following pipeline: first they enrich the primary sequence with structural annotations, second they apply a binary classifier to each candidate pair of cysteines to predict disulfide bonding probabilities and finally, they use a maximum weight graph matching algorithm to derive the predicted disulfide connectivity pattern of a protein. In this paper, we adopt this three step pipeline and propose an extensive study of the relevance of various structural annotations and feature encodings. In particular, we consider five kinds of structural annotations, among which three are novel in the context of disulfide bridge prediction. So as to be usable by machine learning algorithms, these annotations must be encoded into features. For this purpose, we propose four different feature encodings based on local windows and on different kinds of histograms. The combination of structural annotations with these possible encodings leads to a large number of possible feature functions. In order to identify a minimal subset of relevant feature functions among those, we propose an efficient and interpretable feature function selection scheme, designed so as to avoid any form of overfitting. We apply this scheme on top of three supervised learning algorithms: k-nearest neighbors, support vector machines and extremely randomized trees. 
Our results indicate that the use of only the PSSM (position-specific scoring matrix) together with the CSP (cysteine separation profile) are sufficient to construct a high performance disulfide pattern predictor and that extremely randomized trees reach a disulfide pattern prediction accuracy of [Formula: see text] on the benchmark dataset SPX[Formula: see text], which corresponds to [Formula: see text] improvement over the state of the art. A web-application is available at http://m24.giga.ulg.ac.be:81/x3CysBridges.
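The third step of the pipeline, deriving a connectivity pattern from pairwise bonding probabilities, can be sketched with an exhaustive maximum-weight perfect matching. The helper below is illustrative, not the paper's implementation; the brute-force recursion is adequate only because disulfide-bonded cysteines are few per protein:

```python
def best_disulfide_pattern(prob):
    """Turn pairwise bonding probabilities into a connectivity pattern
    via maximum-weight perfect matching (exhaustive sketch; production
    code would use a dedicated graph matching algorithm).

    prob: dict mapping frozenset({i, j}) -> bonding probability, for an
          even number of cysteine indices.
    """
    cysteines = sorted({c for pair in prob for c in pair})

    def match(remaining):
        # Returns (best total weight, best pairing) for the remaining cysteines.
        if not remaining:
            return 0.0, []
        first, rest = remaining[0], remaining[1:]
        best_w, best_m = float("-inf"), []
        for k, partner in enumerate(rest):
            sub_w, sub_m = match(rest[:k] + rest[k + 1:])
            w = prob.get(frozenset((first, partner)), 0.0) + sub_w
            if w > best_w:
                best_w, best_m = w, [(first, partner)] + sub_m
        return best_w, best_m

    return match(cysteines)[1]
```

The probabilities fed into this step would come from the second-stage binary classifier scored over every candidate cysteine pair.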
Subject(s)
Algorithms, Disulfides/chemistry, Chemical Models, Protein Conformation, Proteins/chemistry, Artificial Intelligence
ABSTRACT
In this paper, we consider the batch mode reinforcement learning setting, where the central problem is to learn from a sample of trajectories a policy that satisfies or optimizes a performance criterion. We focus on the continuous state space case for which usual resolution schemes rely on function approximators either to represent the underlying control problem or to represent its value function. As an alternative to the use of function approximators, we rely on the synthesis of "artificial trajectories" from the given sample of trajectories, and show that this idea opens new avenues for designing and analyzing algorithms for batch mode reinforcement learning.
ABSTRACT
Predicting a particular cognitive state from a specific pattern of fMRI voxel values is still a methodological challenge. Decoding brain activity is usually performed in highly controlled experimental paradigms characterized by a series of distinct states induced by a temporally constrained experimental design. In more realistic conditions, the number, sequence and duration of mental states are unpredictably generated by the individual, resulting in complex and imbalanced fMRI data sets. This study tests the classification of brain activity, acquired from 16 volunteers using fMRI, during mental imagery, a condition in which the number and duration of mental events were not externally imposed but self-generated. To deal with these issues, two classification techniques were considered (Support Vector Machines, SVM, and Gaussian Processes, GP), as well as different feature extraction methods (General Linear Model, GLM, and SVM). These techniques were combined in order to identify the procedures leading to the highest accuracy measures. Our results showed that 12 data sets out of 16 could be significantly modeled by either SVM or GP. Model accuracies tended to be related to the degree of imbalance between classes and to the task performance of the volunteers. We also conclude that the GP technique tends to be more robust than SVM for modeling imbalanced data sets.
Subject(s)
Brain/diagnostic imaging, Support Vector Machine, Adult, Cognition, Female, Humans, Computer-Assisted Image Interpretation, Magnetic Resonance Imaging, Male, Normal Distribution, Radionuclide Imaging, Software, Young Adult
ABSTRACT
One of the pressing open problems of computational systems biology is the elucidation of the topology of genetic regulatory networks (GRNs) using high-throughput genomic data, in particular microarray gene expression data. The Dialogue for Reverse Engineering Assessments and Methods (DREAM) challenge aims to evaluate the success of GRN inference algorithms on benchmarks of simulated data. In this article, we present GENIE3, a new algorithm for the inference of GRNs that was the best performer in the DREAM4 In Silico Multifactorial challenge. GENIE3 decomposes the prediction of a regulatory network between p genes into p different regression problems. In each regression problem, the expression pattern of one gene (the target gene) is predicted from the expression patterns of all the other genes (the input genes), using the tree-based ensemble methods Random Forests or Extra-Trees. The importance of an input gene in the prediction of the target gene expression pattern is taken as an indication of a putative regulatory link. Putative regulatory links are then aggregated over all genes to provide a ranking of interactions from which the whole network is reconstructed. In addition to performing well on the DREAM4 In Silico Multifactorial challenge simulated data, we show that GENIE3 compares favorably with existing algorithms in deciphering the genetic regulatory network of Escherichia coli. It does not make any assumptions about the nature of gene regulation, can deal with combinatorial and non-linear interactions, produces directed GRNs, and is fast and scalable. In conclusion, we propose a new algorithm for GRN inference that performs well on both synthetic and real gene expression data. The algorithm, based on feature selection with tree-based ensemble methods, is simple and generic, making it adaptable to other types of genomic data and interactions.
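The decomposition described above can be sketched as follows. The skeleton keeps GENIE3's structure (one regression problem per target gene, per-input importances aggregated into a global ranking) but leaves the importance estimator pluggable; the absolute-correlation importance below is only a toy stand-in for the Random Forests / Extra-Trees importances GENIE3 actually uses:

```python
import numpy as np

def genie3_ranking(expr, importance_fn):
    """Skeleton of the GENIE3 decomposition: for each target gene, a
    regression problem predicts its expression from all other genes;
    per-input importance scores are taken as putative regulatory link
    weights and aggregated into one global ranking.

    expr          : expression matrix, shape (n_samples, n_genes)
    importance_fn : function(X, y) -> one importance score per column of X
    Returns (weight, regulator, target) triples, sorted by decreasing weight.
    """
    n_genes = expr.shape[1]
    links = []
    for target in range(n_genes):
        inputs = [g for g in range(n_genes) if g != target]
        scores = importance_fn(expr[:, inputs], expr[:, target])
        links += [(w, reg, target) for w, reg in zip(scores, inputs)]
    return sorted(links, reverse=True)

def abs_corr_importance(X, y):
    # Toy stand-in for a tree-ensemble importance measure.
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    return np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
```

With a tree-based importance function plugged in, the same skeleton captures non-linear and combinatorial effects and yields directed links, since the score of gene i for target j need not equal the score of gene j for target i.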
Subject(s)
Computational Biology/methods, Gene Expression Regulation, Gene Regulatory Networks, Algorithms, Escherichia coli/genetics, Oligonucleotide Array Sequence Analysis
ABSTRACT
At the intersection of artificial intelligence and statistics, supervised learning allows algorithms to automatically build predictive models from mere observations of a system. During the last twenty years, supervised learning has been a tool of choice for analyzing the ever-growing and increasingly complex data generated in molecular biology, with successful applications in genome annotation, function prediction, and biomarker discovery. Among supervised learning methods, decision tree-based methods stand out as non-parametric methods with the unique feature of combining interpretability, efficiency, and, when used in ensembles of trees, excellent accuracy. The goal of this paper is to provide an accessible and comprehensive introduction to this class of methods. The first part of the review is devoted to an intuitive but complete description of decision tree-based methods and a discussion of their strengths and limitations with respect to other supervised learning methods. The second part provides a survey of their applications in the context of computational and systems biology.