RESUMO
MOTIVATION: Sepsis is a leading cause of death and disability in children globally, accounting for â¼3 million childhood deaths per year. In pediatric sepsis patients, the multiple organ dysfunction syndrome (MODS) is considered a significant risk factor for adverse clinical outcomes characterized by high mortality and morbidity in the pediatric intensive care unit. The recent rapidly growing availability of electronic health records (EHRs) has allowed researchers to vastly develop data-driven approaches like machine learning in healthcare and achieved great successes. However, effective machine learning models which could make the accurate early prediction of the recovery in pediatric sepsis patients from MODS to a mild state and thus assist the clinicians in the decision-making process is still lacking. RESULTS: This study develops a machine learning-based approach to predict the recovery from MODS to zero or single organ dysfunction by 1 week in advance in the Swiss Pediatric Sepsis Study cohort of children with blood-culture confirmed bacteremia. Our model achieves internal validation performance on the SPSS cohort with an area under the receiver operating characteristic (AUROC) of 79.1% and area under the precision-recall curve (AUPRC) of 73.6%, and it was also externally validated on another pediatric sepsis patients cohort collected in the USA, yielding an AUROC of 76.4% and AUPRC of 72.4%. These results indicate that our model has the potential to be included into the EHRs system and contribute to patient assessment and triage in pediatric sepsis patient care. AVAILABILITY AND IMPLEMENTATION: Code available at https://github.com/BorgwardtLab/MODS-recovery. The data underlying this article is not publicly available for the privacy of individuals that participated in the study. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Assuntos
Insuficiência de Múltiplos Órgãos , Sepse , Criança , Estudos de Coortes , Humanos , Unidades de Terapia Intensiva Pediátrica , Insuficiência de Múltiplos Órgãos/diagnóstico , Insuficiência de Múltiplos Órgãos/etiologia , Curva ROC , Sepse/complicações , Sepse/diagnósticoRESUMO
Transmission chains within small urban areas (accommodating â¼30 per cent of the European population) greatly contribute to case burden and economic impact during the ongoing coronavirus pandemic and should be a focus for preventive measures to achieve containment. Here, at very high spatio-temporal resolution, we analysed determinants of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) transmission in a European urban area, Basel-City (Switzerland). We combined detailed epidemiological, intra-city mobility and socio-economic data sets with whole-genome sequencing during the first SARS-CoV-2 wave. For this, we succeeded in sequencing 44 per cent of all reported cases from Basel-City and performed phylogenetic clustering and compartmental modelling based on the dominating viral variant (B.1-C15324T; 60 per cent of cases) to identify drivers and patterns of transmission. Based on these results we simulated vaccination scenarios and corresponding healthcare system burden (intensive care unit (ICU) occupancy). Transmissions were driven by socio-economically weaker and highly mobile population groups with mostly cryptic transmissions which lacked genetic and identifiable epidemiological links. Amongst more senior population transmission was clustered. Simulated vaccination scenarios assuming 60-90 per cent transmission reduction and 70-90 per cent reduction of severe cases showed that prioritising mobile, socio-economically weaker populations for vaccination would effectively reduce case numbers. However, long-term ICU occupation would also be effectively reduced if senior population groups were prioritised, provided there were no changes in testing and prevention strategies. Reducing SARS-CoV-2 transmission through vaccination strongly depends on the efficacy of the deployed vaccine. A combined strategy of protecting risk groups by extensive testing coupled with vaccination of the drivers of transmission (i.e. highly mobile groups) would be most effective at reducing the spread of SARS-CoV-2 within an urban area.
RESUMO
Bacterial processes necessary for adaption to stressful host environments are potential targets for new antimicrobials. Here, we report large-scale transcriptomic analyses of 32 human bacterial pathogens grown under 11 stress conditions mimicking human host environments. The potential relevance of the in vitro stress conditions and responses is supported by comparisons with available in vivo transcriptomes of clinically important pathogens. Calculation of a probability score enables comparative cross-microbial analyses of the stress responses, revealing common and unique regulatory responses to different stresses, as well as overlapping processes participating in different stress responses. We identify conserved and species-specific 'universal stress responders', that is, genes showing altered expression in multiple stress conditions. Non-coding RNAs are involved in a substantial proportion of the responses. The data are collected in a freely available, interactive online resource (PATHOgenex).
Assuntos
Regulação Bacteriana da Expressão Gênica , Bactérias Gram-Negativas/genética , Bactérias Gram-Positivas/genética , RNA Bacteriano/genética , Estresse Fisiológico/genética , Transcriptoma , Adaptação Fisiológica/genética , Atlas como Assunto , Bases de Dados Genéticas , Perfilação da Expressão Gênica , Genes Bacterianos , Bactérias Gram-Negativas/classificação , Bactérias Gram-Negativas/metabolismo , Bactérias Gram-Negativas/patogenicidade , Bactérias Gram-Positivas/classificação , Bactérias Gram-Positivas/metabolismo , Bactérias Gram-Positivas/patogenicidade , Interações entre Hospedeiro e Microrganismos/genética , Humanos , Internet , Microbiota/genética , Filogenia , RNA Bacteriano/metabolismoRESUMO
Many traits, such as height, the response to a given drug, or the susceptibility to certain diseases are presumably co-determined by genetics. Especially in the field of medicine, it is of major interest to identify genetic aberrations that alter an individual's risk to develop a certain phenotypic trait. Addressing this question requires the availability of comprehensive, high-quality genetic datasets. The technological advancements and the decreasing cost of genotyping in the last decade led to an increase in such datasets. Parallel to and in line with this technological progress, an analysis framework under the name of genome-wide association studies was developed to properly collect and analyze these data. Genome-wide association studies aim at finding statistical dependencies-or associations-between a trait of interest and point-mutations in the DNA. The statistical models used to detect such associations are diverse, spanning the whole range from the frequentist to the Bayesian setting.Since genetic datasets are inherently high-dimensional, the search for associations poses not only a statistical but also a computational challenge. As a result, a variety of toolboxes and software packages have been developed, each implementing different statistical methods while using various optimizations and mathematical techniques to enhance the computations.This chapter is devoted to the discussion of widely used methods and tools in genome-wide association studies. We present the different statistical models and the assumptions on which they are based, explain peculiarities of the data that have to be accounted for and, most importantly, introduce commonly used tools and software packages for the different tasks in a genome-wide association study, complemented with examples for their application.
Assuntos
Bases de Dados Genéticas , Estudo de Associação Genômica Ampla/métodos , Modelos Genéticos , Mutação Puntual , Característica Quantitativa Herdável , Animais , HumanosRESUMO
Motivation: Methods based on summary statistics obtained from genome-wide association studies have gained considerable interest in genetics due to the computational cost and privacy advantages they present. Imputing missing summary statistics has therefore become a key procedure in many bioinformatics pipelines, but available solutions may rely on additional knowledge about the populations used in the original study and, as a result, may not always ensure feasibility or high accuracy of the imputation procedure. Results: We present ARDISS, a method to impute missing summary statistics in mixed-ethnicity cohorts through Gaussian Process Regression and automatic relevance determination. ARDISS is trained on an external reference panel and does not require information about allele frequencies of genotypes from the original study. Our method approximates the original GWAS population by a combination of samples from a reference panel relying exclusively on the summary statistics and without any external information. ARDISS successfully reconstructs the original composition of mixed-ethnicity cohorts and outperforms alternative solutions in terms of speed and imputation accuracy both for heterogeneous and homogeneous datasets. Availability and implementation: The proposed method is available at https://github.com/BorgwardtLab/ARDISS. Supplementary information: Supplementary data are available at Bioinformatics online.
Assuntos
Etnicidade/genética , Estudos de Coortes , Frequência do Gene , Estudo de Associação Genômica Ampla/métodos , Genótipo , Humanos , SoftwareRESUMO
The ever-growing availability of high-quality genotypes for a multitude of species has enabled researchers to explore the underlying genetic architecture of complex phenotypes at an unprecedented level of detail using genome-wide association studies (GWAS). The systematic comparison of results obtained from GWAS of different traits opens up new possibilities, including the analysis of pleiotropic effects. Other advantages that result from the integration of multiple GWAS are the ability to replicate GWAS signals and to increase statistical power to detect such signals through meta-analyses. In order to facilitate the simple comparison of GWAS results, we present easyGWAS, a powerful, species-independent online resource for computing, storing, sharing, annotating, and comparing GWAS. The easyGWAS tool supports multiple species, the uploading of private genotype data and summary statistics of existing GWAS, as well as advanced methods for comparing GWAS results across different experiments and data sets in an interactive and user-friendly interface. easyGWAS is also a public data repository for GWAS data and summary statistics and already includes published data and results from several major GWAS. We demonstrate the potential of easyGWAS with a case study of the model organism Arabidopsis thaliana, using flowering and growth-related traits.
Assuntos
Biologia Computacional/métodos , Genoma de Planta/genética , Estudo de Associação Genômica Ampla/métodos , Polimorfismo de Nucleotídeo Único , Arabidopsis/genética , Arabidopsis/crescimento & desenvolvimento , Flores/genética , Flores/crescimento & desenvolvimento , Genótipo , Humanos , Fenótipo , Reprodutibilidade dos Testes , Software , Interface Usuário-ComputadorRESUMO
The ubiquity of nonparental hybrid phenotypes, such as hybrid vigor and hybrid inferiority, has interested biologists for over a century and is of considerable agricultural importance. Although examples of both phenomena have been subject to intense investigation, no general model for the molecular basis of nonadditive genetic variance has emerged, and prediction of hybrid phenotypes from parental information continues to be a challenge. Here we explore the genetics of hybrid phenotype in 435 Arabidopsis thaliana individuals derived from intercrosses of 30 parents in a half diallel mating scheme. We find that nonadditive genetic effects are a major component of genetic variation in this population and that the genetic basis of hybrid phenotype can be mapped using genome-wide association (GWA) techniques. Significant loci together can explain as much as 20% of phenotypic variation in the surveyed population and include examples that have both classical dominant and overdominant effects. One candidate region inherited dominantly in the half diallel contains the gene for the MADS-box transcription factor AGAMOUS-LIKE 50 (AGL50), which we show directly to alter flowering time in the predicted manner. Our study not only illustrates the promise of GWA approaches to dissect the genetic architecture underpinning hybrid performance but also demonstrates the contribution of classical dominance to genetic variance.
Assuntos
Arabidopsis/genética , Vigor Híbrido/genética , Cruzamentos Genéticos , Variação Genética , Hibridização Genética , FenótipoRESUMO
Understanding how new species form requires investigation of evolutionary forces that cause phenotypic and genotypic changes among populations. However, the mechanisms underlying speciation vary and little is known about whether genomes diversify in the same ways in parallel at the incipient scale. We address this using the nematode, Pristionchus pacificus, which resides at an interesting point on the speciation continuum (distinct evolutionary lineages without reproductive isolation), and inhabits heterogeneous environments subject to divergent environmental pressures. Using whole genome re-sequencing of 264 strains, we estimate FST to identify outlier regions of extraordinary differentiation (â¼1.725 Mb of the 172.5 Mb genome). We find evidence for shared divergent genomic regions occurring at a higher frequency than expected by chance among populations of the same evolutionary lineage. We use allele frequency spectra to find that, among lineages, 53% of divergent regions are consistent with adaptive selection, whereas 24% and 23% of such regions suggest background selection and restricted gene flow, respectively. In contrast, among populations from the same lineage, similar proportions (34-48%) of divergent regions correspond to adaptive selection and restricted gene flow, whereas 13-22% suggest background selection. Because speciation often involves phenotypic and genomic divergence, we also evaluate phenotypic variation, focusing on pH tolerance, which we find is diverging in a manner corresponding to environmental differences among populations. Taking a genome-wide association approach, we functionally validate a significant genotype-phenotype association for this trait. Our results are consistent with P. pacificus undergoing heterogeneous genotypic and phenotypic diversification related to both evolutionary and environmental processes.
Assuntos
Rabditídios/genética , Animais , Evolução Biológica , Evolução Molecular , Fluxo Gênico , Frequência do Gene , Estudos de Associação Genética , Especiação Genética , Variação Genética , Isolamento Reprodutivo , Seleção Genética , TranscriptomaRESUMO
Prioritizing missense variants for further experimental investigation is a key challenge in current sequencing studies for exploring complex and Mendelian diseases. A large number of in silico tools have been employed for the task of pathogenicity prediction, including PolyPhen-2, SIFT, FatHMM, MutationTaster-2, MutationAssessor, Combined Annotation Dependent Depletion, LRT, phyloP, and GERP++, as well as optimized methods of combining tool scores, such as Condel and Logit. Due to the wealth of these methods, an important practical question to answer is which of these tools generalize best, that is, correctly predict the pathogenic character of new variants. We here demonstrate in a study of 10 tools on five datasets that such a comparative evaluation of these tools is hindered by two types of circularity: they arise due to (1) the same variants or (2) different variants from the same protein occurring both in the datasets used for training and for evaluation of these tools, which may lead to overly optimistic results. We show that comparative evaluations of predictors that do not address these types of circularity may erroneously conclude that circularity confounded tools are most accurate among all tools, and may even outperform optimized combinations of tools.
Assuntos
Biologia Computacional/métodos , Mutação de Sentido Incorreto , Software , Conjuntos de Dados como Assunto , Humanos , Internet , Reprodutibilidade dos Testes , NavegadorRESUMO
MOTIVATION: As an increasing number of genome-wide association studies reveal the limitations of the attempt to explain phenotypic heritability by single genetic loci, there is a recent focus on associating complex phenotypes with sets of genetic loci. Although several methods for multi-locus mapping have been proposed, it is often unclear how to relate the detected loci to the growing knowledge about gene pathways and networks. The few methods that take biological pathways or networks into account are either restricted to investigating a limited number of predetermined sets of loci or do not scale to genome-wide settings. RESULTS: We present SConES, a new efficient method to discover sets of genetic loci that are maximally associated with a phenotype while being connected in an underlying network. Our approach is based on a minimum cut reformulation of the problem of selecting features under sparsity and connectivity constraints, which can be solved exactly and rapidly. SConES outperforms state-of-the-art competitors in terms of runtime, scales to hundreds of thousands of genetic loci and exhibits higher power in detecting causal SNPs in simulation studies than other methods. On flowering time phenotypes and genotypes from Arabidopsis thaliana, SConES detects loci that enable accurate phenotype prediction and that are supported by the literature. AVAILABILITY: Code is available at http://webdav.tuebingen.mpg.de/u/karsten/Forschung/scones/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Assuntos
Loci Gênicos , Estudo de Associação Genômica Ampla/métodos , Fenótipo , Polimorfismo de Nucleotídeo Único , Arabidopsis/genética , Arabidopsis/crescimento & desenvolvimento , Flores , Genótipo , HumanosRESUMO
Due to recent advances in genotyping technologies, mapping phenotypes to single loci in the genome has become a standard technique in statistical genetics. However, one-locus mapping fails to explain much of the phenotypic variance in complex traits. Here, we present GLIDE, which maps phenotypes to pairs of genetic loci and systematically searches for the epistatic interactions expected to reveal part of this missing heritability. GLIDE makes use of the computational power of consumer-grade graphics cards to detect such interactions via linear regression. This enabled us to conduct a systematic two-locus mapping study on seven disease data sets from the Wellcome Trust Case Control Consortium and on in-house hippocampal volume data in 6 h per data set, while current single CPU-based approaches require more than a year's time to complete the same task.
Assuntos
Mapeamento Cromossômico/métodos , Biologia Computacional/métodos , Epistasia Genética , Predisposição Genética para Doença , Transtorno Bipolar/diagnóstico , Transtorno Bipolar/epidemiologia , Transtorno Bipolar/genética , Bases de Dados Factuais , Loci Gênicos , Genética Populacional/métodos , Estudo de Associação Genômica Ampla , Hipocampo/anatomia & histologia , Humanos , Modelos Lineares , Tamanho do Órgão , Fenótipo , Polimorfismo de Nucleotídeo Único , Reprodutibilidade dos Testes , Fatores de TempoRESUMO
MOTIVATION: Accurate large-scale phenotyping has recently gained considerable importance in biology. For example, in genome-wide association studies technological advances have rendered genotyping cheap, leaving phenotype acquisition as the major bottleneck. Automatic image analysis is one major strategy to phenotype individuals in large numbers. Current approaches for visual phenotyping focus predominantly on summarizing statistics and geometric measures, such as height and width of an individual, or color histograms and patterns. However, more subtle, but biologically informative phenotypes, such as the local deformation of the shape of an individual with respect to the population mean cannot be automatically extracted and quantified by current techniques. RESULTS: We propose a probabilistic machine learning model that allows for the extraction of deformation phenotypes from biological images, making them available as quantitative traits for downstream analysis. Our approach jointly models a collection of images using a learned common template that is mapped onto each image through a deformable smooth transformation. In a case study, we analyze the shape deformations of 388 guppy fish (Poecilia reticulata). We find that the flexible shape phenotypes our model extracts are complementary to basic geometric measures. Moreover, these quantitative traits assort the observations into distinct groups and can be mapped to polymorphic genetic loci of the sample set. AVAILABILITY: Code is available under: http://bioweb.me/GEBI CONTACT: theofanis.karaletsos@tuebingen.mpg.de; oliver.stegle@tuebingen.mpg.de SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Assuntos
Inteligência Artificial , Processamento de Imagem Assistida por Computador/métodos , Reconhecimento Automatizado de Padrão/métodos , Fenótipo , Animais , Análise por Conglomerados , Biologia Computacional/métodos , Masculino , Cadeias de Markov , Modelos Estatísticos , PoeciliaRESUMO
Understanding the regulatory mechanisms that are responsible for an organism's response to environmental change is an important issue in molecular biology. A first and important step towards this goal is to detect genes whose expression levels are affected by altered external conditions. A range of methods to test for differential gene expression, both in static as well as in time-course experiments, have been proposed. While these tests answer the question whether a gene is differentially expressed, they do not explicitly address the question when a gene is differentially expressed, although this information may provide insights into the course and causal structure of regulatory programs. In this article, we propose a two-sample test for identifying intervals of differential gene expression in microarray time series. Our approach is based on Gaussian process regression, can deal with arbitrary numbers of replicates, and is robust with respect to outliers. We apply our algorithm to study the response of Arabidopsis thaliana genes to an infection by a fungal pathogen using a microarray time series dataset covering 30,336 gene probes at 24 observed time points. In classification experiments, our test compares favorably with existing methods and provides additional insights into time-dependent differential expression.
Assuntos
Arabidopsis/genética , Perfilação da Expressão Gênica , Regulação da Expressão Gênica de Plantas , Análise de Sequência com Séries de Oligonucleotídeos/métodos , Arabidopsis/microbiologia , Área Sob a Curva , Teorema de Bayes , Biologia Computacional , Genes de Plantas/genética , Modelos Genéticos , Família Multigênica/genética , Distribuição Normal , Fatores de TempoRESUMO
MOTIVATION: Synthetic lethal interactions represent pairs of genes whose individual mutations are not lethal, while the double mutation of both genes does incur lethality. Several studies have shown a correlation between functional similarity of genes and their distances in networks based on synthetic lethal interactions. However, there is a lack of algorithms for predicting gene function from synthetic lethality interaction networks. RESULTS: In this article, we present a novel technique called kernelROD for gene function prediction from synthetic lethal interaction networks based on kernel machines. We apply our novel algorithm to Gene Ontology functional annotation prediction in yeast. Our experiments show that our method leads to improved gene function prediction compared with state-of-the-art competitors and that combining genetic and congruence networks leads to a further improvement in prediction accuracy.
Assuntos
Genes Letais , Genômica/métodos , Algoritmos , Redes Reguladoras de Genes , Genoma Fúngico , Saccharomyces cerevisiae/genética , Proteínas de Saccharomyces cerevisiae/genéticaRESUMO
It is widely believed that comparing discrepancies in the protein-protein interaction (PPI) networks of individuals will become an important tool in understanding and preventing diseases. Currently PPI networks for individuals are not available, but gene expression data is becoming easier to obtain and allows us to represent individuals by a co-integrated gene expression/protein interaction network. Two major problems hamper the application of graph kernels - state-of-the-art methods for whole-graph comparison - to compare PPI networks. First, these methods do not scale to graphs of the size of a PPI network. Second, missing edges in these interaction networks are biologically relevant for detecting discrepancies, yet, these methods do not take this into account. In this article we present graph kernels for biological network comparison that are fast to compute and take into account missing interactions. We evaluate their practical performance on two datasets of co-integrated gene expression/PPI networks.
Assuntos
Mapeamento de Interação de Proteínas/estatística & dados numéricos , Biologia Computacional , Bases de Dados Genéticas , Progressão da Doença , Perfilação da Expressão Gênica/estatística & dados numéricos , Humanos , Prognóstico , Análise Serial de Proteínas/estatística & dados numéricosRESUMO
MOTIVATION: Identifying significant genes among thousands of sequences on a microarray is a central challenge for cancer research in bioinformatics. The ultimate goal is to detect the genes that are involved in disease outbreak and progression. A multitude of methods have been proposed for this task of feature selection, yet the selected gene lists differ greatly between different methods. To accomplish biologically meaningful gene selection from microarray data, we have to understand the theoretical connections and the differences between these methods. In this article, we define a kernel-based framework for feature selection based on the Hilbert-Schmidt independence criterion and backward elimination, called BAHSIC. We show that several well-known feature selectors are instances of BAHSIC, thereby clarifying their relationship. Furthermore, by choosing a different kernel, BAHSIC allows us to easily define novel feature selection algorithms. As a further advantage, feature selection via BAHSIC works directly on multiclass problems. RESULTS: In a broad experimental evaluation, the members of the BAHSIC family reach high levels of accuracy and robustness when compared to other feature selection techniques. Experiments show that features selected with a linear kernel provide the best classification performance in general, but if strong non-linearities are present in the data then non-linear kernels can be more suitable. AVAILABILITY: Accompanying homepage is http://www.dbs.ifi.lmu.de/~borgward/BAHSIC. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Assuntos
Algoritmos , Biomarcadores Tumorais/análise , Diagnóstico por Computador/métodos , Perfilação da Expressão Gênica/métodos , Proteínas de Neoplasias/análise , Neoplasias/diagnóstico , Neoplasias/metabolismo , Humanos , Análise de Sequência com Séries de Oligonucleotídeos/métodos , Sensibilidade e EspecificidadeRESUMO
We present a kernel-based approach to the classification of time series of gene expression profiles. Our method takes into account the dynamic evolution over time as well as the temporal characteristics of the data. More specifically, we model the evolution of the gene expression profiles as a Linear Time Invariant (LTI) dynamical system and estimate its model parameters. A kernel on dynamical systems is then used to classify these time series. We successfully test our approach on a published dataset to predict response to drug therapy in Multiple Sclerosis patients. For pharmacogenomics, our method offers a huge potential for advanced computational tools in disease diagnosis, and disease and drug therapy outcome prognosis.
Assuntos
Perfilação da Expressão Gênica/estatística & dados numéricos , Inteligência Artificial , Biologia Computacional , Bases de Dados Genéticas , Humanos , Modelos Lineares , Modelos Genéticos , Esclerose Múltipla/tratamento farmacológico , Esclerose Múltipla/genética , Análise de Sequência com Séries de Oligonucleotídeos/estatística & dados numéricos , Farmacogenética/estatística & dados numéricos , Fatores de TempoRESUMO
MOTIVATION: Many problems in data integration in bioinformatics can be posed as one common question: Are two sets of observations generated by the same distribution? We propose a kernel-based statistical test for this problem, based on the fact that two distributions are different if and only if there exists at least one function having different expectation on the two distributions. Consequently we use the maximum discrepancy between function means as the basis of a test statistic. The Maximum Mean Discrepancy (MMD) can take advantage of the kernel trick, which allows us to apply it not only to vectors, but strings, sequences, graphs, and other common structured data types arising in molecular biology. RESULTS: We study the practical feasibility of an MMD-based test on three central data integration tasks: Testing cross-platform comparability of microarray data, cancer diagnosis, and data-content based schema matching for two different protein function classification schemas. In all of these experiments, including high-dimensional ones, MMD is very accurate in finding samples that were generated from the same distribution, and outperforms its best competitors. CONCLUSIONS: We have defined a novel statistical test of whether two samples are from the same distribution, compatible with both multivariate and structured data, that is fast, easy to implement, and works well, as confirmed by our experiments. AVAILABILITY: http://www.dbs.ifi.lmu.de/~borgward/MMD.
Assuntos
Algoritmos , Biologia Computacional/métodos , Interpretação Estatística de Dados , Bases de Dados Factuais , Armazenamento e Recuperação da Informação/métodos , Modelos Biológicos , Modelos Estatísticos , Simulação por Computador , Tamanho da Amostra , Distribuições Estatísticas , Integração de SistemasRESUMO
MOTIVATION: Computational approaches to protein function prediction infer protein function by finding proteins with similar sequence, structure, surface clefts, chemical properties, amino acid motifs, interaction partners or phylogenetic profiles. We present a new approach that combines sequential, structural and chemical information into one graph model of proteins. We predict functional class membership of enzymes and non-enzymes using graph kernels and support vector machine classification on these protein graphs. RESULTS: Our graph model, derivable from protein sequence and structure only, is competitive with vector models that require additional protein information, such as the size of surface pockets. If we include this extra information into our graph model, our classifier yields significantly higher accuracy levels than the vector models. Hyperkernels allow us to select and to optimally combine the most relevant node attributes in our protein graphs. We have laid the foundation for a protein function prediction system that integrates protein information from various sources efficiently and effectively. AVAILABILITY: More information available via www.dbs.ifi.lmu.de/Mitarbeiter/borgwardt.html.