Search | BVS Enfermagem Research Portal

Assessment of machine learning reliability methods for quantifying the applicability domain of QSAR regression models.

Toplak, Marko; Mocnik, Rok; Polajnar, Matija; Bosnic, Zoran; Carlsson, Lars; Hasselgren, Catrin; Demsar, Janez; Boyer, Scott; Zupan, Blaz; Stålring, Jonna.

J Chem Inf Model ; 54(2): 431-41, 2014 Feb 24.

Article in English | MEDLINE | ID: mdl-24490838

ABSTRACT

The vastness of chemical space and the relatively small coverage by experimental data recording molecular properties require us to identify subspaces, or domains, for which we can confidently apply QSAR models. The prediction of QSAR models in these domains is reliable, and potential subsequent investigations of such compounds would find that the predictions closely match the experimental values. Standard approaches in QSAR assume that predictions are more reliable for compounds that are "similar" to those in subspaces with denser experimental data. Here, we report on a study of an alternative set of techniques recently proposed in the machine learning community. These methods quantify prediction confidence through estimation of the prediction error at the point of interest. Our study includes 20 public QSAR data sets with continuous response and assesses the quality of 10 reliability scoring methods by observing their correlation with prediction error. We show that these new alternative approaches can outperform standard reliability scores that rely only on similarity to compounds in the training set. The results also indicate that the quality of reliability scoring methods is sensitive to data set characteristics and to the regression method used in QSAR. We demonstrate that at the cost of increased computational complexity these dependencies can be leveraged by integration of scores from various reliability estimation approaches. The reliability estimation techniques described in this paper have been implemented in an open source add-on package ( https://bitbucket.org/biolab/orange-reliability ) to the Orange data mining suite.

Subjects

Artificial Intelligence; Drug Discovery/methods; Quantitative Structure-Activity Relationship; Algorithms; Regression Analysis; Time Factors
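
The evaluation idea in the abstract above (judging a reliability score by how well it correlates with prediction error) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not say which correlation measure or which similarity score was used, so a Spearman rank correlation and a mean distance to the k nearest training points stand in for them. The function names are hypothetical and are not taken from the orange-reliability package.

```python
import math


def mean_knn_distance(x, train_points, k=3):
    """Illustrative similarity-based reliability score (assumption):
    mean Euclidean distance from x to its k nearest training points.
    Larger values mean x lies further from the training data."""
    nearest = sorted(math.dist(x, t) for t in train_points)[:k]
    return sum(nearest) / len(nearest)


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for m in range(i, j + 1):
            result[order[m]] = shared
        i = j + 1
    return result


def pearson(a, b):
    """Pearson correlation; returns NaN if either input is constant."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return float("nan")
    return cov / math.sqrt(var_a * var_b)


def spearman(scores, abs_errors):
    """Rank correlation between reliability scores and absolute errors."""
    return pearson(ranks(scores), ranks(abs_errors))
```

Under this convention, a useful distance-based score gives a clearly positive correlation: compounds far from the training set tend to have larger absolute errors. A call such as `spearman(scores, abs_errors)` on held-out predictions would be the comparison point between scoring methods.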

Cell-type specificity of ChIP-predicted transcription factor binding sites.

Håndstad, Tony; Rye, Morten; Mocnik, Rok; Drabløs, Finn; Sætrom, Pål.

BMC Genomics ; 13: 372, 2012 Aug 03.

Article in English | MEDLINE | ID: mdl-22863112

ABSTRACT

BACKGROUND: Context-dependent transcription factor (TF) binding is one reason for differences in gene expression patterns between different cellular states. Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) identifies genome-wide TF binding sites for one particular context: the cells used in the experiment. But can such ChIP-seq data predict TF binding in other cellular contexts and is it possible to distinguish context-dependent from ubiquitous TF binding? RESULTS: We compared ChIP-seq data on TF binding for multiple TFs in two different cell types and found that on average only a third of ChIP-seq peak regions are common to both cell types. Expectedly, common peaks occur more frequently in certain genomic contexts, such as CpG-rich promoters, whereas chromatin differences characterize cell-type specific TF binding. We also find, however, that genotype differences between the cell types can explain differences in binding. Moreover, ChIP-seq signal intensity and peak clustering are the strongest predictors of common peaks. Compared with strong peaks located in regions containing peaks for multiple transcription factors, weak and isolated peaks are less common between the cell types and are less associated with data that indicate regulatory activity. CONCLUSIONS: Together, the results suggest that experimental noise is prevalent among weak peaks, whereas strong and clustered peaks represent high-confidence binding events that often occur in other cellular contexts. Nevertheless, 30-40% of the strongest and most clustered peaks show context-dependent regulation. We show that by combining signal intensity with additional data, ranging from context-independent information such as binding site conservation and position weight matrix scores to context-dependent chromatin structure, we can predict whether a ChIP-seq peak is likely to be present in other cellular contexts.

Subjects

Binding Sites/genetics; Chromatin Immunoprecipitation/methods; High-Throughput Nucleotide Sequencing/methods; Transcription Factors/genetics; Transcription Factors/metabolism; Base Sequence; Cell Line, Tumor; Chromatin/genetics; Chromatin/metabolism; DNA/genetics; DNA/metabolism; Gene Expression; Gene Regulatory Networks; Genotype; HeLa Cells; Histones/genetics; Humans; Polymorphism, Single Nucleotide; Regulatory Sequences, Nucleic Acid; Sequence Analysis, DNA
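
The "common peaks" comparison in the abstract above can be illustrated with a small sketch that computes the fraction of peak regions in one cell type that overlap a peak in another. This is only an assumed formulation: the abstract does not state the overlap criterion used in the study, so any shared base on the same chromosome counts as an overlap here. The function names and the peak coordinates are invented for illustration.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def common_fraction(peaks_a, peaks_b):
    """Fraction of peaks in peaks_a overlapping at least one peak in peaks_b.

    Quadratic in the number of peaks; fine for a sketch, but real peak
    sets would call for sorted intervals or an interval tree.
    """
    if not peaks_a:
        return 0.0
    hits = sum(any(overlaps(p, q) for q in peaks_b) for p in peaks_a)
    return hits / len(peaks_a)


# Made-up peaks for two hypothetical cell types.
cell_type_1 = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
cell_type_2 = [("chr1", 150, 250), ("chr2", 300, 400)]
shared = common_fraction(cell_type_1, cell_type_2)
```

Note that the measure is asymmetric: `common_fraction(a, b)` and `common_fraction(b, a)` generally differ when the two peak sets have different sizes.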

The eGenVar data management system--cataloguing and sharing sensitive data and metadata for the life sciences.

Razick, Sabry; Mocnik, Rok; Thomas, Laurent F; Ryeng, Einar; Drabløs, Finn; Sætrom, Pål.

Database (Oxford) ; 2014: bau027, 2014.

Article in English | MEDLINE | ID: mdl-24682735

ABSTRACT

Systematic data management and controlled data sharing aim at increasing reproducibility, reducing redundancy in work, and providing a way to efficiently locate complementing or contradicting information. One method of achieving this is collecting data in a central repository or in a location that is part of a federated system and providing interfaces to the data. However, certain data, such as data from biobanks or clinical studies, may, for legal and privacy reasons, often not be stored in public repositories. Instead, we describe a metadata cataloguing system and a software suite for reporting the presence of data from the life sciences domain. The system stores three types of metadata: file information, file provenance and data lineage, and content descriptions. Our software suite includes both graphical and command line interfaces that allow users to report and tag files with these different metadata types. Importantly, the files remain in their original locations with their existing access-control mechanisms in place, while our system provides descriptions of their contents and relationships. Our system and software suite thereby provide a common framework for cataloguing and sharing both public and private data. Database URL: http://bigr.medisin.ntnu.no/data/eGenVar/.

Subjects

Biological Science Disciplines; Database Management Systems; Databases, Genetic; Information Dissemination; Software; Biological Ontologies; Search Engine; Terminology as Topic
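
The cataloguing approach in the abstract above (three metadata types stored about files that stay in their original locations) can be pictured with a minimal data-structure sketch. This is not the eGenVar schema or API: the class names, fields, and methods below are hypothetical, chosen only to show file information, lineage, and content descriptions side by side.

```python
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    # File information: where the file lives; the file itself is never copied.
    path: str
    size_bytes: int
    checksum: str
    # Provenance and lineage: paths of the files this one was derived from.
    derived_from: list = field(default_factory=list)
    # Content description: free-form key/value tags.
    tags: dict = field(default_factory=dict)


class Catalogue:
    """Hypothetical in-memory catalogue holding metadata only."""

    def __init__(self):
        self._records = {}

    def report(self, record):
        """Register (or update) the metadata for one file."""
        self._records[record.path] = record

    def find_by_tag(self, key, value):
        """Paths of all reported files carrying the given tag value."""
        return sorted(
            path for path, rec in self._records.items()
            if rec.tags.get(key) == value
        )

    def lineage(self, path):
        """All known ancestors of a file, nearest first."""
        seen = []
        stack = list(self._records[path].derived_from)
        while stack:
            parent = stack.pop()
            if parent in seen:
                continue
            seen.append(parent)
            if parent in self._records:
                stack.extend(self._records[parent].derived_from)
        return seen
```

Because the catalogue stores only descriptions, access to the underlying files is still governed by whatever access control already protects their original locations.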
