Results 1 - 18 of 18
1.
Gigascience ; 8(5)2019 05 01.
Article in English | MEDLINE | ID: mdl-31029061

ABSTRACT

BACKGROUND: The complex nature of biological data has driven the development of specialized software tools. Scientific workflow management systems simplify the assembly of such tools into pipelines, assist with job automation, and aid reproducibility of analyses. Many contemporary workflow tools are specialized or not designed for highly complex workflows, such as those with nested loops, dynamic scheduling, and parametrization, which is common in, e.g., machine learning. FINDINGS: SciPipe is a workflow programming library implemented in the programming language Go, for managing complex and dynamic pipelines in bioinformatics, cheminformatics, and other fields. SciPipe helps in particular with workflow constructs common in machine learning, such as extensive branching, parameter sweeps, and dynamic scheduling and parametrization of downstream tasks. SciPipe builds on flow-based programming principles to support agile development of workflows based on a library of self-contained, reusable components. It supports running subsets of workflows for improved iterative development and provides a data-centric audit logging feature that saves a full audit trace for every output file of a workflow, which can be converted to other formats such as HTML, TeX, and PDF on demand. The utility of SciPipe is demonstrated with a machine learning pipeline, a genomics pipeline, and a transcriptomics pipeline. CONCLUSIONS: SciPipe provides a solution for agile development of complex and dynamic pipelines, especially in machine learning, through a flexible application programming interface suitable for scientists used to programming or scripting.
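SciPipe itself is a Go library; as a language-agnostic sketch of the flow-based idea the abstract describes (self-contained components wired together, with an audit record kept for every output), here is a minimal Python illustration. The process names and the audit format are hypothetical, not SciPipe's API.

```python
# Minimal flow-based-style pipeline sketch (hypothetical, not the SciPipe
# API): each "process" is a function that consumes upstream outputs and
# records an audit entry for every artifact it produces.

audit = {}  # output name -> full upstream provenance trace

def run_process(name, inputs, fn):
    """Run one self-contained component and log its provenance."""
    result = fn(*(out for out, _ in inputs))
    audit[name] = {"process": name,
                   "upstream": [audit.get(n, n) for _, n in inputs]}
    return result, name

# A tiny three-step pipeline: generate -> transform -> summarize.
raw = run_process("raw", [], lambda: list(range(5)))
doubled = run_process("doubled", [raw], lambda xs: [2 * x for x in xs])
total = run_process("total", [doubled], lambda xs: sum(xs))

print(total[0])        # 20
print(audit["total"])  # full audit trace back to "raw"
```

The audit dictionary plays the role of SciPipe's per-output audit trace: every result carries the complete chain of processes that produced it.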


Subjects
Computational Biology , Genomics , Software , Gene Library , Machine Learning , Programming Languages , Workflow
2.
Front Pharmacol ; 9: 1256, 2018.
Article in English | MEDLINE | ID: mdl-30459617

ABSTRACT

Ligand-based models can be used in drug discovery to obtain an early indication of potential off-target interactions that could be linked to adverse effects. Another application is to combine such models into a panel, allowing to compare and search for compounds with similar profiles. Most contemporary methods and implementations however lack valid measures of confidence in their predictions, and only provide point predictions. We here describe a methodology that uses Conformal Prediction for predicting off-target interactions, with models trained on data from 31 targets in the ExCAPE-DB dataset selected for their utility in broad early hazard assessment. Chemicals were represented by the signature molecular descriptor and support vector machines were used as the underlying machine learning method. By using conformal prediction, the results from predictions come in the form of confidence p-values for each class. The full pre-processing and model training process is openly available as scientific workflows on GitHub, rendering it fully reproducible. We illustrate the usefulness of the developed methodology on a set of compounds extracted from DrugBank. The resulting models are published online and are available via a graphical web interface and an OpenAPI interface for programmatic access.
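The per-class confidence p-values mentioned above come from the standard inductive conformal prediction recipe: the p-value for a candidate class is the fraction of calibration nonconformity scores at least as large as the test example's score. A minimal sketch with made-up scores (not the paper's SVM models):

```python
# Inductive conformal classification sketch (illustrative values):
# p-value = fraction of calibration examples at least as nonconforming.

def conformal_p_value(cal_scores, test_score):
    """Smoothed-free conformal p-value from a calibration set."""
    n_ge = sum(1 for s in cal_scores if s >= test_score)
    return (n_ge + 1) / (len(cal_scores) + 1)

# Hypothetical per-class calibration nonconformity scores
# (e.g. derived from an SVM decision function):
cal = {"active":   [0.1, 0.2, 0.4, 0.8, 0.9],
       "inactive": [0.1, 0.3, 0.5, 0.6, 0.7]}
test = {"active": 0.25, "inactive": 0.85}  # test nonconformity vs each class

p = {c: conformal_p_value(cal[c], test[c]) for c in cal}

# At significance eps = 0.2, the prediction set keeps classes with p > eps:
pred_set = [c for c in p if p[c] > 0.2]
print(p, pred_set)
```

A prediction set may contain both classes (low confidence) or be empty; that set-valued output is exactly the "valid measure of confidence" the abstract contrasts with bare point predictions.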

3.
J Cheminform ; 10(1): 49, 2018 Oct 11.
Article in English | MEDLINE | ID: mdl-30306349

ABSTRACT

Ligand-based predictive modeling is widely used to generate predictive models aiding decision making in, e.g., drug discovery projects. With growing data sets and requirements on low modeling time comes the necessity to analyze data sets efficiently to support rapid and robust modeling. In this study we analyzed four data sets and studied the efficiency of machine learning methods on sparse data structures, utilizing Morgan fingerprints of different radii and hash sizes and comparing them with the molecular signatures descriptor of different heights. We specifically evaluated the effect these parameters had on modeling time, predictive performance, and memory requirements using two implementations of random forest, Scikit-learn and FEST. We also compared with a support vector machine implementation. Our results showed that unhashed fingerprints yield significantly better accuracy than hashed fingerprints ([Formula: see text]), with no pronounced deterioration in modeling time and memory usage. Furthermore, the fast execution and low memory usage of the FEST algorithm suggest that it is a good alternative for large, high-dimensional sparse data. Both support vector machines and random forest performed equally well, but results indicate that the support vector machine was better at using the extra information from larger values of the Morgan fingerprint's radius.
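The hashed-versus-unhashed distinction above is about folding: an unhashed fingerprint keeps one dimension per distinct substructure identifier, while hashing folds identifiers into a fixed-size bit vector, so two different substructures can collide on the same bit. A sketch with hypothetical substructure ids (not RDKit's Morgan implementation):

```python
# Hashed vs. unhashed fingerprint sketch (hypothetical feature ids):
# folding loses information whenever two ids map to the same bit.

def fold(feature_ids, n_bits):
    """Fold arbitrary integer substructure ids into an n_bits bit vector."""
    bits = [0] * n_bits
    for f in feature_ids:
        bits[f % n_bits] = 1
    return bits

features = [3, 1027, 2051, 77]  # hypothetical ids for one molecule

unhashed = set(features)       # sparse, collision-free representation
hashed = fold(features, 1024)  # 1024-bit folded fingerprint

# 3, 1027, and 2051 all land on bit 3 of a 1024-bit vector: a collision
# the unhashed representation avoids.
print(sum(hashed))  # 2 bits set, from 4 distinct features
```

This is why the abstract can report better accuracy for unhashed fingerprints: no information is destroyed by collisions, at the price of an unbounded (but sparse) feature space.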

4.
J Cheminform ; 10(1): 17, 2018 Apr 03.
Article in English | MEDLINE | ID: mdl-29616425

ABSTRACT

Lipophilicity is a major determinant of ADMET properties and overall suitability of drug candidates. We have developed large-scale models to predict the water-octanol distribution coefficient (logD) for chemical compounds, aiding drug discovery projects. Using ACD/logD data for 1.6 million compounds from the ChEMBL database, models are created and evaluated by a support-vector machine with a linear kernel using conformal prediction methodology, outputting prediction intervals at a specified confidence level. The resulting model shows a predictive ability of [Formula: see text], with the best-performing nonconformity measure having a median prediction interval of [Formula: see text] log units at 80% confidence and [Formula: see text] log units at 90% confidence. The model is available as an online service via an OpenAPI interface and a web page with a molecular editor, and we also publish predicted values at the 90% confidence level for 91 million PubChem structures in RDF format, both for download and as a URI resolver service.
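The prediction intervals described above follow the usual conformal regression recipe: the interval half-width at confidence level 1 - eps is a quantile of the calibration set's absolute residuals, so higher confidence yields wider intervals. A sketch with illustrative residuals (not the paper's logD model):

```python
# Conformal regression interval sketch (illustrative numbers):
# the half-width is a calibration-residual quantile.
import math

def interval(point_pred, abs_residuals, confidence):
    """Symmetric conformal prediction interval around a point estimate."""
    rs = sorted(abs_residuals)
    # Index of the smallest residual covering ceil(confidence * (n + 1))
    # calibration points.
    k = min(len(rs) - 1, math.ceil(confidence * (len(rs) + 1)) - 1)
    half = rs[k]
    return point_pred - half, point_pred + half

residuals = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]  # |y - y_hat|
lo, hi = interval(2.0, residuals, 0.80)
print((lo, hi))  # the 90% interval around the same point is wider
```

The point prediction itself comes from the underlying model (here, the linear-kernel SVM); conformal prediction only supplies the calibrated interval around it.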

5.
J Cheminform ; 9(1): 33, 2017 Jun 06.
Article in English | MEDLINE | ID: mdl-29086040

ABSTRACT

BACKGROUND: The Chemistry Development Kit (CDK) is a widely used open source cheminformatics toolkit, providing data structures to represent chemical concepts along with methods to manipulate such structures and perform computations on them. The library implements a wide variety of cheminformatics algorithms ranging from chemical structure canonicalization to molecular descriptor calculations and pharmacophore perception. It is used in drug discovery, metabolomics, and toxicology. Over the last 10 years, however, the code base has grown significantly, resulting in many complex interdependencies among components and poor performance of many algorithms. RESULTS: We report improvements to the CDK v2.0 since the v1.2 release series, specifically addressing the increased functional complexity and poor performance. We first summarize the addition of new functionality, such as atom typing and molecular formula handling, and improvements to existing functionality that have led to significantly better performance for substructure searching, molecular fingerprints, and rendering of molecules. Second, we outline how the CDK has evolved with respect to quality control and the approaches we have adopted to ensure stability, including a code review mechanism. CONCLUSIONS: This paper highlights our continued efforts to provide a community driven, open source cheminformatics library, and shows that such collaborative projects can thrive over extended periods of time, resulting in a high-quality and performant library. By taking advantage of community support and contributions, we show that an open source cheminformatics project can act as a peer reviewed publishing platform for scientific computing software. Graphical abstract: CDK 2.0 provides new features and improved performance.

7.
J Cheminform ; 8: 67, 2016.
Article in English | MEDLINE | ID: mdl-27942268

ABSTRACT

Predictive modelling in drug discovery is challenging to automate, as it often contains multiple analysis steps and might involve cross-validation and parameter tuning that create complex dependencies between tasks. With large-scale data or when using computationally demanding modelling methods, e-infrastructures such as high-performance or cloud computing are required, adding to the existing challenges of fault-tolerant automation. Workflow management systems can aid in many of these challenges, but the currently available systems lack the functionality needed to enable agile and flexible predictive modelling. We here present an approach inspired by elements of the flow-based programming paradigm, implemented as an extension of the Luigi system, which we name SciLuigi. We also discuss the experiences from using the approach when modelling a large set of biochemical interactions using a shared computer cluster.
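The "complex dependencies between tasks" above are what Luigi-style systems resolve: each task declares what it requires, and the scheduler runs prerequisites first, exactly once. A bare pure-Python sketch of that idea (the task names are hypothetical, and SciLuigi's real API additionally wires Luigi tasks through named ports):

```python
# Task-dependency sketch in plain Python (hypothetical task names;
# not the Luigi/SciLuigi API). Each task lists its prerequisites and
# build() runs the dependency chain depth-first, each task once.

done = []

class Task:
    requires = []
    def run(self):
        raise NotImplementedError
    def build(self):
        """Run prerequisites first, then this task, exactly once."""
        for dep in self.requires:
            dep().build()
        if type(self).__name__ not in done:
            self.run()
            done.append(type(self).__name__)

class PreprocessData(Task):
    def run(self): pass

class TrainModel(Task):
    requires = [PreprocessData]
    def run(self): pass

class CrossValidate(Task):
    requires = [TrainModel]
    def run(self): pass

CrossValidate().build()
print(done)  # ['PreprocessData', 'TrainModel', 'CrossValidate']
```

In a real workflow engine each task would also declare file outputs, so completed steps are skipped on re-runs, which is what makes the automation fault-tolerant.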

8.
J Cheminform ; 8: 39, 2016.
Article in English | MEDLINE | ID: mdl-27516811

ABSTRACT

The increasing size of datasets in drug discovery makes it challenging to build robust and accurate predictive models within a reasonable amount of time. In order to investigate the effect of dataset sizes on predictive performance and modelling time, ligand-based regression models were trained on open datasets of varying sizes of up to 1.2 million chemical structures. For modelling, two implementations of support vector machines (SVM) were used. Chemical structures were described by the signatures molecular descriptor. Results showed that for the larger datasets, the LIBLINEAR SVM implementation performed on par with the well-established libsvm with a radial basis function kernel, but with dramatically less time for model building even on modest computer resources. Using a non-linear kernel proved to be infeasible for large data sizes, even with substantial computational resources on a computer cluster. To deploy the resulting models, we extended the Bioclipse decision support framework to support models from LIBLINEAR and made our models of logD and solubility available from within Bioclipse.
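The scaling advantage of LIBLINEAR reported above comes from the linear model itself: prediction (and training) reduces to sparse dot products, so cost grows with the number of nonzero features per molecule, not the full descriptor dimension, whereas a nonlinear kernel must relate the query to many support vectors. A sketch of the linear case (illustrative weights, not LIBLINEAR's code):

```python
# Why a linear SVM scales on sparse descriptors (sketch):
# the decision value is a single sparse dot product.

def predict(weights, sparse_x, bias=0.0):
    """Linear decision for a {feature_index: count} signature vector."""
    score = bias + sum(weights.get(i, 0.0) * v for i, v in sparse_x.items())
    return 1 if score >= 0 else -1

w = {0: 0.5, 3: -1.0, 42: 2.0}  # learned weights (illustrative)
x = {3: 1, 42: 1, 100000: 2}    # sparse signature counts for one molecule

print(predict(w, x))  # only 3 nonzeros touched; dimension is irrelevant
```

This is also why deploying such models in Bioclipse is cheap: a LIBLINEAR model is just the weight vector, with no training data retained.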

9.
J Lab Autom ; 21(1): 178-87, 2016 Feb.
Article in English | MEDLINE | ID: mdl-26246423

ABSTRACT

Although medical cancer treatment has improved during the past decades, it is difficult to choose between several first-line treatments supposed to be equally active in the diagnostic group. It is even more difficult to select a treatment after the standard protocols have failed. Any guidance for selection of the most effective treatment is valuable at these critical stages. We describe the principles and procedures for ex vivo assessment of drug activity in tumor cells from patients as a basis for tailored cancer treatment. Patient tumor cells are assayed for cytotoxicity with a panel of drugs. Acoustic drug dispensing provides great flexibility in the selection of drugs for testing; currently, up to 80 compounds and/or combinations thereof may be tested for each patient. Drug response predictions are obtained by classification using an empirical model based on historical responses for the diagnosis. The laboratory workflow is supported by an integrated system that enables rapid analysis and automatic generation of the clinical referral response.


Subjects
Antineoplastic Agents/pharmacology , Cytological Techniques/methods , Drug Screening Assays, Antitumor/methods , Acoustics , Cell Survival/drug effects , Cells, Cultured , Humans , Neoplasms
10.
J Chem Inf Model ; 55(1): 19-25, 2015 Jan 26.
Article in English | MEDLINE | ID: mdl-25493610

ABSTRACT

Growing data sets, and the increased time needed to analyze them, are hampering predictive modeling in drug discovery. Model building can be carried out on high-performance computer clusters, but these can be expensive to purchase and maintain. We have evaluated ligand-based modeling on cloud computing resources where computations are parallelized and run on the Amazon Elastic Cloud. We trained models on open data sets of varying sizes for the end points logP and Ames mutagenicity and compared this with model building parallelized on a traditional high-performance computing cluster. We show that while high-performance computing results in faster model building, the use of cloud computing resources is feasible for large data sets and scales well within cloud instances. An additional advantage of cloud computing is that the costs of predictive models can be easily quantified, and a choice can be made between speed and economy. The easy access to computational resources with no up-front investments makes cloud computing an attractive alternative for scientists, especially for those without access to a supercomputer, and our study shows that it enables cost-efficient modeling of large data sets on demand within reasonable time.


Subjects
Computational Biology/methods , Computing Methodologies , Databases, Chemical , Drug Discovery/methods , Quantitative Structure-Activity Relationship , Databases, Factual , Internet , Ligands , Software
11.
J Chem Inf Model ; 54(11): 3211-7, 2014 Nov 24.
Article in English | MEDLINE | ID: mdl-25318024

ABSTRACT

QSAR modeling using molecular signatures and support vector machines with a radial basis function is increasingly used for virtual screening in the drug discovery field. This method has three free parameters: C, γ, and signature height. C is a penalty parameter that limits overfitting, γ controls the width of the radial basis function kernel, and the signature height determines how much of the molecule is described by each atom signature. Determination of optimal values for these parameters is time-consuming. Good default values could therefore save considerable computational cost. The goal of this project was to investigate whether such default values could be found by using seven public QSAR data sets spanning a wide range of end points and using both a bit version and a count version of the molecular signatures. On the basis of the experiments performed, we recommend a parameter set of heights 0 to 2 for the count version of the signature fingerprints and heights 0 to 3 for the bit version. These are in combination with a support vector machine using C in the range of 1 to 100 and γ in the range of 0.001 to 0.1. When data sets are small or longer run times are not a problem, then there is reason to consider the addition of height 3 to the count fingerprint and a wider grid search. However, marked improvements should not be expected.
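The role of γ described above can be made concrete: the RBF kernel is k(x, z) = exp(-γ·||x - z||²), so small γ keeps even distant molecules similar while large γ makes the kernel sharply local. A sketch evaluating the kernel over the grid ranges the abstract recommends (the data points themselves are illustrative):

```python
# RBF kernel over the recommended grid (C in 1..100, gamma in 0.001..0.1);
# the example vectors are illustrative, not QSAR descriptors.
import math

def rbf_kernel(x, z, gamma):
    """k(x, z) = exp(-gamma * ||x - z||^2)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq)

C_grid = [1, 10, 100]            # penalty parameter C
gamma_grid = [0.001, 0.01, 0.1]  # RBF kernel width

x, z = [1.0, 0.0, 2.0], [0.0, 1.0, 2.0]
for gamma in gamma_grid:
    # Small gamma: wide kernel, distant points stay similar (k near 1).
    print(gamma, round(rbf_kernel(x, z, gamma), 3))
```

A default grid like this turns the three-parameter search (C, γ, signature height) into a small, fixed sweep, which is exactly the computational saving the abstract argues for.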


Subjects
Drug Evaluation, Preclinical/methods , Support Vector Machine , Benchmarking , Quantitative Structure-Activity Relationship
12.
J Chem Inf Model ; 54(10): 2647-53, 2014 Oct 27.
Article in English | MEDLINE | ID: mdl-25230336

ABSTRACT

When evaluating a potential drug candidate it is desirable to predict target interactions in silico prior to synthesis in order to assess, e.g., secondary pharmacology. This can be done by looking at known target binding profiles of similar compounds using chemical similarity searching. The purpose of this study was to construct and evaluate the performance of chemical fingerprints based on the molecular signature descriptor for performing target binding predictions. For the comparison we used the area under the receiver operating characteristics curve (AUC) complemented with net reclassification improvement (NRI). We created two open source signature fingerprints, a bit and a count version, and evaluated their performance compared to a set of established fingerprints with regards to predictions of binding targets using Tanimoto-based similarity searching on publicly available data sets extracted from ChEMBL. The results showed that the count version of the signature fingerprint performed on par with well-established fingerprints such as ECFP. The count version outperformed the bit version slightly; however, the count version is more complex and takes more computing time and memory to run so its usage should probably be evaluated on a case-by-case basis. The NRI based tests complemented the AUC based ones and showed signs of higher power.
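The bit and count fingerprint variants above use two forms of the Tanimoto coefficient: set overlap for bit vectors, and a min/max generalization for count vectors. A generic sketch (the formulas are standard; the fingerprints here are toy values, not the paper's signature descriptors):

```python
# Tanimoto similarity for bit and count fingerprints (generic formulas).

def tanimoto_bits(a, b):
    """|A & B| / |A | B| for sets of on-bits."""
    return len(a & b) / len(a | b)

def tanimoto_counts(a, b):
    """Min/max generalization for {feature: count} fingerprints."""
    keys = set(a) | set(b)
    num = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    den = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return num / den

bits_q, bits_t = {1, 4, 9}, {1, 4, 16}          # query vs. target on-bits
counts_q, counts_t = {1: 2, 4: 1}, {1: 1, 4: 1, 16: 3}

print(tanimoto_bits(bits_q, bits_t))       # 2/4 = 0.5
print(tanimoto_counts(counts_q, counts_t))
```

Similarity searching then ranks a target library by these scores against the query compound; the count version carries more information (how often a feature occurs), which matches its slightly better performance in the study at extra memory and compute cost.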


Subjects
Drug Design , Models, Chemical , Molecular Imprinting/methods , Software , Area Under Curve , Computer Simulation , Databases, Chemical , Ligands , Molecular Structure , ROC Curve
13.
Bioinformatics ; 29(2): 286-9, 2013 Jan 15.
Article in English | MEDLINE | ID: mdl-23178637

ABSTRACT

SUMMARY: Bioclipse, a graphical workbench for the life sciences, provides functionality for managing and visualizing life science data. We introduce Bioclipse-R, which integrates Bioclipse and the statistical programming language R. The synergy between Bioclipse and R is demonstrated by the construction of a decision support system for anticancer drug screening and mutagenicity prediction, which shows how Bioclipse-R can be used to perform complex tasks from within a single software system. AVAILABILITY AND IMPLEMENTATION: Bioclipse-R is implemented as a set of Java plug-ins for Bioclipse based on the R-package rj. Source code and binary packages are available from https://github.com/bioclipse and http://www.bioclipse.net/bioclipse-r, respectively. CONTACT: martin.eklund@farmbio.uu.se SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Biological Science Disciplines , Computer Graphics , Software , Antineoplastic Agents/chemistry , Antineoplastic Agents/pharmacology , Antineoplastic Agents/toxicity , Data Interpretation, Statistical , Mutagenesis , Programming Languages , Quantitative Structure-Activity Relationship , Systems Integration
14.
Curr Top Med Chem ; 12(18): 1980-6, 2012.
Article in English | MEDLINE | ID: mdl-23110533

ABSTRACT

We present the open source components for drug discovery that have been developed and integrated into the graphical workbench Bioclipse. Building on a solid open source cheminformatics core, Bioclipse has advanced functionality for managing and visualizing chemical structures and related information. The features presented here include QSAR/QSPR modeling, various predictive solutions such as decision support for chemical liability assessment, site-of-metabolism prediction, virtual screening, and knowledge discovery and integration. We demonstrate the utility of the described tools with examples from computational pharmacology, toxicology, and ADME. Bioclipse is used in both academia and industry, and is a good example of open source leading to new solutions for drug discovery.


Subjects
Drug Discovery , Software , Absorption , Algorithms , Decision Support Techniques , Drug Evaluation, Preclinical , Pharmacokinetics , Toxicology/methods
15.
J Cheminform ; 3(1): 37, 2011 Oct 14.
Article in English | MEDLINE | ID: mdl-21999342

ABSTRACT

BACKGROUND: The Blue Obelisk movement was established in 2005 as a response to the lack of Open Data, Open Standards and Open Source (ODOSOS) in chemistry. It aims to make it easier to carry out chemistry research by promoting interoperability between chemistry software, encouraging cooperation between Open Source developers, and developing community resources and Open Standards. RESULTS: This contribution looks back on the work carried out by the Blue Obelisk in the past 5 years and surveys progress and remaining challenges in the areas of Open Data, Open Standards, and Open Source in chemistry. CONCLUSIONS: We show that the Blue Obelisk has been very successful in bringing together researchers and developers with common interests in ODOSOS, leading to development of many useful resources freely available to the chemistry community.

16.
BMC Bioinformatics ; 12: 179, 2011 May 20.
Article in English | MEDLINE | ID: mdl-21599898

ABSTRACT

BACKGROUND: Compound profiling and drug screening generate large amounts of data and are generally based on microplate assays. Current information systems used for handling this are mainly commercial, closed source, expensive, and heavyweight, and there is a need for a flexible, lightweight, open system for handling plate design, data validation, and data preparation. RESULTS: A Bioclipse plugin consisting of a client part and a relational database was constructed. A multiple-step plate layout point-and-click interface was implemented inside Bioclipse. The system contains a data validation step, where outliers can be removed, and finally a plate report with all relevant calculated data, including dose-response curves. CONCLUSIONS: Brunn is capable of handling the data from microplate assays. It can create dose-response curves and calculate IC50 values. Using a system of this sort facilitates work in the laboratory. Being able to reuse already constructed plates and plate layouts by starting out from an earlier step in the plate layout design process saves time and cuts down on error sources.
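The dose-response curves and IC50 values mentioned above are conventionally fit with a four-parameter logistic (Hill) model; the IC50 is the concentration at which the response is halfway between the top and bottom plateaus. A sketch with simulated data (illustrative parameters, not Brunn's fitting code):

```python
# Dose-response sketch: four-parameter logistic (Hill) curve and the
# IC50 read-out (illustrative parameters).

def hill(conc, bottom, top, ic50, slope):
    """Response at a given concentration under a 4-parameter logistic."""
    return bottom + (top - bottom) / (1 + (conc / ic50) ** slope)

# Simulated survival (%) for a drug with a true IC50 of 1.0 uM:
doses = [0.01, 0.1, 1.0, 10.0, 100.0]
survival = [hill(d, 0.0, 100.0, 1.0, 1.0) for d in doses]

# At the IC50 the response is exactly halfway between top and bottom:
print(survival[2])  # 50.0
```

In practice the four parameters are estimated from the measured plate data by nonlinear least squares, and outlier wells are excluded in the validation step before fitting.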


Subjects
Cytological Techniques/methods , Drug Evaluation, Preclinical/methods , High-Throughput Screening Assays/methods , Software , Databases, Factual , Dose-Response Relationship, Drug
17.
J Biomed Semantics ; 2 Suppl 1: S6, 2011 Mar 07.
Article in English | MEDLINE | ID: mdl-21388575

ABSTRACT

BACKGROUND: Semantic web technologies are finding their way into the life sciences. Ontologies and semantic markup have already been used for more than a decade in molecular sciences, but have not yet found widespread use. The semantic web technology Resource Description Framework (RDF) and related methods are proving sufficiently versatile to change that situation. RESULTS: The work presented here focuses on linking RDF approaches to existing molecular chemometrics fields, including cheminformatics, QSAR modeling, and proteochemometrics. Applications are presented that link RDF technologies to methods from statistics and cheminformatics, including data aggregation, visualization, chemical identification, and property prediction. They demonstrate how this can be done using various existing RDF standards and cheminformatics libraries. For example, we show how IC50 and Ki values are modeled for a number of biological targets using data from the ChEMBL database. CONCLUSIONS: We have shown that existing RDF standards can suitably be integrated into existing molecular chemometrics methods. Platforms that unite these technologies, like Bioclipse, make this even simpler and more transparent. Being able to create and share workflows that integrate data aggregation and analysis (visual and statistical) is beneficial to interoperability and reproducibility. The current work shows that RDF approaches are sufficiently powerful to support molecular chemometrics workflows.
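RDF models data like the IC50 values above as subject-predicate-object triples that can then be pattern-matched. A minimal pure-Python sketch of that idea (the URIs and predicates are hypothetical; real work would use an RDF library and ChEMBL's actual vocabulary):

```python
# Triple sketch of an IC50 measurement (hypothetical URIs/predicates,
# not ChEMBL's real RDF schema).

triples = [
    ("chembl:CHEMBL25", "ex:hasActivity", "ex:act1"),
    ("ex:act1", "ex:targets", "chembl:CHEMBL204"),
    ("ex:act1", "ex:ic50_nM", "1500"),
]

def query(triples, subject=None, predicate=None):
    """Match triples by optional subject/predicate, SPARQL-style."""
    return [(s, p, o) for s, p, o in triples
            if (subject is None or s == subject)
            and (predicate is None or p == predicate)]

# "What is the IC50 recorded on activity act1?"
print(query(triples, subject="ex:act1", predicate="ex:ic50_nM"))
```

Because every fact is a uniform triple, data from different sources can be aggregated into one graph and queried with the same pattern mechanism, which is the interoperability benefit the abstract points to.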

18.
BMC Bioinformatics ; 10: 397, 2009 Dec 03.
Article in English | MEDLINE | ID: mdl-19958528

ABSTRACT

BACKGROUND: Contemporary biological research integrates neighboring scientific domains to answer complex questions in fields such as systems biology and drug discovery. This calls for tools that are intuitive to use, yet flexible to adapt to new tasks. RESULTS: Bioclipse is a free, open source workbench with advanced features for the life sciences. Version 2.0 constitutes a complete rewrite of Bioclipse, and delivers a stable, scalable integration platform for developers and an intuitive workbench for end users. All functionality is available both from the graphical user interface and from a built-in novel domain-specific language, supporting the scientist in interdisciplinary research and reproducible analyses through advanced visualization of the inputs and the results. New components for Bioclipse 2 include a rewritten editor for chemical structures, a table for multiple molecules that supports gigabyte-sized files, as well as a graphical editor for sequences and alignments. CONCLUSION: Bioclipse 2 is equipped with advanced tools required to carry out complex analysis in the fields of bio- and cheminformatics. Developed as a Rich Client based on Eclipse, Bioclipse 2 leverages today's powerful desktop computers to provide a responsive user interface, but also takes full advantage of the Web and networked (Web/Cloud) services for more demanding calculations or retrieval of data. The fact that Bioclipse 2 is based on an advanced and widely used service platform ensures wide extensibility, making it easy to add new algorithms and visualizations, as well as scripting commands. The intuitive tools for end users and the extensible architecture make Bioclipse 2 ideal for interdisciplinary and integrative research. Bioclipse 2 is released under the Eclipse Public License (EPL), a flexible open source license that allows additional plugins to be of any license. Bioclipse 2 is implemented in Java and supported on all major platforms; source code and binaries are freely available at http://www.bioclipse.net.


Subjects
Computational Biology/methods , Software , Algorithms , Biological Science Disciplines , Databases, Factual