Results 1 - 20 of 70
1.
Article in English | MEDLINE | ID: mdl-32750907

ABSTRACT

With the increasing amount of image data collected from biomedical experiments, there is an urgent need for smarter and more effective analysis methods. Many scientific questions require analysis of image subregions related to some specific biology. Finding such regions of interest (ROIs) at low resolution and limiting the data subjected to final quantification at high resolution can reduce computational requirements and save time. In this paper we propose a three-step pipeline: first, bounding boxes for ROIs are located at low resolution; next, ROIs are subjected to semantic segmentation into sub-regions at mid-resolution, where we also estimate the confidence of the segmented sub-regions; finally, quantitative measurements are extracted at high resolution. We use deep learning for the first two steps in the pipeline and conformal prediction for confidence assessment. We show that limiting the final quantitative analysis to sub-regions with high confidence reduces noise and increases separability of the observed biological effects.

2.
Gigascience ; 9(5)2020 May 01.
Article in English | MEDLINE | ID: mdl-32369166

ABSTRACT

BACKGROUND: Life science is increasingly driven by Big Data analytics, and the MapReduce programming model has proven successful for data-intensive analyses. However, current MapReduce frameworks offer poor support for reusing existing processing tools in bioinformatics pipelines. Furthermore, these frameworks lack native support for application containers, which are becoming popular in scientific data processing. RESULTS: Here we present MaRe, an open source programming library that introduces support for Docker containers in Apache Spark. Apache Spark and Docker are the MapReduce framework and container engine that have attracted the largest open source communities; thus, MaRe provides interoperability with a cutting-edge software ecosystem. We demonstrate MaRe on two data-intensive applications in life science, showing ease of use and scalability. CONCLUSIONS: MaRe enables scalable data-intensive processing in life science with Apache Spark and application containers. Compared with current best practices, which involve the use of workflow systems, MaRe has the advantage of providing data locality, ingestion from heterogeneous storage systems, and interactive processing. MaRe is generally applicable and available as open source software.
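The MapReduce pattern that MaRe builds on can be illustrated in a few lines of plain Python. This is a conceptual sketch only: in MaRe the map step runs a Docker-containerized command-line tool over each partition of a Spark RDD, whereas here the "tool" is stood in for by an ordinary function, and all names and data are illustrative.

```python
from functools import reduce

def partition(data, n):
    """Split `data` into at most n roughly equal partitions."""
    size = (len(data) + n - 1) // n
    return [data[i:i + size] for i in range(0, len(data), size)]

def map_partition(tool, part):
    """Apply a 'containerized tool' (here: a plain function) to one partition."""
    return [tool(record) for record in part]

def map_reduce(data, tool, combine, n_partitions=4):
    """Map each partition through the tool, then merge the results pairwise."""
    mapped = [map_partition(tool, p) for p in partition(data, n_partitions)]
    flat = [r for part in mapped for r in part]
    return reduce(combine, flat)

# Toy example: count molecular weights above a threshold, MapReduce-style.
weights = [180.2, 342.3, 94.1, 507.6, 151.2]
hits = map_reduce(weights,
                  tool=lambda w: 1 if w > 150 else 0,
                  combine=lambda a, b: a + b)
# hits == 4
```

In MaRe the same shape is expressed against Spark's RDD API, with the per-partition function replaced by a `docker run` of the wrapped tool, which is what lets existing bioinformatics binaries slot into the pipeline unchanged.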

3.
J Chem Inf Model ; 60(6): 2830-2837, 2020 Jun 22.
Article in English | MEDLINE | ID: mdl-32374618

ABSTRACT

Predictive modeling is a cornerstone of early drug development. Using information from multiple domains, or across prediction tasks, has the potential to improve the performance of predictive modeling. However, aggregating data often leads to incomplete data matrices that can be limiting for modeling. In line with previous studies, we show that by generating predicted bioactivity profiles and using these as additional features, the prediction accuracy of biological endpoints can be improved. Using conformal prediction, a type of confidence predictor, we present a robust framework for the calculation of these profiles and the evaluation of their impact. We report on the outcomes from several approaches to generating the predicted profiles on 16 datasets in cytotoxicity and bioactivity, and show that efficiency improves the most when the p-values from conformal prediction are included as bioactivity profiles.
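As a sketch of the mechanics behind such profiles: the conformal p-value for a class is the fraction of calibration examples whose nonconformity score is at least as large as the test example's score (with the usual +1 correction). The calibration scores, class labels, and test scores below are invented for illustration; they are not from the study.

```python
def conformal_p_value(calibration_scores, test_score):
    """Fraction of calibration scores at least as nonconforming as the
    test score, with the +1 correction for the test example itself."""
    greater_equal = sum(1 for s in calibration_scores if s >= test_score)
    return (greater_equal + 1) / (len(calibration_scores) + 1)

# Per-class calibration sets of nonconformity scores (e.g. 1 - predicted
# probability of that class from an underlying model). Illustrative only.
calibration = {
    "active":   [0.10, 0.20, 0.35, 0.40, 0.80],
    "inactive": [0.05, 0.15, 0.25, 0.60, 0.90],
}

# A test compound's nonconformity score under each class assumption.
test_scores = {"active": 0.30, "inactive": 0.70}

# The pair of p-values is the compound's "bioactivity profile" entry for
# this target, usable directly as extra model features.
profile = {label: conformal_p_value(calibration[label], test_scores[label])
           for label in calibration}
```

Concatenating such per-target p-value pairs across a panel of models yields the predicted bioactivity profile used as additional input features.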

4.
Gigascience ; 8(5)2019 05 01.
Article in English | MEDLINE | ID: mdl-31029061

ABSTRACT

BACKGROUND: The complex nature of biological data has driven the development of specialized software tools. Scientific workflow management systems simplify the assembly of such tools into pipelines, assist with job automation, and aid reproducibility of analyses. Many contemporary workflow tools are specialized or not designed for highly complex workflows, such as those with nested loops, dynamic scheduling, and parametrization, which are common in, e.g., machine learning. FINDINGS: SciPipe is a workflow programming library, implemented in the programming language Go, for managing complex and dynamic pipelines in bioinformatics, cheminformatics, and other fields. SciPipe helps in particular with workflow constructs common in machine learning, such as extensive branching, parameter sweeps, and dynamic scheduling and parametrization of downstream tasks. SciPipe builds on flow-based programming principles to support agile development of workflows based on a library of self-contained, reusable components. It supports running subsets of workflows for improved iterative development, and provides a data-centric audit logging feature that saves a full audit trace for every output file of a workflow; this trace can be converted to other formats such as HTML, TeX, and PDF on demand. The utility of SciPipe is demonstrated with a machine learning pipeline, a genomics pipeline, and a transcriptomics pipeline. CONCLUSIONS: SciPipe provides a solution for agile development of complex and dynamic pipelines, especially in machine learning, through a flexible application programming interface suitable for scientists used to programming or scripting.
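SciPipe itself is a Go library, but the flow-based idea it builds on — self-contained components wired together by their input/output dependencies, with downstream tasks scheduled once their inputs become available and an audit record kept per output — can be sketched in a few lines of Python. All task names and the toy computations are illustrative, not SciPipe's API.

```python
class Task:
    """A self-contained component: named inputs, named outputs, a run function."""
    def __init__(self, name, inputs, outputs, run):
        self.name, self.inputs, self.outputs, self.run = name, inputs, outputs, run

def execute(tasks):
    """Run tasks in dependency order; record an audit entry per output."""
    available, audit, done = {}, {}, set()
    while len(done) < len(tasks):
        progressed = False
        for t in tasks:
            if t.name in done or not all(i in available for i in t.inputs):
                continue  # not ready yet: some input is still missing
            results = t.run(*[available[i] for i in t.inputs])
            for out, val in zip(t.outputs, results):
                available[out] = val
                audit[out] = {"task": t.name, "inputs": list(t.inputs)}
            done.add(t.name)
            progressed = True
        if not progressed:
            raise RuntimeError("unsatisfiable dependencies")
    return available, audit

# Declaration order does not matter: "load" runs first because it has no inputs.
pipeline = [
    Task("train", ["features"], ["model"], lambda f: [sum(f) / len(f)]),
    Task("load", [], ["features"], lambda: [[1.0, 2.0, 3.0]]),
    Task("report", ["model"], ["summary"], lambda m: [f"mean={m}"]),
]
outputs, audit = execute(pipeline)
```

The `audit` dictionary mirrors, in miniature, SciPipe's per-output audit trace: for every produced artifact it records which component made it and from which inputs.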


Subjects
Computational Biology; Genomics; Software; Gene Library; Machine Learning; Programming Languages; Workflow
5.
Sci Rep ; 9(1): 4129, 2019 03 11.
Article in English | MEDLINE | ID: mdl-30858393

ABSTRACT

Huntington's disease (HD) is a severe neurological disease leading to psychiatric symptoms, motor impairment, and cognitive decline. The disease is caused by a CAG expansion in the huntingtin (HTT) gene, but how this translates into the clinical phenotype of HD remains elusive. Using liquid chromatography mass spectrometry, we analyzed the metabolome of cerebrospinal fluid (CSF) from premanifest and manifest HD subjects as well as control subjects. Inter-group differences revealed that tyrosine metabolism, including tyrosine, thyroxine, L-DOPA, and dopamine, was significantly altered in manifest compared with premanifest HD. These metabolites demonstrated moderate to strong associations with measures of disease severity and symptoms. Thyroxine and dopamine also correlated with the five-year risk of onset in premanifest HD subjects. Phenylalanine and purine metabolism were also significantly altered, but were less strongly associated with disease severity. Decreased levels of lumichrome were commonly found in mutated HTT carriers, and these levels correlated with the five-year risk of disease onset in premanifest carriers. These biochemical findings demonstrate that the CSF metabolome can be used to characterize the molecular pathogenesis occurring in HD, which may be essential for future development of novel HD therapies.

6.
Bioinformatics ; 35(19): 3752-3760, 2019 Oct 01.
Article in English | MEDLINE | ID: mdl-30851093

ABSTRACT

MOTIVATION: Developing a robust and performant data analysis workflow that integrates all necessary components, while still being able to scale over multiple compute nodes, is a challenging task. We introduce a generic method based on the microservice architecture, where software tools are encapsulated as Docker containers that can be connected into scientific workflows and executed using the Kubernetes container orchestrator. RESULTS: We developed a Virtual Research Environment (VRE) that facilitates rapid integration of new tools and the development of scalable and interoperable workflows for performing metabolomics data analysis. The environment can be launched on demand on cloud resources and desktop computers. IT-expertise requirements on the user side are kept to a minimum, and workflows can be reused effortlessly by any novice user. We validate our method in the field of metabolomics on two mass spectrometry studies, one nuclear magnetic resonance spectroscopy study, and one fluxomics study. We show that the method scales dynamically with increasing availability of computational resources, and demonstrate that it facilitates interoperability by integrating the major software suites into a turn-key workflow encompassing all steps of mass-spectrometry-based metabolomics, including preprocessing, statistics, and identification. Microservices is a generic methodology that can serve any scientific discipline and opens up new types of large-scale integrative science. AVAILABILITY AND IMPLEMENTATION: The PhenoMeNal consortium maintains a web portal (https://portal.phenomenal-h2020.eu) providing a GUI for launching the Virtual Research Environment. The GitHub repository https://github.com/phnmnl/ hosts the source code of all projects. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

7.
Cells ; 8(2)2019 01 24.
Article in English | MEDLINE | ID: mdl-30678351

ABSTRACT

To better understand the pathophysiological differences between secondary progressive multiple sclerosis (SPMS) and relapsing-remitting multiple sclerosis (RRMS), and to identify potential biomarkers of disease progression, we applied high-resolution mass spectrometry (HRMS) to investigate the metabolome of cerebrospinal fluid (CSF). The biochemical differences were determined using partial least squares discriminant analysis (PLS-DA), connected to biochemical pathways, and associated with clinical and radiological measures. Tryptophan metabolism was significantly altered, with perturbed levels of kynurenate, 5-hydroxytryptophan, 5-hydroxyindoleacetate, and N-acetylserotonin in SPMS patients compared with RRMS and controls. SPMS patients had altered kynurenine compared with RRMS patients, and altered indole-3-acetate compared with controls. Regarding pyrimidine metabolism, SPMS patients had altered levels of uridine and deoxyuridine compared with RRMS and controls, and altered thymine and glutamine compared with RRMS patients. Metabolites from pyrimidine metabolism were significantly associated with disability, disease activity, and brain atrophy, making them of particular interest for understanding the disease mechanisms and as markers of disease progression. Overall, these findings are of importance for the characterization of the molecular pathogenesis of SPMS and support the hypothesis that the CSF metabolome may be used to explore changes that occur in the transition between the RRMS and SPMS pathologies.


Subjects
Multiple Sclerosis, Relapsing-Remitting/cerebrospinal fluid; Adult; Case-Control Studies; Female; Humans; Kynurenine/metabolism; Male; Metabolome; Middle Aged; Multiple Sclerosis, Chronic Progressive/cerebrospinal fluid; Phenylalanine/metabolism; Pyrimidines/metabolism; Serotonin/metabolism; Tryptophan/metabolism
8.
SLAS Discov ; 24(4): 466-475, 2019 04.
Article in English | MEDLINE | ID: mdl-30641024

ABSTRACT

The quantification and identification of cellular phenotypes from high-content microscopy images has proven very useful for understanding biological activity in response to different drug treatments. The traditional approach has been to use classical image analysis to quantify changes in cell morphology, which requires several nontrivial and independent analysis steps. Recently, convolutional neural networks have emerged as a compelling alternative, offering good predictive performance and the possibility to replace traditional workflows with a single network architecture. In this study, we applied the pretrained deep convolutional neural networks ResNet50, InceptionV3, and InceptionResNetV2 to predict cell mechanisms of action in response to chemical perturbations for two cell profiling datasets from the Broad Bioimage Benchmark Collection. These networks were pretrained on ImageNet, enabling much quicker model training. We obtain higher predictive accuracy than previously reported, between 95% and 97%. The ability to quickly and accurately distinguish between different cell morphologies from a limited amount of labeled data illustrates the combined benefit of transfer learning and deep convolutional neural networks for interrogating cell-based images.


Subjects
Deep Learning; Neural Networks, Computer; Datasets as Topic; Humans; MCF-7 Cells
9.
Bioinformatics ; 35(5): 839-846, 2019 03 01.
Article in English | MEDLINE | ID: mdl-30101309

ABSTRACT

MOTIVATION: Computational biologists face many challenges related to data size, and they need to manage complicated analyses often including multiple stages and multiple tools, all of which must be deployed to modern infrastructures. To address these challenges and maintain reproducibility of results, researchers need (i) a reliable way to run processing stages in any computational environment, (ii) a well-defined way to orchestrate those processing stages and (iii) a data management layer that tracks data as it moves through the processing pipeline. RESULTS: Pachyderm is an open-source workflow system and data management framework that fulfils these needs by creating a data pipelining and data versioning layer on top of projects from the container ecosystem, having Kubernetes as the backbone for container orchestration. We adapted Pachyderm and demonstrated its attractive properties in bioinformatics. A Helm Chart was created so that researchers can use Pachyderm in multiple scenarios. The Pachyderm File System was extended to support block storage. A wrapper for initiating Pachyderm on cloud-agnostic virtual infrastructures was created. The benefits of Pachyderm are illustrated via a large metabolomics workflow, demonstrating that Pachyderm enables efficient and sustainable data science workflows while maintaining reproducibility and scalability. AVAILABILITY AND IMPLEMENTATION: Pachyderm is available from https://github.com/pachyderm/pachyderm. The Pachyderm Helm Chart is available from https://github.com/kubernetes/charts/tree/master/stable/pachyderm. Pachyderm is available out-of-the-box from the PhenoMeNal VRE (https://github.com/phnmnl/KubeNow-plugin) and general Kubernetes environments instantiated via KubeNow. The code of the workflow used for the analysis is available on GitHub (https://github.com/pharmbio/LC-MS-Pachyderm). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Computational Biology; Software; Ecosystem; Reproducibility of Results; Workflow
10.
Cytometry A ; 95(4): 366-380, 2019 04.
Article in English | MEDLINE | ID: mdl-30565841

ABSTRACT

Artificial intelligence, deep convolutional neural networks, and deep learning are all niche terms that are increasingly appearing in scientific presentations as well as in the general media. In this review, we focus on deep learning and how it is applied to microscopy image data of cells and tissue samples. Starting with an analogy to neuroscience, we aim to give the reader an overview of the key concepts of neural networks, and an understanding of how deep learning differs from more classical approaches for extracting information from image data. We aim to increase the understanding of these methods, while highlighting considerations regarding input data requirements, computational resources, challenges, and limitations. We do not provide a full manual for applying these methods to your own data, but rather review previously published articles on deep learning in image cytometry, and guide the readers toward further reading on specific networks and methods, including new methods not yet applied to cytometry data. © 2018 The Authors. Cytometry Part A published by Wiley Periodicals, Inc. on behalf of International Society for Advancement of Cytometry.

11.
Gigascience ; 8(2)2019 02 01.
Article in English | MEDLINE | ID: mdl-30535405

ABSTRACT

BACKGROUND: Metabolomics is the comprehensive study of a multitude of small molecules to gain insight into an organism's metabolism. The research field is dynamic and expanding, with applications across biomedical, biotechnological, and many other applied biological domains. Its computationally intensive nature has driven requirements for open data formats, data repositories, and data analysis tools. However, the rapid progress has resulted in a mosaic of independent, and sometimes incompatible, analysis methods that are difficult to connect into a useful and complete data analysis solution. FINDINGS: PhenoMeNal (Phenome and Metabolome aNalysis) is an advanced and complete solution for setting up Infrastructure-as-a-Service (IaaS) that brings workflow-oriented, interoperable metabolomics data analysis platforms into the cloud. PhenoMeNal seamlessly integrates a wide array of existing open-source tools that are tested and packaged as Docker containers through the project's continuous integration process and deployed on a Kubernetes orchestration framework. It also provides a number of standardized, automated, and published analysis workflows in the user interfaces Galaxy, Jupyter, Luigi, and Pachyderm. CONCLUSIONS: PhenoMeNal constitutes a keystone solution among the cloud e-infrastructures available for metabolomics: it can be set up through easy-to-use web interfaces and scaled to any custom public or private cloud environment. By harmonizing and automating software installation and configuration, and through ready-to-use scientific workflow user interfaces, PhenoMeNal provides scientists with workflow-driven, reproducible, and shareable metabolomics data analysis platforms that are interfaced through standard data formats, ship with representative datasets, are versioned, and have been tested for reproducibility and interoperability. The elastic implementation of PhenoMeNal further allows easy adaptation of the infrastructure to other application areas and 'omics research domains.


Subjects
Metabolomics/methods; Software; Cloud Computing; Humans; Workflow
12.
Front Pharmacol ; 9: 1256, 2018.
Article in English | MEDLINE | ID: mdl-30459617

ABSTRACT

Ligand-based models can be used in drug discovery to obtain an early indication of potential off-target interactions that could be linked to adverse effects. Another application is to combine such models into a panel, allowing users to compare and search for compounds with similar profiles. Most contemporary methods and implementations, however, lack valid measures of confidence in their predictions and only provide point predictions. Here we describe a methodology that uses conformal prediction for predicting off-target interactions, with models trained on data from 31 targets in the ExCAPE-DB dataset, selected for their utility in broad early hazard assessment. Chemicals were represented by the signature molecular descriptor, and support vector machines were used as the underlying machine learning method. With conformal prediction, predictions come in the form of confidence p-values for each class. The full pre-processing and model training process is openly available as scientific workflows on GitHub, rendering it fully reproducible. We illustrate the usefulness of the developed methodology on a set of compounds extracted from DrugBank. The resulting models are published online and are available via a graphical web interface and an OpenAPI interface for programmatic access.

13.
J Cheminform ; 10(1): 49, 2018 Oct 11.
Article in English | MEDLINE | ID: mdl-30306349

ABSTRACT

Ligand-based predictive modeling is widely used to generate predictive models that aid decision making in, e.g., drug discovery projects. With growing data sets and requirements for low modeling time comes the necessity to analyze data sets efficiently, to support rapid and robust modeling. In this study we analyzed four data sets and studied the efficiency of machine learning methods on sparse data structures, utilizing Morgan fingerprints of different radii and hash sizes, and compared with the molecular signatures descriptor of different heights. We specifically evaluated the effect these parameters had on modeling time, predictive performance, and memory requirements using two implementations of random forest, Scikit-learn and FEST, and also compared with a support vector machine implementation. Our results showed that unhashed fingerprints yield significantly better accuracy than hashed fingerprints ([Formula: see text]), with no pronounced deterioration in modeling time and memory usage. Furthermore, the fast execution and low memory usage of the FEST algorithm suggest that it is a good alternative for large, high-dimensional sparse data. Both support vector machines and random forest performed equally well, but results indicate that the support vector machine was better at using the extra information from larger values of the Morgan fingerprint's radius.
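The accuracy gap between unhashed and hashed fingerprints comes from folding collisions, which can be sketched as follows. Real Morgan fingerprints are produced by a cheminformatics toolkit such as RDKit; the feature identifiers below are arbitrary stand-ins for substructure hashes.

```python
def fold(sparse_counts, n_bits):
    """Fold a sparse {feature_id: count} map into a fixed-length vector.
    Distinct features may collide in the same bucket, which is the source
    of the accuracy loss the study measures for hashed fingerprints."""
    vec = [0] * n_bits
    for feature_id, count in sparse_counts.items():
        vec[feature_id % n_bits] += count
    return vec

# Unhashed representation: exact substructure identifiers and counts.
unhashed = {734412: 2, 98231: 1, 734412 + 1024: 1}

hashed = fold(unhashed, n_bits=1024)
# 734412 and 734412 + 1024 land in the same bucket: a collision the
# unhashed (dictionary) representation would have kept apart.
```

Larger `n_bits` makes collisions rarer at the cost of memory, which is exactly the hash-size trade-off the study varies.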

14.
J Cheminform ; 10(1): 46, 2018 Sep 06.
Article in English | MEDLINE | ID: mdl-30191348
15.
Theranostics ; 8(16): 4477-4490, 2018.
Article in English | MEDLINE | ID: mdl-30214633

ABSTRACT

Molecular networks in neurological diseases are complex. Despite this fact, contemporary biomarkers are in most cases interpreted in isolation, leading to a significant loss of information and power. We present an analytical approach to scrutinize and combine information from biomarkers originating from multiple sources, with the aim of discovering a condensed set of biomarkers that in combination could distinguish the progressive degenerative phenotype of multiple sclerosis (SPMS) from the relapsing-remitting phenotype (RRMS). Methods: Clinical and magnetic resonance imaging (MRI) data were integrated with data from protein and metabolite measurements of cerebrospinal fluid, and a method was developed to sift through all the variables to establish a small set of highly informative measurements. This prospective study included 16 SPMS patients, 30 RRMS patients, and 10 controls. Protein concentrations were quantitated with multiplexed fluorescent bead-based immunoassays and ELISA. The metabolome was recorded using liquid chromatography-mass spectrometry. Clinical follow-up data of the SPMS patients were used to assess disease progression and development of disability. Results: Eleven variables were in combination able to distinguish SPMS from RRMS patients with high confidence, superior to any single measurement. The identified variables consisted of three MRI variables: the size of the spinal cord, the size of the third ventricle, and the total number of T1 hypointense lesions; six proteins: galectin-9, monocyte chemoattractant protein-1 (MCP-1), transforming growth factor alpha (TGF-α), tumor necrosis factor alpha (TNF-α), soluble CD40L (sCD40L), and platelet-derived growth factor AA (PDGF-AA); and two metabolites: 20β-dihydrocortisol (20β-DHF) and indolepyruvate. The proteins myelin basic protein (MBP) and macrophage-derived chemokine (MDC), as well as the metabolites 20β-DHF and 5,6-dihydroxyprostaglandin F1a (5,6-DH-PGF1), were identified as potential biomarkers of disability progression. Conclusion: Our study demonstrates, in a limited but well-defined and data-rich cohort, the importance and value of combining multiple biomarkers to aid diagnostics and track disease progression.


Subjects
Biological Factors/analysis; Biomarkers/analysis; Biomarkers/cerebrospinal fluid; Cerebrospinal Fluid/chemistry; Magnetic Resonance Imaging/methods; Multiple Sclerosis, Chronic Progressive/diagnosis; Proteins/analysis; Adult; Aged; Chromatography, Liquid; Early Diagnosis; Female; Humans; Immunoassay; Male; Mass Spectrometry; Metabolomics; Middle Aged; Multiple Sclerosis, Chronic Progressive/diagnostic imaging; Prospective Studies; Proteomics
16.
J Chem Inf Model ; 58(5): 1132-1140, 2018 05 29.
Article in English | MEDLINE | ID: mdl-29701973

ABSTRACT

Making predictions with an associated confidence is highly desirable as it facilitates decision making and resource prioritization. Conformal regression is a machine learning framework that allows the user to define the required confidence and delivers predictions that are guaranteed to be correct to the selected extent. In this study, we apply conformal regression to model molecular properties and bioactivity values and investigate different ways to scale the resultant prediction intervals to create as efficient (i.e., narrow) regressors as possible. Different algorithms to estimate the prediction uncertainty were used to normalize the prediction ranges, and the different approaches were evaluated on 29 publicly available data sets. Our results show that the most efficient conformal regressors are obtained when using the natural exponential of the ensemble standard deviation from the underlying random forest to scale the prediction intervals, but other approaches were almost as efficient. This approach afforded an average prediction range of 1.65 pIC50 units at the 80% confidence level when applied to bioactivity modeling. The choice of nonconformity function has a pronounced impact on the average prediction range with a difference of close to one log unit in bioactivity between the tightest and widest prediction range. Overall, conformal regression is a robust approach to generate bioactivity predictions with associated confidence.
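A minimal sketch of the normalized conformal regression setup described above: nonconformity is the absolute error divided by exp(sigma), and the prediction interval half-width is the calibration quantile scaled back by exp(sigma) for each new example, so more uncertain predictions get wider intervals. The residuals and ensemble standard deviations below are invented for illustration; in the study, sigma is the standard deviation of the underlying random forest ensemble.

```python
import math

def calibrate(residuals, sigmas, confidence):
    """Nonconformity alpha_i = |error| / exp(sigma); return the calibration
    quantile needed for the requested confidence level."""
    alphas = sorted(abs(r) / math.exp(s) for r, s in zip(residuals, sigmas))
    k = math.ceil(confidence * (len(alphas) + 1)) - 1
    return alphas[min(k, len(alphas) - 1)]

def predict_interval(y_hat, sigma, alpha_q):
    """Interval half-width scales with the per-example uncertainty estimate."""
    half_width = alpha_q * math.exp(sigma)
    return (y_hat - half_width, y_hat + half_width)

# Calibration-set residuals and per-example ensemble stds (illustrative).
residuals = [0.2, -0.5, 0.1, 0.8, -0.3, 0.4, -0.6, 0.05, 0.9, -0.2]
sigmas    = [0.3,  0.6, 0.2, 0.9,  0.4, 0.5,  0.7, 0.10, 1.0,  0.3]

q = calibrate(residuals, sigmas, confidence=0.8)
low, high = predict_interval(y_hat=6.5, sigma=0.4, alpha_q=q)
# A test example with larger ensemble spread gets a wider interval
# at the same confidence level.
```

An unnormalized conformal regressor is the special case sigma = 0 for every example, which gives every prediction the same interval width.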


Subjects
Informatics/methods; Machine Learning; Quantitative Structure-Activity Relationship; Uncertainty; Decision Making
17.
J Cheminform ; 10(1): 17, 2018 Apr 03.
Article in English | MEDLINE | ID: mdl-29616425

ABSTRACT

Lipophilicity is a major determinant of ADMET properties and of the overall suitability of drug candidates. We have developed large-scale models to predict the water-octanol distribution coefficient (logD) for chemical compounds, aiding drug discovery projects. Using ACD/logD data for 1.6 million compounds from the ChEMBL database, models were created and evaluated with a support vector machine with a linear kernel, using conformal prediction methodology to output prediction intervals at a specified confidence level. The resulting model shows a predictive ability of [Formula: see text], with the best-performing nonconformity measure yielding a median prediction interval of [Formula: see text] log units at 80% confidence and [Formula: see text] log units at 90% confidence. The model is available as an online service via an OpenAPI interface and a web page with a molecular editor, and we also publish predicted values at the 90% confidence level for 91 million PubChem structures in RDF format, for download and via a URI resolver service.

18.
Gigascience ; 7(5)2018 05 01.
Article in English | MEDLINE | ID: mdl-29659792

ABSTRACT

Background: Next-generation sequencing (NGS) has transformed the life sciences, and many research groups are newly dependent upon computer clusters to store and analyze large datasets. This creates challenges for e-infrastructures accustomed to hosting computationally mature research in other sciences. Using data gathered from our own clusters at UPPMAX computing center at Uppsala University, Sweden, where core hour usage of ∼800 NGS and ∼200 non-NGS projects is now similar, we compare and contrast the growth, administrative burden, and cluster usage of NGS projects with projects from other sciences. Results: The number of NGS projects has grown rapidly since 2010, with growth driven by entry of new research groups. Storage used by NGS projects has grown more rapidly since 2013 and is now limited by disk capacity. NGS users submit nearly twice as many support tickets per user, and 11 more tools are installed each month for NGS projects than for non-NGS projects. We developed usage and efficiency metrics and show that computing jobs for NGS projects use more RAM than non-NGS projects, are more variable in core usage, and rarely span multiple nodes. NGS jobs use booked resources less efficiently for a variety of reasons. Active monitoring can improve this somewhat. Conclusions: Hosting NGS projects imposes a large administrative burden at UPPMAX due to large numbers of inexperienced users and diverse and rapidly evolving research areas. We provide a set of recommendations for e-infrastructures that host NGS research projects. We provide anonymized versions of our storage, job, and efficiency databases.


Subjects
Biological Science Disciplines/methods; Computational Biology/methods; High-Throughput Nucleotide Sequencing/methods; Research; Software
19.
J Cheminform ; 10(1): 8, 2018 Mar 01.
Article in English | MEDLINE | ID: mdl-29492726

ABSTRACT

BACKGROUND: Docking and scoring large libraries of ligands against target proteins forms the basis of structure-based virtual screening. The problem is trivially parallelizable, and calculations are generally carried out on computer clusters or large workstations in a brute-force manner, by docking and scoring all available ligands. CONTRIBUTION: In this study we propose a strategy based on iteratively docking a set of ligands to form a training set, training a ligand-based model on this set, and predicting the remainder of the ligands in order to exclude those predicted as low-scoring ligands. Another set of ligands is then docked, the model is retrained, and the process is repeated until a certain model efficiency level is reached. Thereafter, the remaining ligands are docked or excluded based on this model. We use SVM and conformal prediction to deliver valid prediction intervals for ranking the predicted ligands, and Apache Spark to parallelize both the docking and the modeling. RESULTS: We show on four different targets that conformal prediction based virtual screening (CPVS) is able to reduce the number of docked molecules by 62.61% while retaining an average accuracy of 94% for the top 30 hits and achieving a speedup of 3.7. The implementation is available as open source via GitHub (https://github.com/laeeq80/spark-cpvs) and can be run on high-performance computers as well as on cloud resources.
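The iterative dock-train-filter loop can be sketched with a toy scoring function standing in for both the docking software and the trained surrogate. In the study these are a molecular docking program and an SVM with conformal prediction running on Apache Spark; here, to keep the sketch runnable, the "surrogate" cheats by reusing the same toy function it would normally have to learn, and all ligand features and numbers are invented.

```python
def dock(ligand):
    """Toy docking score (lower = better); stands in for real docking."""
    return 2.0 * ligand["rings"] - 0.5 * ligand["rotatable_bonds"]

def score_cutoff(scored, keep_fraction=0.5):
    """'Train' the surrogate: derive a cutoff keeping the best-scoring fraction."""
    scores = sorted(s for _, s in scored)
    return scores[int(len(scores) * keep_fraction)]

ligands = [{"rings": r, "rotatable_bonds": b}
           for b in range(4) for r in range(4)]   # 16 toy ligands

docked, remaining = [], list(ligands)
for _ in range(2):                                # two docking iterations
    batch, remaining = remaining[:4], remaining[4:]
    docked += [(lig, dock(lig)) for lig in batch]  # dock this batch for real
    cutoff = score_cutoff(docked)
    # The surrogate (here the same toy function) filters likely low scorers,
    # so they are never sent to the expensive docking step.
    remaining = [lig for lig in remaining if dock(lig) <= cutoff]

n_avoided = len(ligands) - len(docked) - len(remaining)
```

In CPVS the filter is a conformal prediction interval rather than a bare cutoff, which gives a validity guarantee on how often a truly high-scoring ligand is wrongly discarded.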

20.
J Cheminform ; 9(1): 33, 2017 Jun 06.
Article in English | MEDLINE | ID: mdl-29086040

ABSTRACT

BACKGROUND: The Chemistry Development Kit (CDK) is a widely used open source cheminformatics toolkit, providing data structures to represent chemical concepts along with methods to manipulate such structures and perform computations on them. The library implements a wide variety of cheminformatics algorithms, ranging from chemical structure canonicalization to molecular descriptor calculations and pharmacophore perception, and is used in drug discovery, metabolomics, and toxicology. Over the last 10 years, however, the code base has grown significantly, resulting in many complex interdependencies among components and poor performance of many algorithms. RESULTS: We report improvements to CDK v2.0 since the v1.2 release series, specifically addressing the increased functional complexity and poor performance. We first summarize the addition of new functionality, such as atom typing and molecular formula handling, and improvements to existing functionality that have led to significantly better performance for substructure searching, molecular fingerprints, and rendering of molecules. Second, we outline how the CDK has evolved with respect to quality control and the approaches we have adopted to ensure stability, including a code review mechanism. CONCLUSIONS: This paper highlights our continued efforts to provide a community-driven, open source cheminformatics library, and shows that such collaborative projects can thrive over extended periods of time, resulting in a high-quality and performant library. By taking advantage of community support and contributions, we show that an open source cheminformatics project can act as a peer-reviewed publishing platform for scientific computing software. Graphical abstract: CDK 2.0 provides new features and improved performance.
