Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 33
Filtrar
1.
Nucleic Acids Res ; 52(W1): W294-W298, 2024 Jul 05.
Artigo em Inglês | MEDLINE | ID: mdl-38619040

RESUMO

When preparing biomolecular structures for molecular dynamics simulations, pKa calculations are required to provide at least a representative protonation state at a given pH value. Neglecting this step and adopting the reference protonation states of the amino acid residues in water, often leads to wrong electrostatics and nonphysical simulations. Fortunately, several methods have been developed to prepare structures considering the protonation preference of residues in their specific environments (pKa values), and some are even available for online usage. In this work, we present the PypKa server, which allows users to run physics-based, as well as ML-accelerated methods suitable for larger systems, to obtain pKa values, isoelectric points, titration curves, and structures with representative pH-dependent protonation states compatible with commonly used force fields (AMBER, CHARMM, GROMOS). The user may upload a custom structure or submit an identifier code from PBD or UniProtKB. The results for over 200k structures taken from the Protein Data Bank and the AlphaFold DB have been precomputed, and their data can be retrieved without extra calculations. All this information can also be obtained from an application programming interface (API) facilitating its usage and integration into existing pipelines as well as other web services. The web server is available at pypka.org.


Assuntos
Bases de Dados de Proteínas , Internet , Simulação de Dinâmica Molecular , Software , Concentração de Íons de Hidrogênio , Conformação Proteica , Proteínas/química , Prótons , Eletricidade Estática
2.
Bioinformatics ; 40(1)2024 01 02.
Artigo em Inglês | MEDLINE | ID: mdl-38175786

RESUMO

SUMMARY: We created bigwig-loader, a data-loader for epigenetic profiles from BigWig files that decompresses and processes information for multiple intervals from multiple BigWig files in parallel. This is an access pattern needed to create training batches for typical machine learning models on epigenetics data. Using a new codec, the decompression can be done on a graphical processing unit (GPU) making it fast enough to create the training batches during training, mitigating the need for saving preprocessed training examples to disk. AVAILABILITY AND IMPLEMENTATION: The bigwig-loader installation instructions and source code can be accessed at https://github.com/pfizer-opensource/bigwig-loader.


Assuntos
Epigenômica , Software , Epigênese Genética
3.
J Chem Inf Model ; 64(7): 2331-2344, 2024 Apr 08.
Artigo em Inglês | MEDLINE | ID: mdl-37642660

RESUMO

Federated multipartner machine learning has been touted as an appealing and efficient method to increase the effective training data volume and thereby the predictivity of models, particularly when the generation of training data is resource-intensive. In the landmark MELLODDY project, indeed, each of ten pharmaceutical companies realized aggregated improvements on its own classification or regression models through federated learning. To this end, they leveraged a novel implementation extending multitask learning across partners, on a platform audited for privacy and security. The experiments involved an unprecedented cross-pharma data set of 2.6+ billion confidential experimental activity data points, documenting 21+ million physical small molecules and 40+ thousand assays in on-target and secondary pharmacodynamics and pharmacokinetics. Appropriate complementary metrics were developed to evaluate the predictive performance in the federated setting. In addition to predictive performance increases in labeled space, the results point toward an extended applicability domain in federated learning. Increases in collective training data volume, including by means of auxiliary data resulting from single concentration high-throughput and imaging assays, continued to boost predictive performance, albeit with a saturating return. Markedly higher improvements were observed for the pharmacokinetics and safety panel assay-based task subsets.


Assuntos
Benchmarking , Relação Quantitativa Estrutura-Atividade , Bioensaio , Aprendizado de Máquina
4.
Chem Res Toxicol ; 2023 Sep 10.
Artigo em Inglês | MEDLINE | ID: mdl-37690056

RESUMO

Predictive modeling of toxicity is a crucial step in the drug discovery pipeline. It can help filter out molecules with a high probability of failing in the early stages of de novo drug design. Thus, several machine learning (ML) models have been developed to predict the toxicity of molecules by combining classical ML techniques or deep neural networks with well-known molecular representations such as fingerprints or 2D graphs. But the more natural, accurate representation of molecules is expected to be defined in physical 3D space like in ab initio methods. Recent studies successfully used equivariant graph neural networks (EGNNs) for representation learning based on 3D structures to predict quantum-mechanical properties of molecules. Inspired by this, we investigated the performance of EGNNs to construct reliable ML models for toxicity prediction. We used the equivariant transformer (ET) model in TorchMD-NET for this. Eleven toxicity data sets taken from MoleculeNet, TDCommons, and ToxBenchmark have been considered to evaluate the capability of ET for toxicity prediction. Our results show that ET adequately learns 3D representations of molecules that can successfully correlate with toxicity activity, achieving good accuracies on most data sets comparable to state-of-the-art models. We also test a physicochemical property, namely, the total energy of a molecule, to inform the toxicity prediction with a physical prior. However, our work suggests that these two properties can not be related. We also provide an attention weight analysis for helping to understand the toxicity prediction in 3D space and thus increase the explainability of the ML model. In summary, our findings offer promising insights considering 3D geometry information via EGNNs and provide a straightforward way to integrate molecular conformers into ML-based pipelines for predicting and investigating toxicity prediction in physical space. We expect that in the future, especially for larger, more diverse data sets, EGNNs will be an essential tool in this domain.

5.
Bioinformatics ; 38(1): 297-298, 2021 12 22.
Artigo em Inglês | MEDLINE | ID: mdl-34260689

RESUMO

SUMMARY: pKa values of ionizable residues and isoelectric points of proteins provide valuable local and global insights about their structure and function. These properties can be estimated with reasonably good accuracy using Poisson-Boltzmann and Monte Carlo calculations at a considerable computational cost (from some minutes to several hours). pKPDB is a database of over 12 M theoretical pKa values calculated over 120k protein structures deposited in the Protein Data Bank. By providing precomputed pKa and pI values, users can retrieve results instantaneously for their protein(s) of interest while also saving countless hours and resources that would be spent on repeated calculations. Furthermore, there is an ever-growing imbalance between experimental pKa and pI values and the number of resolved structures. This database will complement the experimental and computational data already available and can also provide crucial information regarding buried residues that are under-represented in experimental measurements. AVAILABILITY AND IMPLEMENTATION: Gzipped csv files containing p Ka and isoelectric point values can be downloaded from https://pypka.org/pKPDB. To query a single PDB code please use the PypKa free server at https://pypka.org. The pKPDB source code can be found at https://github.com/mms-fcul/pKPDB. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Proteínas , Software , Bases de Dados de Proteínas
6.
Bioinformatics ; 37(6): 861-867, 2021 05 05.
Artigo em Inglês | MEDLINE | ID: mdl-33241296

RESUMO

MOTIVATION: Image-based profiling combines high-throughput screening with multiparametric feature analysis to capture the effect of perturbations on biological systems. This technology has attracted increasing interest in the field of plant phenotyping, promising to accelerate the discovery of novel herbicides. However, the extraction of meaningful features from unlabeled plant images remains a big challenge. RESULTS: We describe a novel data-driven approach to find feature representations from plant time-series images in a self-supervised manner by using time as a proxy for image similarity. In the spirit of transfer learning, we first apply an ImageNet-pretrained architecture as a base feature extractor. Then, we extend this architecture with a triplet network to refine and reduce the dimensionality of extracted features by ranking relative similarities between consecutive and non-consecutive time points. Without using any labels, we produce compact, organized representations of plant phenotypes and demonstrate their superior applicability to clustering, image retrieval and classification tasks. Besides time, our approach could be applied using other surrogate measures of phenotype similarity, thus providing a versatile method of general interest to the phenotypic profiling community. AVAILABILITY AND IMPLEMENTATION: Source code is provided in https://github.com/bayer-science-for-a-better-life/plant-triplet-net. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Plantas , Software , Análise por Conglomerados
7.
Bioinformatics ; 36(13): 4093-4094, 2020 07 01.
Artigo em Inglês | MEDLINE | ID: mdl-32369561

RESUMO

SUMMARY: Optimizing small molecules in a drug discovery project is a notoriously difficult task as multiple molecular properties have to be considered and balanced at the same time. In this work, we present our novel interactive in silico compound optimization platform termed grünifai to support the ideation of the next generation of compounds under the constraints of a multiparameter objective. grünifai integrates adjustable in silico models, a continuous representation of the chemical space, a scalable particle swarm optimization algorithm and the possibility to actively steer the compound optimization through providing feedback on generated intermediate structures. AVAILABILITY AND IMPLEMENTATION: Source code and documentation are freely available under an MIT license and are openly available on GitHub (https://github.com/jrwnter/gruenifai). The backend, including the optimization method and distribution on multiple GPU nodes is written in Python 3. The frontend is written in ReactJS.


Assuntos
Algoritmos , Software , Simulação por Computador , Documentação , Projetos de Pesquisa
8.
Int J Mol Sci ; 22(23)2021 Nov 28.
Artigo em Inglês | MEDLINE | ID: mdl-34884688

RESUMO

In silico protein-ligand binding prediction is an ongoing area of research in computational chemistry and machine learning based drug discovery, as an accurate predictive model could greatly reduce the time and resources necessary for the detection and prioritization of possible drug candidates. Proteochemometric modeling (PCM) attempts to create an accurate model of the protein-ligand interaction space by combining explicit protein and ligand descriptors. This requires the creation of information-rich, uniform and computer interpretable representations of proteins and ligands. Previous studies in PCM modeling rely on pre-defined, handcrafted feature extraction methods, and many methods use protein descriptors that require alignment or are otherwise specific to a particular group of related proteins. However, recent advances in representation learning have shown that unsupervised machine learning can be used to generate embeddings that outperform complex, human-engineered representations. Several different embedding methods for proteins and molecules have been developed based on various language-modeling methods. Here, we demonstrate the utility of these unsupervised representations and compare three protein embeddings and two compound embeddings in a fair manner. We evaluate performance on various splits of a benchmark dataset, as well as on an internal dataset of protein-ligand binding activities and find that unsupervised-learned representations significantly outperform handcrafted representations.


Assuntos
Quimioinformática/métodos , Proteínas/metabolismo , Aprendizado de Máquina não Supervisionado , Ligantes , Relação Quantitativa Estrutura-Atividade
9.
Bioinformatics ; 35(13): 2309-2310, 2019 07 01.
Artigo em Inglês | MEDLINE | ID: mdl-30445568

RESUMO

SUMMARY: Single-guide RNAs (sgRNAs) targeting the same gene can significantly vary in terms of efficacy and specificity. PAVOOC (Prediction And Visualization of On- and Off-targets for CRISPR) is a web-based CRISPR sgRNA design tool that employs state of the art machine learning models to prioritize most effective candidate sgRNAs. In contrast to other tools, it maps sgRNAs to functional domains and protein structures and visualizes cut sites on corresponding protein crystal structures. Furthermore, PAVOOC supports homology-directed repair template generation for genome editing experiments and the visualization of the mutated amino acids in 3D. AVAILABILITY AND IMPLEMENTATION: PAVOOC is available under https://pavooc.me and accessible using modern browsers (Chrome/Chromium recommended). The source code is hosted at github.com/moritzschaefer/pavooc under the MIT License. The backend, including data processing steps, and the frontend are implemented in Python 3 and ReactJS, respectively. All components run in a simple Docker environment. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Repetições Palindrômicas Curtas Agrupadas e Regularmente Espaçadas , Algoritmos , Sistemas CRISPR-Cas , Edição de Genes , RNA Guia de Cinetoplastídeos , Software
10.
J Chem Inf Model ; 59(3): 1163-1171, 2019 03 25.
Artigo em Inglês | MEDLINE | ID: mdl-30840449

RESUMO

Predicting the outcome of biological assays based on high-throughput imaging data is a highly promising task in drug discovery since it can tremendously increase hit rates and suggest novel chemical scaffolds. However, end-to-end learning with convolutional neural networks (CNNs) has not been assessed for the task biological assay prediction despite the success of these networks at visual recognition. We compared several CNNs trained directly on high-throughput imaging data to a) CNNs trained on cell-centric crops and to b) the current state-of-the-art: fully connected networks trained on precalculated morphological cell features. The comparison was performed on the Cell Painting data set, the largest publicly available data set of microscopic images of cells with approximately 30,000 compound treatments. We found that CNNs perform significantly better at predicting the outcome of assays than fully connected networks operating on precomputed morphological features of cells. Surprisingly, the best performing method could predict 32% of the 209 biological assays at high predictive performance (AUC > 0.9) indicating that the cell morphology changes contain a large amount of information about compound activities. Our results suggest that many biological assays could be replaced by high-throughput imaging together with convolutional neural networks and that the costly cell segmentation and feature extraction step can be replaced by convolutional neural networks.


Assuntos
Bioensaio , Biologia Computacional/métodos , Microscopia , Redes Neurais de Computação , Processamento de Imagem Assistida por Computador
11.
Molecules ; 25(1)2019 Dec 21.
Artigo em Inglês | MEDLINE | ID: mdl-31877719

RESUMO

Simple physico-chemical properties, like logD, solubility, or melting point, can reveal a great deal about how a compound under development might later behave. These data are typically measured for most compounds in drug discovery projects in a medium throughput fashion. Collecting and assembling all the Bayer in-house data related to these properties allowed us to apply powerful machine learning techniques to predict the outcome of those assays for new compounds. In this paper, we report our finding that, especially for predicting physicochemical ADMET endpoints, a multitask graph convolutional approach appears a highly competitive choice. For seven endpoints of interest, we compared the performance of that approach to fully connected neural networks and different single task models. The new model shows increased predictive performance compared to previous modeling methods and will allow early prioritization of compounds even before they are synthesized. In addition, our model follows the generalized solubility equation without being explicitly trained under this constraint.


Assuntos
Descoberta de Drogas/métodos , Preparações Farmacêuticas/química , Algoritmos , Aprendizado de Máquina , Modelos Químicos , Redes Neurais de Computação , Preparações Farmacêuticas/síntese química , Relação Quantitativa Estrutura-Atividade
12.
Bioinformatics ; 33(14): i59-i66, 2017 Jul 15.
Artigo em Inglês | MEDLINE | ID: mdl-28881961

RESUMO

MOTIVATION: Biclustering has become a major tool for analyzing large datasets given as matrix of samples times features and has been successfully applied in life sciences and e-commerce for drug design and recommender systems, respectively. actor nalysis for cluster cquisition (FABIA), one of the most successful biclustering methods, is a generative model that represents each bicluster by two sparse membership vectors: one for the samples and one for the features. However, FABIA is restricted to about 20 code units because of the high computational complexity of computing the posterior. Furthermore, code units are sometimes insufficiently decorrelated and sample membership is difficult to determine. We propose to use the recently introduced unsupervised Deep Learning approach Rectified Factor Networks (RFNs) to overcome the drawbacks of existing biclustering methods. RFNs efficiently construct very sparse, non-linear, high-dimensional representations of the input via their posterior means. RFN learning is a generalized alternating minimization algorithm based on the posterior regularization method which enforces non-negative and normalized posterior means. Each code unit represents a bicluster, where samples for which the code unit is active belong to the bicluster and features that have activating weights to the code unit belong to the bicluster. RESULTS: On 400 benchmark datasets and on three gene expression datasets with known clusters, RFN outperformed 13 other biclustering methods including FABIA. On data of the 1000 Genomes Project, RFN could identify DNA segments which indicate, that interbreeding with other hominins starting already before ancestors of modern humans left Africa. AVAILABILITY AND IMPLEMENTATION: https://github.com/bioinf-jku/librfn. CONTACT: djork-arne.clevert@bayer.com or hochreit@bioinf.jku.at.


Assuntos
Biologia Computacional/métodos , Aprendizado de Máquina não Supervisionado , Evolução Molecular , Perfilação da Expressão Gênica/métodos , Genoma Humano , Genômica/métodos , Humanos , Análise de Sequência com Séries de Oligonucleotídeos/métodos
15.
Nucleic Acids Res ; 40(9): e69, 2012 May.
Artigo em Inglês | MEDLINE | ID: mdl-22302147

RESUMO

Quantitative analyses of next-generation sequencing (NGS) data, such as the detection of copy number variations (CNVs), remain challenging. Current methods detect CNVs as changes in the depth of coverage along chromosomes. Technological or genomic variations in the depth of coverage thus lead to a high false discovery rate (FDR), even upon correction for GC content. In the context of association studies between CNVs and disease, a high FDR means many false CNVs, thereby decreasing the discovery power of the study after correction for multiple testing. We propose 'Copy Number estimation by a Mixture Of PoissonS' (cn.MOPS), a data processing pipeline for CNV detection in NGS data. In contrast to previous approaches, cn.MOPS incorporates modeling of depths of coverage across samples at each genomic position. Therefore, cn.MOPS is not affected by read count variations along chromosomes. Using a Bayesian approach, cn.MOPS decomposes variations in the depth of coverage across samples into integer copy numbers and noise by means of its mixture components and Poisson distributions, respectively. The noise estimate allows for reducing the FDR by filtering out detections having high noise that are likely to be false detections. We compared cn.MOPS with the five most popular methods for CNV detection in NGS data using four benchmark datasets: (i) simulated data, (ii) NGS data from a male HapMap individual with implanted CNVs from the X chromosome, (iii) data from HapMap individuals with known CNVs, (iv) high coverage data from the 1000 Genomes Project. cn.MOPS outperformed its five competitors in terms of precision (1-FDR) and recall for both gains and losses in all benchmark data sets. The software cn.MOPS is publicly available as an R package at http://www.bioinf.jku.at/software/cnmops/ and at Bioconductor.


Assuntos
Variações do Número de Cópias de DNA , Sequenciamento de Nucleotídeos em Larga Escala , Análise de Sequência de DNA , Software , Cromossomos Humanos X/química , Projeto HapMap , Humanos , Masculino , Distribuição de Poisson
16.
Nat Commun ; 15(1): 5640, 2024 Jul 05.
Artigo em Inglês | MEDLINE | ID: mdl-38965235

RESUMO

The Structural Genomics Consortium is an international open science research organization with a focus on accelerating early-stage drug discovery, namely hit discovery and optimization. We, as many others, believe that artificial intelligence (AI) is poised to be a main accelerator in the field. The question is then how to best benefit from recent advances in AI and how to generate, format and disseminate data to enable future breakthroughs in AI-guided drug discovery. We present here the recommendations of a working group composed of experts from both the public and private sectors. Robust data management requires precise ontologies and standardized vocabulary while a centralized database architecture across laboratories facilitates data integration into high-value datasets. Lab automation and opening electronic lab notebooks to data mining push the boundaries of data sharing and data modeling. Important considerations for building robust machine-learning models include transparent and reproducible data processing, choosing the most relevant data representation, defining the right training and test sets, and estimating prediction uncertainty. Beyond data-sharing, cloud-based computing can be harnessed to build and disseminate machine-learning models. Important vectors of acceleration for hit and chemical probe discovery will be (1) the real-time integration of experimental data generation and modeling workflows within design-make-test-analyze (DMTA) cycles openly, and at scale and (2) the adoption of a mindset where data scientists and experimentalists work as a unified team, and where data science is incorporated into the experimental design.


Assuntos
Ciência de Dados , Descoberta de Drogas , Aprendizado de Máquina , Descoberta de Drogas/métodos , Ciência de Dados/métodos , Humanos , Inteligência Artificial , Disseminação de Informação/métodos , Mineração de Dados/métodos , Computação em Nuvem , Bases de Dados Factuais
17.
Nucleic Acids Res ; 39(12): e79, 2011 Jul.
Artigo em Inglês | MEDLINE | ID: mdl-21486749

RESUMO

Cost-effective oligonucleotide genotyping arrays like the Affymetrix SNP 6.0 are still the predominant technique to measure DNA copy number variations (CNVs). However, CNV detection methods for microarrays overestimate both the number and the size of CNV regions and, consequently, suffer from a high false discovery rate (FDR). A high FDR means that many CNVs are wrongly detected and therefore not associated with a disease in a clinical study, though correction for multiple testing takes them into account and thereby decreases the study's discovery power. For controlling the FDR, we propose a probabilistic latent variable model, 'cn.FARMS', which is optimized by a Bayesian maximum a posteriori approach. cn.FARMS controls the FDR through the information gain of the posterior over the prior. The prior represents the null hypothesis of copy number 2 for all samples from which the posterior can only deviate by strong and consistent signals in the data. On HapMap data, cn.FARMS clearly outperformed the two most prevalent methods with respect to sensitivity and FDR. The software cn.FARMS is publicly available as a R package at http://www.bioinf.jku.at/software/cnfarms/cnfarms.html.


Assuntos
Variações do Número de Cópias de DNA , Modelos Estatísticos , Análise de Sequência com Séries de Oligonucleotídeos/métodos , Software , Algoritmos , Alelos , Biologia Computacional , Polimorfismo de Nucleotídeo Único
18.
Chem Sci ; 14(12): 3235-3246, 2023 Mar 22.
Artigo em Inglês | MEDLINE | ID: mdl-36970100

RESUMO

Automated synthesis planning is key for efficient generative chemistry. Since reactions of given reactants may yield different products depending on conditions such as the chemical context imposed by specific reagents, computer-aided synthesis planning should benefit from recommendations of reaction conditions. Traditional synthesis planning software, however, typically proposes reactions without specifying such conditions, relying on human organic chemists who know the conditions to carry out suggested reactions. In particular, reagent prediction for arbitrary reactions, a crucial aspect of condition recommendation, has been largely overlooked in cheminformatics until recently. Here we employ the Molecular Transformer, a state-of-the-art model for reaction prediction and single-step retrosynthesis, to tackle this problem. We train the model on the US patents dataset (USPTO) and test it on Reaxys to demonstrate its out-of-distribution generalization capabilities. Our reagent prediction model also improves the quality of product prediction: the Molecular Transformer is able to substitute the reagents in the noisy USPTO data with reagents that enable product prediction models to outperform those trained on plain USPTO. This makes it possible to improve upon the state-of-the-art in reaction product prediction on the USPTO MIT benchmark.

19.
Nat Rev Drug Discov ; 22(11): 895-916, 2023 11.
Artigo em Inglês | MEDLINE | ID: mdl-37697042

RESUMO

Developments in computational omics technologies have provided new means to access the hidden diversity of natural products, unearthing new potential for drug discovery. In parallel, artificial intelligence approaches such as machine learning have led to exciting developments in the computational drug design field, facilitating biological activity prediction and de novo drug design for molecular targets of interest. Here, we describe current and future synergies between these developments to effectively identify drug candidates from the plethora of molecules produced by nature. We also discuss how to address key challenges in realizing the potential of these synergies, such as the need for high-quality datasets to train deep learning algorithms and appropriate strategies for algorithm validation.


Assuntos
Inteligência Artificial , Produtos Biológicos , Humanos , Algoritmos , Aprendizado de Máquina , Descoberta de Drogas , Desenho de Fármacos , Produtos Biológicos/farmacologia
20.
J Chem Theory Comput ; 18(8): 5068-5078, 2022 Aug 09.
Artigo em Inglês | MEDLINE | ID: mdl-35837736

RESUMO

Existing computational methods for estimating pKa values in proteins rely on theoretical approximations and lengthy computations. In this work, we use a data set of 6 million theoretically determined pKa shifts to train deep learning models, which are shown to rival the physics-based predictors. These neural networks managed to infer the electrostatic contributions of different chemical groups and learned the importance of solvent exposure and close interactions, including hydrogen bonds. Although trained only using theoretical data, our pKAI+ model displayed the best accuracy in a test set of ∼750 experimental values. Inference times allow speedups of more than 1000× compared to physics-based methods. By combining speed, accuracy, and a reasonable understanding of the underlying physics, our models provide a game-changing solution for fast estimations of macroscopic pKa values from ensembles of microscopic values as well as for many downstream applications such as molecular docking and constant-pH molecular dynamics simulations.


Assuntos
Aprendizado Profundo , Simulação de Acoplamento Molecular , Simulação de Dinâmica Molecular , Proteínas/química , Eletricidade Estática
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA