ABSTRACT
BACKGROUND: Liquid-liquid phase separation (LLPS) of biomolecules plays a central role in various biological phenomena and has garnered significant attention. LLPS behavior is strongly influenced by the characteristics of RNAs and by environmental factors such as pH and temperature, as well as by the properties of proteins. Recently, several databases recording LLPS-related biomolecules have been established, and prediction models of LLPS-related phenomena have been explored using these databases. However, a prediction model that concurrently considers proteins, RNAs, and experimental conditions has not been developed because public databases provide only limited information on individual experiments. RESULTS: To address this challenge, we constructed a new dataset, RNAPSEC, which treats each experiment as a data point. The dataset was compiled by manually collecting data from the published literature. Using RNAPSEC, we developed two prediction models that consider a protein, an RNA, and experimental conditions. The first model predicts the LLPS behavior of a protein and RNA under given experimental conditions. The second model predicts the conditions required for a given protein and RNA to undergo LLPS. CONCLUSIONS: RNAPSEC and these prediction models are expected to accelerate our understanding of the roles of proteins, RNAs, and environmental factors in LLPS.
Subject(s)
Intrinsically Disordered Proteins , RNA , RNA/genetics , Intrinsically Disordered Proteins/chemistry
ABSTRACT
Deep learning systems (DLSs) have been developed for the histopathological assessment of various types of tumors, but none are suitable for differential diagnosis between follicular thyroid carcinoma (FTC) and follicular adenoma (FA). Furthermore, whether DLSs can identify the malignant characteristics of thyroid tumors based only on random views of tumor tissue histology has not been evaluated. In this study, we developed DLSs able to differentiate between FTC and FA based on 3 types of convolutional neural network architecture: EfficientNet, VGG16, and ResNet50. The performance of all 3 DLSs was excellent (area under the receiver operating characteristic curve = 0.91 ± 0.04; F1 score = 0.82 ± 0.06). Visual explanations using gradient-weighted class activation mapping suggested that the diagnosis of both FTC and FA was largely dependent on nuclear features. The DLSs were then trained with FTC images and linked information (presence or absence of recurrence within 10 years, vascular invasion, and wide capsular invasion). The ability of the DLSs to diagnose these characteristics was then determined. The results showed that, based on the random views of histology, the DLSs could predict the risk of FTC recurrence, vascular invasion, and wide capsular invasion with a certain level of accuracy (area under the receiver operating characteristic curve = 0.67 ± 0.13, 0.62 ± 0.11, and 0.65 ± 0.09, respectively). Further improvement of our DLSs could lead to the establishment of automated differential diagnosis systems requiring only biopsy specimens.
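The two metrics reported above, F1 score and area under the receiver operating characteristic curve (AUC), can be computed from scratch as in the following minimal sketch. The label convention (1 = FTC, 0 = FA), the function names, and the toy scores are assumptions made for illustration; none of them come from the study.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = FTC, 0 = FA (assumed convention, not study data).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
predictions = [1 if s >= 0.5 else 0 for s in scores]

print(f"AUC = {roc_auc(labels, scores):.3f}, F1 = {f1_score(labels, predictions):.3f}")
```

AUC is threshold-free, whereas F1 depends on the cutoff used to turn scores into class labels (0.5 in this toy example).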
Subject(s)
Adenocarcinoma, Follicular , Adenoma , Deep Learning , Thyroid Neoplasms , Humans , Diagnosis, Differential , Thyroid Neoplasms/diagnosis , Thyroid Neoplasms/pathology , Adenocarcinoma, Follicular/diagnosis , Adenocarcinoma, Follicular/pathology , Adenoma/diagnosis , Adenoma/pathology
ABSTRACT
Early disease detection and prevention methods based on effective interventions are gaining attention worldwide. Progress in precision medicine has revealed that substantial heterogeneity exists in health data at the individual level and that complex health factors are involved in chronic disease development. Machine-learning techniques have enabled precise personal-level disease prediction by capturing individual differences in multivariate data. However, it is challenging to identify what aspects should be improved for disease prevention based on future disease-onset prediction because of the complex relationships among multiple biomarkers. Here, we present a health-disease phase diagram (HDPD) that represents an individual's health state by visualizing the future-onset boundary values of multiple biomarkers that fluctuate early in the disease progression process. In HDPDs, future-onset predictions are represented by perturbing multiple biomarker values while accounting for dependencies among variables. We constructed HDPDs for 11 diseases using longitudinal health checkup cohort data of 3,238 individuals, comprising 3,215 measurement items and genetic data. The improvement of biomarker values to the non-onset region in HDPD remarkably prevented future disease onset in 7 out of 11 diseases. HDPDs can represent individual physiological states in the onset process and be used as intervention goals for disease prevention.
Subject(s)
Machine Learning , Precision Medicine , Humans , Biomarkers , Health
ABSTRACT
In chemistry and materials science, researchers and engineers discover, design, and optimize chemical compounds or materials with their professional knowledge and techniques. At the highest level of abstraction, this process is formulated as black-box optimization. For instance, the trial-and-error process of synthesizing various molecules for better material properties can be regarded as optimizing a black-box function describing the relation between a chemical formula and its properties. Various black-box optimization algorithms have been developed in the machine learning and statistics communities. Recently, a number of researchers have reported successful applications of such algorithms to chemistry. They include the design of photofunctional molecules and medical drugs, optimization of thermal emission materials and high Li-ion conductive solid electrolytes, and discovery of a new phase in inorganic thin films for solar cells.
There are a wide variety of algorithms available for black-box optimization, such as Bayesian optimization, reinforcement learning, and active learning. Practitioners need to select an appropriate algorithm or, in some cases, develop novel algorithms to meet their demands. It is also necessary to determine how to best combine machine learning techniques with quantum mechanics- and molecular mechanics-based simulations, and experiments. In this Account, we give an overview of recent studies regarding automated discovery, design, and optimization based on black-box optimization. The Account covers the following algorithms: Bayesian optimization to optimize the chemical or physical properties, an optimization method using a quantum annealer, best-arm identification, gray-box optimization, and reinforcement learning. In addition, we introduce active learning and boundless objective-free exploration, which may not fall into the category of black-box optimization.
Data quality and quantity are key for the success of these automated discovery techniques. As laboratory automation and robotics are put forward, automated discovery algorithms would be able to match human performance at least in some domains in the near future.
ABSTRACT
Obtaining observable physical or molecular properties, such as ionization potential and fluorescence wavelength, from quantum chemical (QC) computation requires multi-step computation directed by a human. Hence, automating the multi-step computational process and making it a black box that anybody can use is important for effective database construction and for fast, realistic material design within the framework of black-box optimization, where machine learning algorithms serve as predictors. Here, we propose a Python library, QCforever, to automate the computation of some molecular properties and chemical phenomena induced by molecules. The tool requires only a molecule file as input; it automates both the computation of molecular properties (ionization potential, fluorescence, etc.) and the output analysis, returning multiple values for evaluating a molecule. By incorporating the tool into black-box optimization, we can explore molecules that have the desired properties within the limitations of QC computation.
Subject(s)
Algorithms , Machine Learning , Databases, Factual , Humans
ABSTRACT
Designing highly selective molecules for a drug target protein is a challenging task in drug discovery. This task can be regarded as a multiobjective problem that simultaneously satisfies criteria for various objectives, such as selectivity for a target protein, pharmacokinetic endpoints, and drug-like indices. Recent breakthroughs in artificial intelligence have accelerated the development of molecular structure generation methods, and various researchers have applied them to computational drug designs and successfully proposed promising drug candidates. However, designing efficient selective inhibitors with releasing activities against various homologs of a target protein remains a difficult issue. In this study, we developed a de novo structure generator based on reinforcement learning that is capable of simultaneously optimizing multiobjective problems. Our structure generator successfully proposed selective inhibitors for tyrosine kinases while optimizing 18 objectives consisting of inhibitory activities against 9 tyrosine kinases, 3 pharmacokinetics endpoints, and 6 other important properties. These results show that our structure generator and optimization strategy for selective inhibitors will contribute to the further development of practical structure generators for drug designs.
Subject(s)
Artificial Intelligence , Monte Carlo Method , Drug Design , Tyrosine
ABSTRACT
Femtosecond X-ray pulse lasers are promising probes for the elucidation of the multiconformational states of biomolecules because they enable snapshots of single biomolecules to be observed as coherent diffraction images. Multi-image processing using an X-ray free-electron laser has proven to be a successful structural analysis method for viruses. However, single-particle analysis (SPA) of flexible biomolecules with sizes ≤100 nm remains difficult. Owing to the multiconformational states of biomolecules and the noisy character of diffraction images, diffraction image improvement by multi-image processing is often ineffective for such molecules. Herein, a single-image super-resolution (SR) model was constructed using an SR convolutional neural network (SRCNN). Data preparation was performed in silico to reflect actual observation conditions, with unknown molecular orientations and fluctuations in molecular structure and incident X-ray intensity. It was demonstrated that the trained SRCNN model improved the single-particle diffraction image quality to a level corresponding to an image observed with an incident X-ray intensity approximately three to seven times higher than the original, while retaining the individuality of the diffraction images. The feasibility of SPA for flexible biomolecules with sizes ≤100 nm was dramatically increased by introducing the SRCNN improvement at the beginning of the various structural analysis schemes.
Subject(s)
Image Processing, Computer-Assisted , Neural Networks, Computer , Image Processing, Computer-Assisted/methods , Lasers , X-Ray Diffraction
ABSTRACT
Computer-aided synthesis planning (CASP) aims to assist chemists in performing retrosynthetic analysis for which they utilize their experiments, intuition, and knowledge. Recent breakthroughs in machine learning (ML) techniques, including deep neural networks, have significantly improved data-driven synthetic route designs without human intervention. However, learning chemical knowledge by ML for practical synthesis planning has not yet been adequately achieved and remains a challenging problem. In this study, we developed a data-driven CASP application integrated with various portions of retrosynthesis knowledge called "ReTReK" that introduces the knowledge as adjustable parameters into the evaluation of promising search directions. The experimental results showed that ReTReK successfully searched synthetic routes based on the specified retrosynthesis knowledge, indicating that the synthetic routes searched with the knowledge were preferred to those without the knowledge. The concept of integrating retrosynthesis knowledge as adjustable parameters into a data-driven CASP application is expected to enhance the performance of both existing data-driven CASP applications and those under development.
Subject(s)
Machine Learning , Neural Networks, Computer , Humans , Software
ABSTRACT
Recently, artificial intelligence (AI)-enabled de novo molecular generators (DNMGs) have automated molecular design based on data-driven or simulation-based property estimates. In some domains like the game of Go where AI surpassed human intelligence, humans are trying to learn from AI about the best strategy of the game. To understand DNMG's strategy of molecule optimization, we propose an algorithm called characteristic functional group monitoring (CFGM). Given a time series of generated molecules, CFGM monitors statistically enriched functional groups in comparison to the training data. In the task of absorption wavelength maximization of pure organic molecules (consisting of H, C, N, and O), we successfully identified a strategic change from diketone and aniline derivatives to quinone derivatives. In addition, CFGM led us to a hypothesis that 1,2-quinone is an unconventional chromophore, which was verified with chemical synthesis. This study shows the possibility that human experts can learn from DNMGs to expand their ability to discover functional molecules.
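The idea of monitoring functional groups that are statistically enriched relative to the training data can be illustrated with a small sketch. The abstract does not state which statistical test CFGM uses, so the one-sided Fisher's exact test below is an assumption, as are the function names and the toy data. Functional-group detection is also assumed to have been done beforehand, so each molecule is represented simply as a set of group labels.

```python
from math import comb
from typing import Dict, Sequence, Set


def enrichment_p_value(k_gen: int, n_gen: int, k_train: int, n_train: int) -> float:
    """One-sided Fisher's exact test: P(X >= k_gen) under the hypergeometric null.

    k_gen of n_gen generated molecules and k_train of n_train training
    molecules contain the functional group of interest.
    """
    total = n_gen + n_train
    with_group = k_gen + k_train
    upper = min(with_group, n_gen)
    tail = sum(
        comb(with_group, x) * comb(total - with_group, n_gen - x)
        for x in range(k_gen, upper + 1)
    )
    return tail / comb(total, n_gen)


def enriched_groups(
    generated: Sequence[Set[str]],
    training: Sequence[Set[str]],
    alpha: float = 0.05,
) -> Dict[str, float]:
    """Return {group: p-value} for groups over-represented in `generated`."""
    result: Dict[str, float] = {}
    for group in sorted(set().union(*generated)):
        k_gen = sum(group in mol for mol in generated)
        k_train = sum(group in mol for mol in training)
        p = enrichment_p_value(k_gen, len(generated), k_train, len(training))
        if p < alpha:
            result[group] = p
    return result


# Hypothetical toy data: each molecule is a set of pre-detected group labels.
generated = [{"quinone"}, {"quinone", "aniline"}, {"quinone"}, {"quinone"}]
training = [{"aniline"}, {"diketone"}, {"aniline"}, {"diketone"}, {"aniline"}, set()]

print(enriched_groups(generated, training))
```

Applying `enriched_groups` to successive windows of the generation time series would show when the set of enriched groups changes, which is the kind of strategic shift described above.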
ABSTRACT
Stimuli-responsive polymers with complicated but controllable shape-morphing behaviors are critically desirable in several engineering fields. Among the various shape-morphing materials, cross-linked polymers with exchangeable bonds in dynamic network topology can undergo on-demand geometric change via solid-state plasticity while maintaining the advantageous properties of cross-linked polymers. However, these dynamic polymers are susceptible to creep deformation that results in their dimensional instability, a highly undesirable drawback that limits their service longevity and applications. Inspired by the natural ice strategy, which realizes creep reduction using crystal structure transformation, we evaluate a dynamic cross-linked polymer with tunable creep behavior through topological alternation. This alternation mechanism uses the thermally triggered disulfide-ene reaction to convert the network topology from dynamic to static in a polymerized bulk material. Thus, such a dynamic polymer can exhibit topological rearrangement for thermal plasticity at 130°C, resembling typical dynamic cross-linked polymers. Following the topological alternation at 180°C, the formation of a static topology reduces creep deformation by more than 85% in the same polymer. Owing to temperature-dependent selectivity, our cross-linked polymer exhibits a shape-morphing ability while enhancing its creep resistance for dimensional stability and service longevity after sequential topological alternation. Our approach enriches the design of dynamic covalent polymers, which potentially expands their utility in fabricating geometrically sophisticated multifunctional devices.
ABSTRACT
Recently, molecular generation models based on deep learning have attracted significant attention in drug discovery. However, most existing molecular generation models have serious limitations in the context of drug design wherein they do not sufficiently consider the effect of the three-dimensional (3D) structure of the target protein in the generation process. In this study, we developed a new deep learning-based molecular generator, SBMolGen, that integrates a recurrent neural network, a Monte Carlo tree search, and docking simulations. The results of an evaluation using four target proteins (two kinases and two G protein-coupled receptors) showed that the generated molecules had a better binding affinity score (docking score) than the known active compounds, and the generated molecules possessed a broader chemical space distribution. SBMolGen not only generates novel binding active molecules but also presents 3D docking poses with target proteins, which will be useful in subsequent drug design. The code is available at https://github.com/clinfo/SBMolGen.
Subject(s)
Artificial Intelligence , Neural Networks, Computer , Drug Design , Drug Discovery , Molecular Docking Simulation , Proteins
ABSTRACT
In the two-alternative forced-choice (2AFC) paradigm, manual responses such as pointing have been widely used as measures to estimate cognitive abilities. While pointing measurements can be easily collected, coded, analyzed, and interpreted, absent responses are often observed particularly when adopting these measures for toddler studies, which leads to an increase of missing data. Although looking responses such as preferential looking can be available as alternative measures in such cases, it is unknown how well looking measurements can be interpreted as equivalent to manual ones. This study aimed to answer this question by investigating how accurately pointing responses (i.e., left or right) could be predicted from concurrent preferential looking. Using pre-existing videos of toddlers aged 18-23 months engaged in an intermodal word comprehension task, we developed models predicting manual from looking responses. Results showed substantial prediction accuracy for both the Simple Majority Vote and Machine Learning-Based classifiers, which indicates that looking responses would be reasonable alternative measures of manual ones. However, the further exploratory analysis revealed that when applying the created models for data of toddlers who did not produce clear pointing responses, the estimation agreement of missing pointing between the models and the human coders slightly dropped. This indicates that looking responses without pointing were qualitatively different from those with pointing. Bridging two measurements in forced-choice tasks would help researchers avoid wasting collected data due to the absence of manual responses and interpret results from different modalities comprehensively.
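A simple majority vote over looking data can be sketched as follows. The frame-level coding scheme ("L", "R", and anything else for looks away), the tie rule, and the function names are assumptions made for illustration; the abstract does not describe how the study's classifier was implemented.

```python
from collections import Counter
from typing import List, Optional, Sequence


def majority_vote(gaze_frames: Sequence[str]) -> Optional[str]:
    """Predict the pointing side from frame-by-frame gaze codes.

    Frames coded "L" or "R" vote for that side; any other code (for example
    "A" for looking away) is ignored. A tie yields None (no prediction).
    """
    counts = Counter(frame for frame in gaze_frames if frame in ("L", "R"))
    if counts["L"] == counts["R"]:
        return None
    return "L" if counts["L"] > counts["R"] else "R"


def prediction_accuracy(trials: Sequence[Sequence[str]], pointing: Sequence[str]) -> float:
    """Fraction of trials whose majority-vote prediction matches the coded pointing."""
    if len(trials) != len(pointing) or not trials:
        raise ValueError("trials and pointing must be non-empty and equally long")
    hits = sum(1 for frames, side in zip(trials, pointing) if majority_vote(frames) == side)
    return hits / len(trials)


# Hypothetical toy trials (not study data): gaze codes per video frame.
trials: List[List[str]] = [["L", "L", "R"], ["R", "R", "A"], ["L", "R"]]
pointing = ["L", "R", "R"]

print(prediction_accuracy(trials, pointing))
```

Treating ties as "no prediction" keeps the vote conservative; a trial without a clear looking preference then counts as a miss instead of being assigned to an arbitrary side.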
Subject(s)
Child Behavior/physiology , Child Development/physiology , Choice Behavior/physiology , Fixation, Ocular/physiology , Gestures , Neuropsychological Tests , Psychometrics , Child, Preschool , Female , Humans , Infant , Male , Neuropsychological Tests/standards , Psychometrics/standards
ABSTRACT
Biomolecular imaging using X-ray free-electron lasers (XFELs) has been successfully applied to serial femtosecond crystallography. However, the application of single-particle analysis for structure determination using XFELs with 100 nm or smaller biomolecules has two practical problems: the incomplete diffraction data sets for reconstructing 3D assembled structures and the heterogeneous conformational states of samples. A new diffraction template matching method is thus presented here to retrieve a plausible 3D structural model based on single noisy target diffraction patterns, assuming candidate structures. Two concepts are introduced here: prompt candidate diffraction, generated by enhanced sampled coarse-grain (CG) candidate structures, and efficient molecular orientation searching for matching based on Bayesian optimization. A CG model-based diffraction-matching protocol is proposed that achieves a 100-fold speed increase compared to exhaustive diffraction matching using an all-atom model. The conditions that enable multiconformational analysis were also investigated by simulated diffraction data for various conformational states of chromatin and ribosomes. The proposed method can enable multiconformational analysis, with a structural resolution of at least 20 Å for 270-800 Å flexible biomolecules, in experimental single-particle structure analyses that employ XFELs.
Subject(s)
Lasers , Single Molecule Imaging , Bayes Theorem , Crystallography , Molecular Conformation , X-Ray Diffraction
ABSTRACT
Nuclear magnetic resonance (NMR) spectroscopy is an effective tool for identifying molecules in a sample. Although many previously observed NMR spectra are accumulated in public databases, they cover only a tiny fraction of the chemical space, and molecule identification is typically accomplished manually based on expert knowledge. Herein, we propose NMR-TS, a machine-learning-based Python library, to automatically identify a molecule from its NMR spectrum. NMR-TS discovers candidate molecules whose NMR spectra match the target spectrum by using deep learning and density functional theory (DFT)-computed spectra. As a proof of concept, we identify prototypical metabolites from their computed spectra. After an average of 5451 DFT runs for each spectrum, six of the nine molecules are identified correctly, and proximal molecules are obtained in the other cases. This encouraging result implies that de novo molecule generation can contribute to the fully automated identification of chemical structures. NMR-TS is available at https://github.com/tsudalab/NMR-TS.
ABSTRACT
Motivation: Fast and accurate prediction of protein-ligand binding structures is indispensable for structure-based drug design and accurate estimation of binding free energy of drug candidate molecules in drug discovery. Recently, methods based on short molecular dynamics (MD) simulations, such as MM-PBSA and MM-GBSA, have been used to select the correct pose accurately from among generated docking poses. Since molecular structures obtained from MD simulation depend on the initial condition, taking the average over different initial conditions leads to better accuracy. Prediction accuracy of protein-ligand binding poses can therefore be improved with multiple runs using different initial velocities. Results: This paper shows that a machine learning method, called Best Arm Identification, can optimally control the number of MD runs for each binding pose. It allows us to identify a correct binding pose with a minimum number of total runs. Our experiment using three proteins and eight inhibitors showed that the computational cost can be reduced substantially without sacrificing accuracy. This method can be applied for controlling all kinds of molecular simulations to obtain the best results under restricted computational resources. Availability and implementation: Code and data are available on GitHub at https://github.com/tsudalab/bpbi. Contact: terayama@cbms.k.u-tokyo.ac.jp or tsuda@k.u-tokyo.ac.jp. Supplementary information: Supplementary data are available at Bioinformatics online.
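The way best arm identification allocates simulation runs can be illustrated with a minimal successive-elimination sketch. This is one standard best-arm algorithm chosen for illustration, not necessarily the one implemented in the authors' code. The pose names, toy energies, noise level, and the `radius_scale` tuning parameter are all hypothetical. Each "arm" is a docking pose, each pull is one noisy scoring run, and lower scores are treated as better.

```python
import math
import random
from typing import Callable, Dict, List, Sequence, Tuple


def successive_elimination(
    poses: Sequence[str],
    run_once: Callable[[str], float],
    max_rounds: int = 50,
    radius_scale: float = 5.0,
) -> Tuple[str, Dict[str, int]]:
    """Return the pose with the lowest mean score and the runs spent per pose."""
    if not poses or max_rounds < 1:
        raise ValueError("need at least one pose and one round")
    scores: Dict[str, List[float]] = {p: [] for p in poses}
    active = list(poses)
    for r in range(1, max_rounds + 1):
        # One more run for every pose that is still a candidate.
        for p in active:
            scores[p].append(run_once(p))
        means = {p: sum(scores[p]) / len(scores[p]) for p in active}
        # Confidence radius shrinks as the number of runs per pose (r) grows.
        radius = radius_scale * math.sqrt(math.log(4 * len(poses) * r * r) / (2 * r))
        best_mean = min(means.values())
        # Keep only poses whose interval still overlaps that of the current best.
        active = [p for p in active if means[p] - radius <= best_mean + radius]
        if len(active) == 1:
            break
    best = min(active, key=lambda p: means[p])
    return best, {p: len(v) for p, v in scores.items()}


# Hypothetical toy energies (lower is better); not data from the paper.
TRUE_ENERGY = {"pose_a": -30.0, "pose_b": -25.0, "pose_c": -10.0}
rng = random.Random(0)


def noisy_run(pose: str) -> float:
    """Stand-in for one short MD-based scoring run with run-to-run noise."""
    return TRUE_ENERGY[pose] + rng.gauss(0.0, 2.0)


best_pose, runs_per_pose = successive_elimination(list(TRUE_ENERGY), noisy_run)
print(best_pose, runs_per_pose)
```

Clearly poor poses are dropped after a few runs, so most of the budget goes to distinguishing the close competitors. A uniform allocation would spend the same number of runs on every pose.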
Subject(s)
Drug Discovery/methods , Ligands , Machine Learning , Molecular Dynamics Simulation , Proteins/chemistry , Computational Biology/methods , Protein Binding , Protein Conformation , Proteins/metabolism
ABSTRACT
Recently, many research groups have been addressing data-driven approaches for (retro)synthetic reaction prediction and retrosynthetic analysis. Although the performance of data-driven approaches has progressed because of recent advances in machine learning and deep learning techniques, problems such as limited reaction prediction capability and the black-box nature of neural networks persist for practical use by chemists. To spread data-driven approaches to chemists, we focused on two challenges: improvement of retrosynthetic reaction prediction and interpretability of the prediction. In this paper, to address these challenges, we propose an interpretable prediction framework that uses graph convolutional networks (GCN) for retrosynthetic reaction prediction and integrated gradients (IG) for visualizing contributions to the prediction. In terms of balanced accuracy, our model outperformed the approach using an extended-connectivity fingerprint. Furthermore, IG-based visualization of the GCN prediction successfully highlighted reaction-related atoms.
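Integrated gradients attributes a prediction F(x) to each input feature by integrating the gradient of F along a straight path from a baseline x' to x: IG_i = (x_i - x'_i) * ∫ ∂F/∂x_i (x' + α(x - x')) dα over α in [0, 1]. The sketch below approximates this integral with a midpoint Riemann sum on a toy function. The finite-difference gradient stands in for the automatic differentiation a real GCN would use, and the function names and toy model are assumptions for illustration only.

```python
from typing import Callable, List, Sequence

Model = Callable[[Sequence[float]], float]


def numerical_gradient(f: Model, x: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Central finite-difference gradient (assumed stand-in for autodiff)."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def integrated_gradients(
    f: Model, x: Sequence[float], baseline: Sequence[float], steps: int = 50
) -> List[float]:
    """Midpoint-rule approximation of integrated gradients for each feature."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = numerical_gradient(f, point)
        total = [t + g for t, g in zip(total, grad)]
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


def toy_model(v: Sequence[float]) -> float:
    """Hypothetical differentiable model: an interaction term plus a linear term."""
    return v[0] * v[1] + 2.0 * v[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(toy_model, x, baseline)
print(attributions)
```

A useful sanity check is the completeness property: the attributions should sum to F(x) - F(x'), up to numerical error. In the retrosynthesis setting the features would be atom-level inputs, so large attributions mark the atoms that drive the prediction.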
Subject(s)
Chemistry Techniques, Synthetic , Computer Graphics , Neural Networks, Computer
ABSTRACT
Computational techniques for accurate and efficient prediction of protein-protein complex structures are widely used for elucidating protein-protein interactions, which play important roles in biological systems. Recently, it has been reported that selecting a structure similar to the native structure among generated structure candidates (decoys) is possible by calculating binding free energies of the decoys based on all-atom molecular dynamics (MD) simulations with explicit solvent and the solution theory in the energy representation, a method called evERdock. A recent version of evERdock achieves higher-accuracy decoy selection by introducing MD relaxation and multiple MD simulations/energy calculations; however, a huge computational cost is required. In this paper, we propose an efficient decoy selection method using evERdock and the best arm identification (BAI) framework, which is one of the techniques of reinforcement learning. The BAI framework realizes an efficient selection by suppressing calculations for nonpromising decoys and preferentially calculating for the promising ones. We evaluate the performance of the proposed method for decoy selection problems of three protein-protein complex systems. The results show that computational costs are successfully reduced by a factor of 4.05 (in the best case) compared to a standard decoy selection approach without sacrificing accuracy.
Subject(s)
Machine Learning , Molecular Dynamics Simulation , Proteins/chemistry , Protein Binding , Protein Conformation
ABSTRACT
Protein-drug binding mode prediction from the apo-protein structure is challenging because drug binding often induces significant protein conformational changes. Here, the authors report a computational workflow that incorporates a novel pocket generation method. First, the closed protein pocket is expanded by repeatedly filling it with virtual atoms during molecular dynamics (MD) simulations. Second, after ligand docking toward the prepared pocket structures, binding mode candidates are ranked by MD/Molecular Mechanics Poisson-Boltzmann Surface Area. The authors validated the workflow using CDK2 kinase, which has an especially closed ATP-binding pocket in the apo-form, and several inhibitors. The crystallographic pose coincided with the top-ranked docking pose for 59% (34/58) of the compounds and was within the top five-ranked ones for 88% (51/58), while those estimated by a conventional prediction protocol were 9% (5/58) and 50% (29/58), respectively. The study demonstrates that the prediction accuracy is significantly improved by preceding pocket expansion, leading to generation of conformationally diverse binding mode candidates. © 2018 Wiley Periodicals, Inc.
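The top-1 and top-5 figures quoted above are success rates over per-compound ranks. The following sketch shows how such rates could be computed and checks that the quoted percentages match the stated counts. The `top_k_success` helper and the example ranks are hypothetical; only the counts (34, 51, 5, and 29 out of 58) come from the abstract.

```python
from typing import Optional, Sequence


def top_k_success(ranks: Sequence[Optional[int]], k: int) -> float:
    """Fraction of compounds whose crystallographic pose is ranked within the top k.

    Each entry is the 1-based rank of the crystal-like pose for one compound,
    or None if no generated candidate reproduced it.
    """
    if not ranks:
        raise ValueError("ranks must not be empty")
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)


def percent(hits: int, total: int) -> int:
    """Success rate rounded to a whole percentage."""
    return round(100 * hits / total)


# Hypothetical ranks for four compounds (illustration only).
example_ranks = [1, 3, 6, None]
print(top_k_success(example_ranks, 1), top_k_success(example_ranks, 5))

# Counts reported in the abstract: proposed workflow, then conventional protocol.
print(percent(34, 58), percent(51, 58), percent(5, 58), percent(29, 58))
```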
Subject(s)
Cyclin-Dependent Kinase 2/chemistry , Molecular Dynamics Simulation , Protein Kinase Inhibitors/chemistry , Binding Sites , Cyclin-Dependent Kinase 2/antagonists & inhibitors , Humans , Ligands , Models, Molecular , Molecular Structure , Protein Kinase Inhibitors/pharmacology
ABSTRACT
Automatic design of organic materials requires black-box optimization in a vast chemical space. In conventional molecular design algorithms, a molecule is built as a combination of predetermined fragments. Recently, deep neural network models such as variational autoencoders and recurrent neural networks (RNNs) have been shown to be effective in de novo design of molecules without any predetermined fragments. This paper presents a novel Python library, ChemTS, that explores the chemical space by combining Monte Carlo tree search and an RNN. In a benchmarking problem of optimizing the octanol-water partition coefficient and synthesizability, our algorithm showed superior efficiency in finding high-scoring molecules. ChemTS is available at https://github.com/tsudalab/ChemTS.
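The search strategy described above, Monte Carlo tree search over partially built strings with rollouts that complete them, can be sketched on a toy problem. This is not the ChemTS API. A uniform random rollout stands in for the trained RNN, a toy reward stands in for the partition-coefficient and synthesizability score, and all names (`Node`, `uct`, `mcts`, `toy_reward`) are assumptions made for illustration.

```python
import math
import random
from typing import Callable, Dict, Optional, Sequence, Tuple


class Node:
    """A search-tree node holding one partial string (prefix)."""

    def __init__(self, prefix: str, parent: Optional["Node"] = None) -> None:
        self.prefix = prefix
        self.parent = parent
        self.children: Dict[str, "Node"] = {}
        self.visits = 0
        self.value_sum = 0.0


def uct(child: Node, parent_visits: int, c: float) -> float:
    """UCB1 score: mean reward plus an exploration bonus."""
    if child.visits == 0:
        return math.inf
    mean = child.value_sum / child.visits
    return mean + c * math.sqrt(2 * math.log(parent_visits) / child.visits)


def mcts(
    alphabet: Sequence[str],
    length: int,
    reward: Callable[[str], float],
    iterations: int = 200,
    c: float = 1.0,
    seed: int = 0,
) -> Tuple[str, float]:
    """Return the best complete string found and its reward."""
    if length < 1 or not alphabet:
        raise ValueError("need a non-empty alphabet and length >= 1")
    rng = random.Random(seed)
    root = Node("")
    best, best_reward = "", -math.inf
    for _ in range(iterations):
        node = root
        # Selection: follow the highest UCT score while nodes are fully expanded.
        while len(node.prefix) < length and len(node.children) == len(alphabet):
            parent = node
            node = max(parent.children.values(), key=lambda ch: uct(ch, parent.visits, c))
        # Expansion: add one untried symbol.
        if len(node.prefix) < length:
            untried = [s for s in alphabet if s not in node.children]
            symbol = rng.choice(untried)
            child = Node(node.prefix + symbol, node)
            node.children[symbol] = child
            node = child
        # Rollout: complete the string at random (stand-in for RNN sampling).
        tail = "".join(rng.choice(alphabet) for _ in range(length - len(node.prefix)))
        candidate = node.prefix + tail
        r = reward(candidate)
        if r > best_reward:
            best, best_reward = candidate, r
        # Backpropagation: update statistics up to the root.
        current: Optional[Node] = node
        while current is not None:
            current.visits += 1
            current.value_sum += r
            current = current.parent
    return best, best_reward


def toy_reward(s: str) -> float:
    """Hypothetical score: fraction of 'C' symbols in the string."""
    return s.count("C") / len(s)


best_string, best_score = mcts("CNO", 4, toy_reward)
print(best_string, best_score)
```

In a molecular setting the symbols would be SMILES tokens, and the reward would call a property predictor or simulation. The exploration constant `c` trades off revisiting promising prefixes against trying rarely visited ones.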
ABSTRACT
The seafloor is inhabited by a large number of benthic invertebrates, and their importance in mediating carbon mineralization and biogeochemical cycles is recognized. However, the majority of fauna live below the sediment surface, so most means of survey rely on destructive sampling methods that are limited to documenting species presence rather than event-driven activity and functionally important aspects of species behaviour. We have developed and tested a laboratory-based three-dimensional acoustic coring system that is capable of non-invasively visualizing the presence and activity of invertebrates within the sediment matrix. Here, we present reconstructed three-dimensional acoustic images of the sediment profile, with strong backscatter revealing the presence and position of individual benthic organisms. These data were used to train a three-dimensional convolutional neural network model and, using a combination of data augmentation and data correction techniques, we were able to identify individual species with 88% accuracy. Combining three-dimensional acoustic coring with deep learning forms an effective and non-invasive means of providing detailed mechanistic information on in situ species-sediment interactions, opening new opportunities to quantify species-specific contributions to ecosystems.