ABSTRACT
Artificial intelligence (AI) is being increasingly integrated into scientific discovery to augment and accelerate research, helping scientists to generate hypotheses, design experiments, collect and interpret large datasets, and gain insights that might not have been possible using traditional scientific methods alone. Here we examine breakthroughs over the past decade that include self-supervised learning, which allows models to be trained on vast amounts of unlabelled data, and geometric deep learning, which leverages knowledge about the structure of scientific data to enhance model accuracy and efficiency. Generative AI methods can create designs, such as small-molecule drugs and proteins, by analysing diverse data modalities, including images and sequences. We discuss how these methods can help scientists throughout the scientific process and the central issues that remain despite such advances. Both developers and users of AI tools need a better understanding of when such approaches need improvement, and challenges posed by poor data quality and stewardship remain. These issues cut across scientific disciplines and require developing foundational algorithmic approaches that can contribute to scientific understanding or acquire it autonomously, making them critical areas of focus for AI innovation.
Subjects
Artificial Intelligence , Research Design , Artificial Intelligence/standards , Artificial Intelligence/trends , Datasets as Topic , Deep Learning , Research Design/standards , Research Design/trends , Unsupervised Machine Learning
ABSTRACT
Drug-drug interaction (DDI) prediction identifies interactions of drug combinations in which the adverse side effects caused by the physicochemical incompatibility have attracted much attention. Previous studies usually model drug information from single or dual views of the whole drug molecules but ignore the detailed interactions among atoms, which leads to incomplete and noisy information and limits the accuracy of DDI prediction. In this work, we propose a novel dual-view drug representation learning network for DDI prediction ('DSN-DDI'), which employs local and global representation learning modules iteratively and learns drug substructures from the single drug ('intra-view') and the drug pair ('inter-view') simultaneously. Comprehensive evaluations demonstrate that DSN-DDI significantly improved performance on DDI prediction for the existing drugs by achieving a relatively improved accuracy of 13.01% and an over 99% accuracy under the transductive setting. More importantly, DSN-DDI achieves a relatively improved accuracy of 7.07% to unseen drugs and shows the usefulness for real-world DDI applications. Finally, DSN-DDI exhibits good transferability on synergistic drug combination prediction and thus can serve as a generalized framework in the drug discovery field.
Subjects
Drug-Related Side Effects and Adverse Reactions , Humans , Drug Interactions , Drug Discovery , Computational Biology
ABSTRACT
Precisely predicting drug-drug interactions (DDIs) is an important application and a hot research topic in drug discovery, especially for avoiding adverse effects when patients receive drug combination treatment. Machine learning and deep learning methods have achieved great success in DDI prediction. However, we notice that most existing works ignore the importance of the relation type when building DDI prediction models. In this work, we propose a novel R$^2$-DDI framework, which introduces a relation-aware feature refinement module for drug representation learning. The relation feature is integrated into the drug representation and refined in the framework. With the refined features, we also incorporate a consistency training method to regularize the multi-branch predictions for better generalization. Through extensive experiments and studies, we demonstrate that our R$^2$-DDI approach can significantly improve DDI prediction performance over multiple real-world datasets and settings, and that our method shows better generalization ability with the help of the feature refinement design.
Subjects
Drug-Related Side Effects and Adverse Reactions , Humans , Drug Interactions , Machine Learning , Drug Discovery
ABSTRACT
Accurate prediction of drug-target affinity (DTA) is of vital importance in early-stage drug discovery, facilitating the identification of drugs that can effectively interact with specific targets and regulate their activities. While wet experiments remain the most reliable method, they are time-consuming and resource-intensive, resulting in limited data availability that poses challenges for deep learning approaches. Existing methods have primarily focused on developing techniques based on the available DTA data, without adequately addressing the data scarcity issue. To overcome this challenge, we present the Semi-Supervised Multi-task training (SSM) framework for DTA prediction, which incorporates three simple yet highly effective strategies: (1) A multi-task training approach that combines DTA prediction with masked language modeling using paired drug-target data. (2) A semi-supervised training method that leverages large-scale unpaired molecules and proteins to enhance drug and target representations. This approach differs from previous methods that only employed molecules or proteins in pre-training. (3) The integration of a lightweight cross-attention module to improve the interaction between drugs and targets, further enhancing prediction accuracy. Through extensive experiments on benchmark datasets such as BindingDB, DAVIS and KIBA, we demonstrate the superior performance of our framework. Additionally, we conduct case studies on specific drug-target binding activities, virtual screening experiments, drug feature visualizations and real-world applications, all of which showcase the significant potential of our work. In conclusion, our proposed SSM-DTA framework addresses the data limitation challenge in DTA prediction and yields promising results, paving the way for more efficient and accurate drug discovery processes.
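Strategy (1) above, multi-task training, can be illustrated with a minimal sketch of a combined objective. The abstract does not give the exact loss, so everything here is an assumption: the function names, the use of mean squared error for the affinity term, the mean negative log-likelihood for the masked language modeling term, and the weighting factor `mlm_weight` are all hypothetical.

```python
from math import log


def mse(predicted, target):
    """Mean squared error between predicted and measured affinities."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def masked_lm_loss(true_token_probs):
    """Mean negative log-likelihood of the correct token at each masked position.

    true_token_probs holds the probability the model assigned to the correct
    token at each masked position (hypothetical input format).
    """
    return -sum(log(p) for p in true_token_probs) / len(true_token_probs)


def multitask_loss(pred_affinity, true_affinity, true_token_probs, mlm_weight=1.0):
    """Affinity regression loss plus a weighted masked language modeling loss.

    mlm_weight is an assumed hyperparameter, not a value from the paper.
    """
    return mse(pred_affinity, true_affinity) + mlm_weight * masked_lm_loss(true_token_probs)
```

In a sketch like this, both terms are computed on the same paired drug-target batch, so the masked-token term acts as an auxiliary signal for the shared encoders rather than as a separate pre-training stage.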
Subjects
Benchmarking , Drug Discovery , Drug Delivery Systems
ABSTRACT
Acute pancreatitis (AP) can be complicated by inflammatory disorders of remote organs, such as lung injury, in which Jumonji domain-containing protein 3 (JMJD3) plays a vital role in proinflammatory responses. In this study, we found that JMJD3 expression was upregulated in the pancreas and lung in an AP male mouse model, which was also confirmed in AP patients. Further experiments revealed that the upregulation of JMJD3 and proinflammatory effects were possibly exerted by mitochondrial DNA (mtDNA) or oxidized-mtDNA from tissue injury caused by AP. The release of mtDNA and oxidized-mtDNA contributed to the infiltration of inflammatory monocytes in lung injury through the stimulator of IFN genes (STING)/TLR9-NF-κB-JMJD3-TNF-α pathway. The inhibition of JMJD3 or utilization of Jmjd3-cKO mice significantly alleviated pulmonary inflammation induced by AP. Blocking mtDNA oxidation or knocking down the TLR9/STING pathway effectively alleviated inflammation. Therefore, inhibition of JMJD3 or STING/TLR9 pathway blockage might be a potential therapeutic strategy to treat AP and the associated lung injury.
Subjects
Lung Injury , Pancreatitis , Male , Mice , Animals , Toll-Like Receptor 9/metabolism , Acute Disease , NF-kappa B/metabolism , Mitochondrial DNA/genetics , Mitochondrial DNA/metabolism
ABSTRACT
Pre-trained language models have attracted increasing attention in the biomedical domain, inspired by their great success in the general natural language domain. Among the two main branches of pre-trained language models in the general language domain, i.e. BERT (and its variants) and GPT (and its variants), the first one has been extensively studied in the biomedical domain, such as BioBERT and PubMedBERT. While they have achieved great success on a variety of discriminative downstream biomedical tasks, the lack of generation ability constrains their application scope. In this paper, we propose BioGPT, a domain-specific generative Transformer language model pre-trained on large-scale biomedical literature. We evaluate BioGPT on six biomedical natural language processing tasks and demonstrate that our model outperforms previous models on most tasks. Especially, we get 44.98%, 38.42% and 40.76% F1 score on BC5CDR, KD-DTI and DDI end-to-end relation extraction tasks, respectively, and 78.2% accuracy on PubMedQA, creating a new record. Our case study on text generation further demonstrates the advantage of BioGPT on biomedical literature to generate fluent descriptions for biomedical terms.
Subjects
Data Mining , Natural Language Processing
ABSTRACT
The identification of active binding drugs for target proteins (referred to as drug-target interaction prediction) is the key challenge in virtual screening, which plays an essential role in drug discovery. Although recent deep learning-based approaches achieve better performance than molecular docking, existing models often neglect topological or spatial intermolecular information, hindering prediction performance. We recognize this problem and propose a novel approach called the Intermolecular Graph Transformer (IGT) that employs a dedicated attention mechanism to model intermolecular information with a three-way Transformer-based architecture. IGT outperforms state-of-the-art (SoTA) approaches by 9.1% and 20.5% over the second best option for binding activity and binding pose prediction, respectively, and generalizes better to unseen receptor proteins than SoTA approaches. Furthermore, IGT exhibits promising drug screening ability against severe acute respiratory syndrome coronavirus 2 by identifying 83.1% active drugs that have been validated by wet-lab experiments with near-native predicted binding poses. Source code and datasets are available at https://github.com/microsoft/IGT-Intermolecular-Graph-Transformer.
Subjects
Algorithms , COVID-19 , Humans , Molecular Docking Simulation , Proteins/chemistry , Software
ABSTRACT
A good understanding of protein function and structure in computational biology helps in the understanding of human beings. Because only a limited number of proteins are annotated structurally and functionally, the scientific community has embraced self-supervised pre-training on large amounts of unlabeled protein sequences for protein embedding learning. However, a protein is usually represented by individual amino acids with a limited vocabulary size (e.g. 20 amino acid types), without considering the strong local semantics existing in protein sequences. In this work, we propose a novel pre-training modeling approach, SPRoBERTa. We first present an unsupervised protein tokenizer to learn protein representations with local fragment patterns. Then, a novel framework for a deep pre-training model is introduced to learn protein embeddings. After pre-training, our method can be easily fine-tuned for different protein tasks, including amino acid-level prediction tasks (e.g. secondary structure prediction), amino acid pair-level prediction tasks (e.g. contact prediction) and protein-level prediction tasks (remote homology prediction, protein function prediction). Experiments show that our approach achieves significant improvements in all tasks and outperforms the previous methods. We also provide detailed ablation studies and analysis for our protein tokenizer and training framework.
Subjects
Computational Biology , Proteins , Humans , Proteins/chemistry , Computational Biology/methods , Amino Acid Sequence , Secondary Protein Structure , Amino Acids
ABSTRACT
MOTIVATION: The interaction between drugs and targets (DTI) in human body plays a crucial role in biomedical science and applications. As millions of papers come out every year in the biomedical domain, automatically discovering DTI knowledge from biomedical literature, which are usually triplets about drugs, targets and their interaction, becomes an urgent demand in the industry. Existing methods of discovering biological knowledge are mainly extractive approaches that often require detailed annotations (e.g. all mentions of biological entities, relations between every two entity mentions, etc.). However, it is difficult and costly to obtain sufficient annotations due to the requirement of expert knowledge from biomedical domains. RESULTS: To overcome these difficulties, we explore an end-to-end solution for this task by using generative approaches. We regard the DTI triplets as a sequence and use a Transformer-based model to directly generate them without using the detailed annotations of entities and relations. Further, we propose a semi-supervised method, which leverages the aforementioned end-to-end model to filter unlabeled literature and label them. Experimental results show that our method significantly outperforms extractive baselines on DTI discovery. We also create a dataset, KD-DTI, to advance this task and release it to the community. AVAILABILITY AND IMPLEMENTATION: Our code and data are available at https://github.com/bert-nmt/BERT-DTI. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Publications , Software , Humans , Drug Interactions
ABSTRACT
Machine learning force fields (MLFFs) have gained popularity in recent years as they provide a cost-effective alternative to ab initio molecular dynamics (MD) simulations. Despite a small error on the test set, MLFFs inherently suffer from generalization and robustness issues during MD simulations. To alleviate these issues, we propose global force metrics and fine-grained metrics from element and conformation aspects to systematically measure MLFFs for every atom and every conformation of molecules. We selected three state-of-the-art MLFFs (ET, NequIP, and ViSNet) and comprehensively evaluated on aspirin, Ac-Ala3-NHMe, and Chignolin MD datasets with the number of atoms ranging from 21 to 166. Driven by the trained MLFFs on these molecules, we performed MD simulations from different initial conformations, analyzed the relationship between the force metrics and the stability of simulation trajectories, and investigated the reason for collapsed simulations. Finally, the performance of MLFFs and the stability of MD simulations can be further improved guided by the proposed force metrics for model training, specifically training MLFF models with these force metrics as loss functions, fine-tuning by reweighting samples in the original dataset, and continued training by recruiting additional unexplored data.
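A fine-grained, element-wise force metric of the kind described above might be sketched as follows. This is only an illustration under stated assumptions: the abstract does not define its metrics, so the per-element mean absolute error used here, the function name `force_mae_by_element`, and the input layout (one element symbol and one 3-component force vector per atom) are all hypothetical.

```python
from collections import defaultdict


def force_mae_by_element(elements, predicted_forces, reference_forces):
    """Mean absolute force error per chemical element.

    elements: element symbol for each atom, e.g. ["C", "H", "H"]
    predicted_forces, reference_forces: one (fx, fy, fz) tuple per atom
    Returns a dict mapping each element symbol to its mean absolute error,
    averaged over the three force components and over atoms of that element.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for element, f_pred, f_ref in zip(elements, predicted_forces, reference_forces):
        # Per-atom error: mean absolute difference over the x, y, z components.
        totals[element] += sum(abs(a - b) for a, b in zip(f_pred, f_ref)) / 3
        counts[element] += 1
    return {element: totals[element] / counts[element] for element in totals}
```

Breaking the error down by element can expose atoms whose forces are poorly predicted even when the error averaged over the whole test set looks small.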
ABSTRACT
MOTIVATION: Gradient descent-based protein modeling is a popular protein structure prediction approach that takes as input the predicted inter-residue distances and other necessary constraints and folds protein structures by minimizing protein-specific energy potentials. The constraints from multiple predicted protein properties provide redundant and sometimes conflicting information that can trap the optimization process in local minima and impair the modeling efficiency. RESULTS: To address these issues, we developed a self-adaptive protein modeling framework, SAMF. It eliminates redundancy of constraints and resolves conflicts, folds protein structures in an iterative way, and selects the best structures with a deep quality analysis system. Without a large amount of complicated domain knowledge and numerous patches as barriers, SAMF achieves state-of-the-art performance by exploiting the power of cutting-edge deep learning techniques. SAMF has a modular design and can be easily customized and extended. As the quality of input constraints is ever growing, the superiority of SAMF will be amplified over time. AVAILABILITY AND IMPLEMENTATION: The source code and data for reproducing the results are available at https://msracb.blob.core.windows.net/pub/psp/SAMF.zip. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Proteins , Software , Proteins/metabolism
ABSTRACT
Accurate timely estimation of emissions of nitrogen oxides (NOx) is a prerequisite for designing an effective strategy for reducing O3 and PM2.5 pollution. The satellite-based top-down method can provide near-real-time constraints on emissions; however, its efficiency is largely limited by efforts in dealing with the complex emission-concentration response. Here, we propose a novel machine-learning-based method using a physically informed variational autoencoder (VAE) emission predictor to infer NOx emissions from satellite-retrieved surface NO2 concentrations. The computational burden can be significantly reduced with the help of a neural network trained with a chemical transport model, allowing the VAE emission predictor to provide a timely estimation of posterior emissions based on the satellite-retrieved surface NO2 concentration. The VAE emission predictor successfully corrected the underestimation of NOx emissions in rural areas and the overestimation in urban areas, resulting in smaller normalized mean biases (reduced from -0.8 to -0.4) and larger R2 values (increased from 0.4 to 0.7). The interpretability of the VAE emission predictor was investigated using sensitivity analysis by modulating each feature, indicating that NO2 concentration and planetary boundary layer (PBL) height are important for estimating NOx emissions, which is consistent with our common knowledge. The advantages of the VAE emission predictor in efficiency, flexibility, and accuracy demonstrate its great potential in estimating the latest emissions and evaluating the control effectiveness from observations.
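The two evaluation statistics quoted above, normalized mean bias (NMB) and R2, can be computed as in the following sketch. The NMB formula (sum of prediction minus observation, divided by the sum of observations) is the standard definition. Treating R2 as the squared Pearson correlation is an assumption, since the abstract does not say which definition was used; the function names are illustrative.

```python
def normalized_mean_bias(predicted, observed):
    """NMB = sum(predicted - observed) / sum(observed).

    Negative values indicate overall underestimation.
    """
    return sum(p - o for p, o in zip(predicted, observed)) / sum(observed)


def r_squared(predicted, observed):
    """Squared Pearson correlation between predictions and observations.

    Assumes both series have non-zero variance.
    """
    n = len(observed)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    return cov * cov / (var_p * var_o)
```

The two statistics are complementary: predictions that are exactly half the observed values have an NMB of -0.5 yet a squared correlation of 1, which is why the abstract reports improvements in both.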
Subjects
Air Pollutants , Air Pollution , Air Pollutants/analysis , Air Pollution/analysis , Computer Neural Networks , Nitric Oxide/analysis , Nitrogen Dioxide/analysis , Nitrogen Oxides/analysis , Vehicle Emissions/analysis
ABSTRACT
Fast and accurate prediction of ambient ozone (O3) formed from atmospheric photochemical processes is crucial for designing effective O3 pollution control strategies in the context of climate change. The chemical transport model (CTM) is the fundamental tool for O3 prediction and policy design; however, existing CTM-based approaches are computationally expensive, and resource burdens limit their usage and effectiveness in air quality management. Here we propose a novel method (denoted DeepCTM) that uses deep learning to mimic CTM simulations and improve the computational efficiency of photochemical modeling. The well-trained DeepCTM successfully reproduces CTM-simulated O3 concentrations using input features of precursor emissions, meteorological factors, and initial conditions. The advantage of the DeepCTM is its high efficiency in identifying the dominant contributors to O3 formation and quantifying the O3 response to variations in emissions and meteorology. The emission-meteorology-concentration linkages implied by the DeepCTM are consistent with known mechanisms of atmospheric chemistry, indicating that the DeepCTM is also scientifically reasonable. The DeepCTM application in China suggests that O3 concentrations are strongly influenced by the initialized O3 concentration, as well as by emission and meteorological factors during daytime when O3 is formed photochemically. The variation of meteorological factors such as short-wave radiation can also significantly modulate the O3 chemistry. The DeepCTM developed in this study exhibits great potential for efficiently representing the complex atmospheric system and can provide policymakers with urgently needed information for designing effective control strategies to mitigate O3 pollution.
ABSTRACT
BACKGROUND: Fragment libraries play a key role in fragment-assembly-based protein structure prediction, where protein fragments are assembled to form a complete three-dimensional structure. Rich and accurate structural information embedded in fragment libraries has not been systematically extracted and used beyond fragment assembly. METHODS: To better leverage the valuable structural information for protein structure prediction, we extracted seven types of structural information from fragment libraries. We broadened the usage of such structural information by transforming fragment libraries into protein-specific potentials for gradient-descent-based protein folding and encoding fragment libraries as structural features for protein property prediction. RESULTS: Fragment libraries improved the accuracy of protein folding and outperformed state-of-the-art algorithms with respect to predicted properties, such as torsion angles and inter-residue distances. CONCLUSION: Our work implies that the rich structural information extracted from fragment libraries can complement sequence-derived features to help protein structure prediction.
Subjects
Algorithms , Proteins , Protein Folding , Proteins/genetics
ABSTRACT
Surgical resection is a common therapeutic option for primary solid tumors. However, high cancer recurrence and metastatic rates after resection are the main cause of cancer-related mortality. This implies the existence of a "fertile soil" following surgery that facilitates colonization by circulating cancer cells. Myeloid-derived suppressor cells (MDSCs) are essential for premetastatic niche formation, and may persist in distant organs for up to 2 weeks after surgery. These postsurgical persistent lung MDSCs exhibit stronger immunosuppression compared with presurgical MDSCs, suggesting that surgery enhances MDSC function. Surgical stress and trauma trigger the secretion of systemic inflammatory cytokines, which enhance MDSC mobilization and proliferation. Additionally, damage-associated molecular patterns (DAMPs) directly activate MDSCs through pattern recognition receptor-mediated signals. Surgery also increases vascular permeability and induces an increase in lysyl oxidase and extracellular matrix remodeling in the lungs, which enhances MDSC mobilization. Postsurgical therapies that inhibit the induction of premetastatic niches by MDSCs promote the long-term survival of patients. Cyclooxygenase-2 inhibitors and β-blockade, or their combination, may minimize the impact of surgical stress on MDSCs. Anti-DAMPs and associated inflammatory signaling inhibitors are also potential therapies. Existing therapies under tumor-bearing conditions, such as MDSC depletion with low-dose chemotherapy or tyrosine kinase inhibitors, MDSC differentiation using all-trans retinoic acid, and STAT3 inhibition, merit clinical evaluation during the perioperative period. In addition, combining low-dose epigenetic drugs with chemokine receptors, reversing immunosuppression through the Enhanced Recovery After Surgery protocol, repairing vascular leakage, or inhibiting extracellular matrix remodeling may also enhance the long-term survival of curative resection patients.
Subjects
Antineoplastic Agents , Myeloid-Derived Suppressor Cells , Circulating Neoplastic Cells , Humans , Lung , Local Neoplasm Recurrence
ABSTRACT
INTRODUCTION: Tenosynovial giant cell tumor (TGCT) is a locally aggressive tumor with colony-stimulating factor 1 receptor (CSF1R) signal expression. However, good in vivo and ex vivo models for TGCT are lacking. This study aims to establish a favorable preclinical translational platform, which would enable the validation of efficient and personalized therapeutic candidates for TGCT. PATIENTS AND METHODS: Histological analyses were performed for the included patients. Fresh TGCT tumors were collected and sliced into 1.0-3.0 mm3 sections using a sterilized razor blade. The tumor grafts were surgically implanted into subrenal capsules of athymic mice to establish patient-derived tumor xenograft (PDTX) mouse models. Histology and response patterns to CSF1R inhibitors were evaluated. In addition, ex vivo cultures of patient-derived explants (PDEs) with endpoint analysis were used to validate TGCT graft response patterns to CSF1R inhibitors. RESULTS: The TGCT tumor grafts that were implanted into athymic mice subrenal capsules maintained their original morphological and histological features. The "take" rate of this model was 95% (19/20). Administration of CSF1R inhibitors (PLX3397, and a novel candidate, WXFL11420306) to TGCT-PDTX mice was shown to reduce tumor size while inducing intratumoral apoptosis. In addition, the CSF1R inhibitors suppressed circulating nonspecific monocyte levels and CD163-positive cells within tumors. These response patterns of engrafts to PDTX were validated by ex vivo PDE cultures. CONCLUSIONS: The subrenal capsule supports the growth of TGCT tumor grafts, maintaining their original morphology and histology. This TGCT-PDTX model plus ex vivo explant cultures is a potential preclinical translational platform for locally aggressive tumors, such as TGCT.
Subjects
Antineoplastic Agents , Giant Cell Tumor of Tendon Sheath , Pharmaceutical Preparations , Animals , Antineoplastic Agents/therapeutic use , Giant Cell Tumor of Tendon Sheath/drug therapy , Heterografts , Humans , Mice
ABSTRACT
Efficient prediction of the air quality response to emission changes is a prerequisite for an integrated assessment system in developing effective control policies. Yet, representing the nonlinear response of air quality to emission controls with accuracy remains a major barrier in air quality-related decision making. Here, we demonstrate a novel method that combines deep learning approaches with chemical indicators of pollutant formation to quickly estimate the coefficients of air quality response functions using ambient concentrations of 18 chemical indicators simulated with a comprehensive atmospheric chemical transport model (CTM). By requiring only two CTM simulations for model application, the new method significantly enhances the computational efficiency compared to existing methods that achieve lower accuracy despite requiring 20+ CTM simulations (the benchmark statistical model). Our results demonstrate the utility of deep learning approaches for capturing the nonlinearity of atmospheric chemistry and physics and the prospects of the new method to support effective policymaking in other environment systems.
Subjects
Air Pollutants , Air Pollution , Deep Learning , Air Pollutants/analysis , Air Pollution/analysis , Environmental Monitoring , Statistical Models
ABSTRACT
The present work describes the in vitro antibacterial evaluation of some new pyrimidine derivatives. Twenty-two target compounds were designed, synthesized and preliminarily explored for their antimicrobial activities. The antimicrobial assay revealed that some target compounds exhibited significant inhibitory activity against bacteria and fungi, including drug-resistant pathogens. Compound 7c presented the most potent inhibitory activities against Gram-positive bacteria (e.g., Staphylococcus aureus 4220), Gram-negative bacteria (e.g., Escherichia coli 1924) and the fungus Candida albicans 7535, with an MIC of 2.4 µmol/L. Compound 7c was also the most potent, with MICs of 2.4 or 4.8 µmol/L against four multidrug-resistant, Gram-positive bacterial strains. The toxicity of compounds 7c, 10a, 19d and 26b was assessed in human normal liver cells (L02 cells). Molecular docking simulation and analysis suggested that compound 7c has a good interaction with the active cavities of dihydrofolate reductase (DHFR). An in vitro enzyme study implied that compound 7c also displayed DHFR inhibition.
Subjects
Anti-Bacterial Agents/chemistry , Anti-Bacterial Agents/chemical synthesis , Pyrimidines/chemistry , Anti-Bacterial Agents/pharmacology , Cell Line , Fungi/drug effects , Gram-Negative Bacteria/drug effects , Gram-Positive Bacteria/drug effects , Humans , Microbial Sensitivity Tests/methods , Molecular Docking Simulation/methods , Structure-Activity Relationship
ABSTRACT
LINC00152 has been considered to be associated with the tumorigenesis and occurrence of gastric cancer; however, the mechanism of LINC00152 has yet to be fully elucidated. In the present study, the expression levels of LINC00152 in tissues, serum, and peripheral blood mononuclear cells (PBMCs) of patients with gastric cancer were determined using real-time polymerase chain reaction. The functions of LINC00152 with respect to the proliferation, apoptosis, migration, and invasive abilities of gastric cancer cells were evaluated by cell proliferation analysis, flow cytometry, cell scratch wound assay, and transwell migration experiments. A mouse xenotransplant model of gastric tumors was established to detect the role of LINC00152 in vivo, and the expression levels of B-cell lymphoma-2 (Bcl-2) family proteins were investigated by Western blot analysis. The results revealed that LINC00152 was overexpressed in tissues, serum, and PBMCs of patients with gastric cancer. Moreover, LINC00152 could promote the migration and invasive abilities, and suppress the apoptosis, of gastric cancer cells through regulating the Bcl-2 protein family. LINC00152 could bind with Bcl-2 directly to induce the activation of cell cycle signaling, and this may be a potential target for the therapy of gastric cancer in the future.