Search | VHL Research Portal

Pretraining Strategies for Structure Agnostic Material Property Prediction.

Huang, Hongshuo; Magar, Rishikesh; Barati Farimani, Amir.

J Chem Inf Model ; 64(3): 627-637, 2024 Feb 12.

Article in English | MEDLINE | ID: mdl-38301621

ABSTRACT

In recent years, machine learning (ML), especially graph neural network (GNN) models, has been successfully used for fast and accurate prediction of material properties. However, most ML models rely on relaxed crystal structures to develop descriptors for accurate predictions. Generating these relaxed crystal structures can be expensive and time-consuming, thus requiring an additional processing step for models that rely on them. To address this challenge, structure-agnostic methods have been developed, which use fixed-length descriptors engineered based on human knowledge about the material. However, these fixed-length descriptors are hand-engineered, require extensive domain knowledge, and are generally not used with learnable models, which are known to perform better. Recent work has proposed learnable frameworks that construct representations from stoichiometry alone, combining the flexibility of deep learning with structure-agnostic learning. In this work, we propose three pretraining strategies for these structure-agnostic, learnable frameworks to further improve downstream material property prediction. We incorporate self-supervised learning (SSL), fingerprint learning (FL), and multimodal learning (MML) and demonstrate their efficacy on downstream tasks for the Roost architecture, a popular structure-agnostic framework. Our results show significant improvement on small data sets and improved data efficiency on larger data sets, underscoring the potential of our pretraining strategies, which effectively leverage unlabeled data for accurate material property prediction.
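To make "representations based on stoichiometry alone" concrete, the following is a minimal sketch of the kind of input such a model starts from: a chemical formula reduced to element fractions. It is not the Roost implementation; the function name `fractional_composition` is hypothetical, and the parser assumes simple formulas without parentheses or hydrates.

```python
import re
from collections import Counter


def fractional_composition(formula: str) -> dict[str, float]:
    """Turn a simple formula such as 'Fe2O3' into element fractions.

    Assumption: the formula has no parentheses, charges, or hydrates.
    """
    counts: Counter = Counter()
    # Each match is an element symbol followed by an optional amount.
    for symbol, amount in re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula):
        counts[symbol] += float(amount) if amount else 1.0
    total = sum(counts.values())
    return {element: n / total for element, n in counts.items()}


composition = fractional_composition("Fe2O3")
```

No atomic coordinates enter this representation, which is why no relaxed crystal structure is needed before prediction.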

Subjects

Machine Learning; Neural Networks, Computer; Humans

GPCR-BERT: Interpreting Sequential Design of G Protein-Coupled Receptors Using Protein Language Models.

Kim, Seongwon; Mollaei, Parisa; Antony, Akshay; Magar, Rishikesh; Barati Farimani, Amir.

J Chem Inf Model ; 64(4): 1134-1144, 2024 Feb 26.

Article in English | MEDLINE | ID: mdl-38340054

ABSTRACT

With the rise of transformers and large language models (LLMs) in chemistry and biology, new avenues for the design and understanding of therapeutics have opened up to the scientific community. Protein sequences can be modeled as language and can take advantage of recent advances in LLMs, particularly given the abundance of available protein sequence data sets. In this letter, we developed the GPCR-BERT model for understanding the sequential design of G protein-coupled receptors (GPCRs). GPCRs are the target of over one-third of Food and Drug Administration-approved pharmaceuticals. However, there is a lack of comprehensive understanding regarding the relationship among amino acid sequence, ligand selectivity, and conformational motifs (such as NPxxY, CWxP, and E/DRY). By utilizing the pretrained protein model (Prot-Bert) and fine-tuning it on tasks that predict variations in the motifs, we were able to shed light on several relationships between residues in the binding pocket and some of the conserved motifs. To achieve this, we interpreted the attention weights and hidden states of the model to quantify how much each amino acid contributes to determining the identity of the masked ones. The fine-tuned models demonstrated high accuracy in predicting hidden residues within the motifs. In addition, the embeddings were analyzed over 3D structures to elucidate the higher-order interactions within the conformations of the receptors.
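As an illustration of what "predicting hidden residues within the motifs" can look like at the input level, here is a small sketch that hides the variable positions of an NPxxY motif. The function name, the regular expression, and the toy sequence are hypothetical; the space-separated residue format with a `[MASK]` token follows the usual BERT-style convention and is an assumption about the preprocessing, not a description of the authors' code.

```python
import re
from typing import Optional


def mask_motif_variable_sites(
    sequence: str,
    motif_regex: str = r"NP(..)Y",  # hypothetical pattern for NPxxY
    mask_token: str = "[MASK]",
) -> Optional[str]:
    """Replace the variable residues of the first motif match with mask tokens.

    Returns the residues joined by spaces, or None if the motif is absent.
    """
    match = re.search(motif_regex, sequence)
    if match is None:
        return None
    tokens = list(sequence)
    # Group 1 covers the two variable positions (the "xx" in NPxxY).
    for position in range(match.start(1), match.end(1)):
        tokens[position] = mask_token
    return " ".join(tokens)


masked = mask_motif_variable_sites("AANPLIYAA")  # toy sequence, not a real GPCR
```

A model fine-tuned on such inputs is asked to recover the hidden residues, and its attention weights can then be inspected to see which other positions it relied on.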

Subjects

Receptors, G-Protein-Coupled; Receptors, G-Protein-Coupled/chemistry; Amino Acid Sequence; Ligands

MOFormer: Self-Supervised Transformer Model for Metal-Organic Framework Property Prediction.

Cao, Zhonglin; Magar, Rishikesh; Wang, Yuyang; Barati Farimani, Amir.

J Am Chem Soc ; 145(5): 2958-2967, 2023 Feb 08.

Article in English | MEDLINE | ID: mdl-36706365

ABSTRACT

Metal-organic frameworks (MOFs) are materials with a high degree of porosity that can be used for many applications. However, the chemical space of MOFs is enormous due to the large variety of possible combinations of building blocks and topology. Discovering the optimal MOFs for specific applications requires an efficient and accurate search over countless potential candidates. Previous high-throughput screening methods using computational simulations like DFT can be time-consuming. Such methods also require the 3D atomic structures of MOFs, which adds an extra step when evaluating hypothetical MOFs. In this work, we propose a structure-agnostic deep learning method based on the Transformer model, named MOFormer, for property prediction of MOFs. MOFormer takes a text string representation of a MOF (MOFid) as input, thus circumventing the need to obtain the 3D structure of a hypothetical MOF and accelerating the screening process. By comparing with other descriptors such as Stoichiometric-120 and revised autocorrelations, we demonstrate that MOFormer can achieve state-of-the-art structure-agnostic prediction accuracy on all benchmarks. Furthermore, we introduce a self-supervised learning framework that pretrains MOFormer by maximizing the cross-correlation between its structure-agnostic representations and the structure-based representations of the crystal graph convolutional neural network (CGCNN) on >400k publicly available MOF data. Benchmarks show that pretraining improves the prediction accuracy of both models on various downstream prediction tasks. We also show that MOFormer can be more data-efficient on quantum-chemical property prediction than the structure-based CGCNN when training data is limited. Overall, MOFormer provides a novel perspective on efficient MOF property prediction using deep learning.
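The pretraining objective described above, maximizing the cross-correlation between two sets of representations, can be sketched with a Barlow Twins-style loss. Whether this matches the paper's exact objective is an assumption; the function names and the `off_diag_weight` value are hypothetical, and plain Python lists stand in for the Transformer and CGCNN embeddings.

```python
def _standardized_columns(batch):
    """Return the feature columns of a batch, each with zero mean and unit variance."""
    columns = []
    for column in zip(*batch):
        mean = sum(column) / len(column)
        std = (sum((x - mean) ** 2 for x in column) / len(column)) ** 0.5
        columns.append([(x - mean) / (std + 1e-12) for x in column])
    return columns


def cross_correlation_loss(z_a, z_b, off_diag_weight=0.005):
    """Barlow Twins-style loss between two batches of embeddings.

    z_a and z_b are lists of rows (one row per sample, same feature width).
    Diagonal terms push matching features toward correlation 1;
    off-diagonal terms push different features toward correlation 0.
    """
    cols_a = _standardized_columns(z_a)
    cols_b = _standardized_columns(z_b)
    n_samples = len(z_a)
    loss = 0.0
    for i, col_a in enumerate(cols_a):
        for j, col_b in enumerate(cols_b):
            c_ij = sum(x * y for x, y in zip(col_a, col_b)) / n_samples
            loss += (1.0 - c_ij) ** 2 if i == j else off_diag_weight * c_ij ** 2
    return loss


# Toy embeddings: identical views give a loss near zero.
view = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
loss_identical = cross_correlation_loss(view, view)
```

Minimizing this loss aligns the two encoders feature by feature, so the text-based model can inherit structural information without seeing 3D coordinates at inference time.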

Improving Molecular Contrastive Learning via Faulty Negative Mitigation and Decomposed Fragment Contrast.

Wang, Yuyang; Magar, Rishikesh; Liang, Chen; Barati Farimani, Amir.

J Chem Inf Model ; 62(11): 2713-2725, 2022 Jun 13.

Article in English | MEDLINE | ID: mdl-35638560

ABSTRACT

Deep learning has become prevalent in computational chemistry and is widely used in molecular property prediction. Recently, self-supervised learning (SSL), especially contrastive learning (CL), has gathered growing attention for its potential to learn molecular representations that generalize to the gigantic chemical space. Unlike supervised learning, SSL can directly leverage large unlabeled data, which greatly reduces the effort to acquire molecular property labels through costly and time-consuming simulations or experiments. However, most molecular SSL methods borrow insights from the machine learning community but neglect the unique cheminformatics (e.g., molecular fingerprints) and multilevel graphical structures (e.g., functional groups) of molecules. In this work, we propose iMolCLR, an improvement of Molecular Contrastive Learning of Representations with graph neural networks (GNNs), in two aspects: (1) mitigating faulty negative contrastive instances by considering cheminformatics similarities between molecule pairs and (2) fragment-level contrasting between intramolecule and intermolecule substructures decomposed from molecules. Experiments show that the proposed strategies significantly improve the performance of GNN models on various challenging molecular property predictions. In comparison to the previous CL framework, iMolCLR demonstrates an average 1.2% improvement in ROC-AUC on eight classification benchmarks and an average 10.1% decrease in error on six regression benchmarks. On most benchmarks, the generic GNN pretrained by iMolCLR rivals or even surpasses supervised learning models with sophisticated architectures and engineered features. Further investigations demonstrate that representations learned through iMolCLR intrinsically embed scaffolds and functional groups that can be used to reason about molecular similarities.
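The first idea above, down-weighting negatives that are chemically similar to the anchor, can be sketched as follows. This is an illustrative reading of the abstract, not the published iMolCLR loss: the weighting form `1 - lam * similarity`, the parameter names `lam` and `tau`, and the use of sets of "on" bits as fingerprints are all assumptions.

```python
import math


def tanimoto(bits_a: set, bits_b: set) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def weighted_contrastive_loss(sim_pos, sim_negs, fp_sims, lam=0.5, tau=0.1):
    """Contrastive loss for one anchor with similarity-weighted negatives.

    sim_pos:  embedding similarity between the anchor and its positive view.
    sim_negs: embedding similarities between the anchor and each negative.
    fp_sims:  fingerprint similarities between the anchor and each negative;
              a higher value shrinks that negative's contribution.
    """
    weights = [1.0 - lam * s for s in fp_sims]
    positive = math.exp(sim_pos / tau)
    negatives = sum(w * math.exp(s / tau) for w, s in zip(weights, sim_negs))
    return -math.log(positive / (positive + negatives))


similarity = tanimoto({1, 2, 3}, {2, 3, 4})
loss = weighted_contrastive_loss(0.9, [0.8, 0.1], [similarity, 0.0])
```

A negative that shares most of its fingerprint with the anchor is likely a "faulty" negative, so it is penalized less for sitting close to the anchor in embedding space.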

Subjects

Cheminformatics; Neural Networks, Computer; Computational Chemistry; Machine Learning

Isolating Specific vs. Non-Specific Binding Responses in Conducting Polymer Biosensors for Bio-Fingerprinting.

Smith, Phil M; Sutradhar, Indorica; Telmer, Maxwell; Magar, Rishikesh; Farimani, Amir Barati; Reeja-Jayan, B.

Sensors (Basel) ; 21(19), 2021 Sep 22.

Article in English | MEDLINE | ID: mdl-34640658

ABSTRACT

A longstanding challenge for accurate sensing of biomolecules such as proteins concerns specifically detecting a target analyte in a complex sample (e.g., food) without suffering from nonspecific binding or interactions from the target itself or other analytes present in the sample. Every sensor suffers from this fundamental drawback, which limits its sensitivity, specificity, and longevity. Existing efforts to improve the signal-to-noise ratio involve introducing additional steps to reduce nonspecific binding, which increases the cost of the sensor. Conducting polymer-based chemiresistive biosensors can be mechanically flexible, inexpensive, and label-free, and can detect specific biomolecules in complex samples without purification steps, making them very versatile. In this paper, a poly(3,4-ethylenedioxythiophene) (PEDOT) and poly(3-thiopheneethanol) (3TE) interpenetrating network on polypropylene-cellulose fabric is used as a platform for a chemiresistive biosensor, and the specific and nonspecific binding events are studied using the Biotin/Avidin and Gliadin/G12-specific complementary binding pairs. We observed that specific binding between these pairs results in a negative ΔR upon addition of the analyte, and this response increases with increasing analyte concentration. Nonspecific binding produced the opposite response: a positive ΔR upon addition of the analyte. We further demonstrate the ability of the sensor to detect a targeted protein in a dual-protein analyte solution. A random forest machine-learning classifier predicted the presence of Biotin with 75% accuracy in dual-analyte solutions. This capability of distinguishing between specific and nonspecific binding can be a step towards solving the problem of false positives or false negatives to which all biosensors are susceptible.

Subjects

Biosensing Techniques; Polymers; Biotin; Proteins

Forecasting COVID-19 new cases using deep learning methods.

Xu, Lu; Magar, Rishikesh; Barati Farimani, Amir.

Comput Biol Med ; 144: 105342, 2022 May.

Article in English | MEDLINE | ID: mdl-35247764

ABSTRACT

Nearly two years after the first identification of the SARS-CoV-2 virus, the surge in cases caused by virus mutations is a cause of grave public health concern across the globe. As a result of this health crisis, predicting the transmission pattern of the virus is one of the most vital tasks for preparing for and controlling the pandemic. In addition to mathematical models, machine learning tools, especially deep learning models, have been developed for forecasting the trend in the number of patients affected by SARS-CoV-2 with great success. In this paper, three deep learning models (CNN, LSTM, and CNN-LSTM) have been developed to predict the number of COVID-19 cases for Brazil, India, and Russia. We also compare the performance of our models with previously developed deep learning models and observe significant improvements in prediction performance. Although our models have been used only for forecasting cases in these three countries, they can be easily applied to datasets from other countries. Among the models developed in this work, the LSTM model has the highest forecasting performance and shows improved forecasting accuracy compared with some existing models. This research will enable accurate forecasting of COVID-19 cases and support the global fight against the pandemic.
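Sequence models such as the LSTM and CNN described above are typically trained on fixed-length windows of past values paired with the values to predict. The sketch below shows that windowing step only; the function name `make_windows`, the `lookback` and `horizon` parameters, and the toy case counts are illustrative assumptions, not the paper's preprocessing.

```python
def make_windows(series, lookback, horizon=1):
    """Split a time series into (input window, target window) pairs.

    Each input holds `lookback` consecutive values; each target holds the
    `horizon` values that immediately follow it.
    """
    inputs, targets = [], []
    for start in range(len(series) - lookback - horizon + 1):
        split = start + lookback
        inputs.append(series[start:split])
        targets.append(series[split:split + horizon])
    return inputs, targets


daily_cases = [120, 135, 150, 170, 160, 180]  # made-up numbers for illustration
windows, next_values = make_windows(daily_cases, lookback=3)
```

The same pairs can feed any of the three architectures, which makes it straightforward to compare them on one country's series and then reuse the pipeline for another.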

Subjects

COVID-19; Deep Learning; COVID-19/epidemiology; Forecasting; Humans; Pandemics; SARS-CoV-2

W-Net: Dense and diagnostic semantic segmentation of subcutaneous and breast tissue in ultrasound images by incorporating ultrasound RF waveform data.

Gare, Gautam Rajendrakumar; Li, Jiayuan; Joshi, Rohan; Magar, Rishikesh; Vaze, Mrunal Prashant; Yousefpour, Michael; Rodriguez, Ricardo Luis; Galeotti, John Michael.

Med Image Anal ; 76: 102326, 2022 Feb.

Article in English | MEDLINE | ID: mdl-34936967

ABSTRACT

We study the use of raw ultrasound waveforms, often referred to as the "Radio Frequency" (RF) data, for the semantic segmentation of ultrasound scans to carry out dense and diagnostic labeling. We present W-Net, a novel Convolutional Neural Network (CNN) framework that employs the raw ultrasound waveforms in addition to the grey ultrasound image to semantically segment and label tissues for anatomical, pathological, or other diagnostic purposes. To the best of our knowledge, this is also the first deep-learning or CNN approach for segmentation that analyzes ultrasound raw RF data along with the grey image. We chose subcutaneous tissue (SubQ) segmentation as our initial clinical goal for dense segmentation since it has diverse intermixed tissues, is challenging to segment, and is an underrepresented research area. Potential SubQ applications include plastic surgery, adipose stem-cell harvesting, lymphatic monitoring, and possibly detection/treatment of certain types of tumors. Unlike prior work, we seek to label every pixel in the image, without the use of a background class. A custom dataset consisting of images hand-labeled by an expert clinician and trainees is used for the experiments, currently labeled into the following categories: skin, fat, fat fascia/stroma, muscle, and muscle fascia. We compared W-Net and an attention variant of W-Net (AW-Net) with U-Net and Attention U-Net (AU-Net). Our novel W-Net's RF-waveform encoding architecture outperformed regular U-Net and AU-Net, achieving the best mIoU accuracy (averaged across all tissue classes). We study the impact of RF data on dense labeling of the SubQ region, followed by analyses of the generalization capability of the networks across patients and of the SubQ tissue classes, determining that fascia tissues, especially muscle fascia, are the most difficult anatomic class to recognize for both humans and AI algorithms.
We present diagnostic semantic segmentation, which is semantic segmentation carried out for the purpose of direct diagnostic pixel labeling, and apply it to a breast tumor detection task on a publicly available dataset to segment pixels into malignant tumor, benign tumor, and background tissue classes. Using the segmented image, we diagnose the patient by classifying the breast lesion as either benign or malignant. We demonstrate the diagnostic capability of RF data with the use of W-Net, which achieves the best segmentation scores across all classes.
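The mIoU figure mentioned above is the per-class intersection over union averaged across classes. A minimal sketch follows, assuming flattened lists of integer class labels and the common convention of skipping classes absent from both prediction and ground truth; the function name `mean_iou` and that convention are assumptions, not details taken from the paper.

```python
def mean_iou(predicted, target, num_classes):
    """Mean intersection over union for flattened per-pixel class labels.

    Classes that appear in neither `predicted` nor `target` are skipped
    so that they do not distort the average.
    """
    ious = []
    for cls in range(num_classes):
        intersection = sum(p == cls and t == cls for p, t in zip(predicted, target))
        union = sum(p == cls or t == cls for p, t in zip(predicted, target))
        if union:
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0


# Four pixels, two classes: one pixel of class 1 is mislabeled as class 0.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

Because every class contributes equally, thin or rare tissues such as fascia weigh as much as large regions like fat, which is why poor performance on them shows up clearly in this metric.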

Subjects

Semantics; Subcutaneous Tissue; Humans; Image Processing, Computer-Assisted/methods; Neural Networks, Computer; Ultrasonography

Potential neutralizing antibodies discovered for novel corona virus using machine learning.

Magar, Rishikesh; Yadav, Prakarsh; Barati Farimani, Amir.

Sci Rep ; 11(1): 5261, 2021 Mar 04.

Article in English | MEDLINE | ID: mdl-33664393

ABSTRACT

Fast and untraceable virus mutations take the lives of thousands of people before the immune system can produce inhibitory antibodies. The recent outbreak of COVID-19 infected and killed thousands of people around the world. Rapid methods for finding peptides or antibody sequences that can inhibit the viral epitopes of SARS-CoV-2 will save thousands of lives. In this paper, we use different machine learning (ML) models to predict possible inhibitory synthetic antibodies for SARS-CoV-2 in a high-throughput manner. We collected 1933 virus-antibody sequences and their clinical patient neutralization response and trained an ML model to predict the antibody response. Using graph featurization with a variety of ML methods, such as XGBoost, Random Forest, Multilayer Perceptron, Support Vector Machine, and Logistic Regression, we screened thousands of hypothetical antibody sequences and found nine stable antibodies that potentially inhibit SARS-CoV-2. We combined bioinformatics, structural biology, and Molecular Dynamics (MD) simulations to verify the stability of the candidate antibodies that can inhibit SARS-CoV-2.

Subjects

Antibodies, Neutralizing; Machine Learning; SARS-CoV-2/immunology; High-Throughput Screening Assays/methods; SARS-CoV-2/genetics

Understanding mutation hotspots for the SARS-CoV-2 spike protein using Shannon Entropy and K-means clustering.

Mullick, Baishali; Magar, Rishikesh; Jhunjhunwala, Aastha; Barati Farimani, Amir.

Comput Biol Med ; 138: 104915, 2021 Nov.

Article in English | MEDLINE | ID: mdl-34655896

ABSTRACT

The SARS-CoV-2 virus, like many other viruses, has continually changed to give rise to new variants by means of mutations, commonly through substitutions and indels. In some cases, these mutations can give the virus a survival advantage, making the mutants dangerous. In general, laboratory investigation must be carried out to determine whether the new variants have any characteristics that can make them more lethal and contagious. Therefore, complex and time-consuming analyses are required to delve deeper into the exact impact of a particular mutation. The time required for these analyses makes it difficult to understand the variants of concern and thereby limits the preventive action that can be taken against their rapid spread. In this analysis, we have deployed a statistical technique, Shannon entropy, to identify positions in the spike protein of the SARS-CoV-2 viral sequence that are most susceptible to mutations. Subsequently, we also use machine-learning-based clustering techniques to cluster known dangerous mutations based on similarities in properties. This work utilizes embeddings generated using language modeling, the ProtBERT model, to identify mutations of a similar nature and to pick out regions of interest based on proneness to change. Our entropy-based analysis successfully predicted fifteen hotspot regions, among which we were able to validate ten known variants of interest in six hotspot regions. As the SARS-CoV-2 situation rapidly evolves, we believe that the remaining nine mutational hotspots may contain variants that can emerge in the future. We believe that this may help the research community devise therapeutics based on probable new mutation zones in the viral sequence and on the resemblance in properties of various mutations.
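The entropy step described above can be illustrated with a short sketch: for each column of a set of aligned sequences, compute the Shannon entropy H = -Σ p(a) log2 p(a) over the residues a observed at that position, so that conserved positions score 0 and variable positions score higher. The function names and the toy alignment are hypothetical, and the sketch assumes equal-length, already aligned sequences.

```python
import math
from collections import Counter


def column_entropy(column) -> float:
    """Shannon entropy (in bits) of the residues observed at one position."""
    total = len(column)
    counts = Counter(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def positional_entropy(aligned_sequences):
    """Entropy of every column in a list of equal-length aligned sequences."""
    return [column_entropy(column) for column in zip(*aligned_sequences)]


# Toy alignment: position 0 is conserved, positions 1 and 2 vary.
toy_alignment = ["ACD", "ACE", "AGD", "AGE"]
entropies = positional_entropy(toy_alignment)
```

Positions whose entropy exceeds a chosen threshold are the candidate hotspots; how that threshold is set is not stated in the abstract.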

Subjects

COVID-19; Spike Glycoprotein, Coronavirus; Cluster Analysis; Entropy; Humans; Mutation; SARS-CoV-2; Spike Glycoprotein, Coronavirus/genetics
