Results 1 - 20 of 52
1.
AMIA Jt Summits Transl Sci Proc ; 2024: 409-418, 2024.
Article in English | MEDLINE | ID: mdl-38827107

ABSTRACT

Cancer outcomes are poor in resource-limited countries owing to high costs and an insufficient pathologist-to-population ratio. The advent of digital pathology has helped improve cancer outcomes; however, Whole Slide Image scanners are expensive and unaffordable in low-income countries. Microscope-acquired images, on the other hand, are cheap to collect and can be more viable for automating cancer detection. In this study, we propose LCH-Network, a novel method to identify the cancer mitotic count from microscope-acquired images. We introduced Label Mix and also synthesized images using GANs to handle data imbalance. Moreover, we applied progressive resolution to handle different image scales for mitotic localization. We achieved an F1-score of 0.71 and outperformed other existing techniques. Our findings enable mitotic count estimation from microscopic images with a low-cost setup. Clinically, our method could help avoid presumptive treatment without a confirmed cancer diagnosis.

2.
Digit Health ; 10: 20552076241255471, 2024.
Article in English | MEDLINE | ID: mdl-38778869

ABSTRACT

Objective: The mitotic activity index is an important prognostic factor in the diagnosis of cancer. The task of mitosis detection is difficult as the nuclei are microscopic in size and partially labeled, and there are many more non-mitotic nuclei than mitotic ones. In this paper, we highlight the challenges of current mitosis detection pipelines and propose a method to tackle these challenges. Methods: Our proposed methodology is inspired by recent research on deep learning and by an extensive analysis of the dataset and training pipeline. We first used the MiDoG'22 dataset for training, validation, and testing. We then tested the methodology without fine-tuning on the TUPAC'16 dataset and on a real-time case from Shaukat Khanum Memorial Cancer Hospital and Research Centre. Results: Our methodology has shown promising results both quantitatively and qualitatively. Quantitatively, our methodology achieved an F1-score of 0.87 on the MiDoG'22 dataset and an F1-score of 0.83 on the TUPAC dataset. Qualitatively, our methodology is generalizable and interpretable across various datasets and clinical settings. Conclusion: In this paper, we highlight the challenges of current mitosis detection pipelines and propose a method that can accurately predict mitotic nuclei. We illustrate the accuracy, generalizability, and interpretability of our approach across various datasets and clinical settings. Our methodology can speed up the adoption of computer-aided digital pathology in clinical settings.
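
For reference, the F1-scores quoted in the abstract above combine precision and recall into a single number. The following is a minimal Python sketch of the standard definition only; the function name and the counts are hypothetical and are not taken from the paper.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1: harmonic mean of precision and recall.

    Algebraically equal to 2*TP / (2*TP + FP + FN).
    Returns 0.0 when there are no positives at all.
    """
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical detection counts (illustration only, not from the paper):
# 8 mitotic figures found correctly, 2 false alarms, 2 missed.
example = f1_score(tp=8, fp=2, fn=2)
print(f"F1 = {example:.2f}")
```

Because mitotic nuclei are far rarer than non-mitotic ones, F1 is usually more informative here than plain accuracy, which a detector could inflate by predicting "non-mitotic" everywhere.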

3.
J Biomol Struct Dyn ; : 1-9, 2024 Jan 12.
Article in English | MEDLINE | ID: mdl-38214492

ABSTRACT

High-throughput protein-protein interaction (PPI) profiling and computational techniques have generated a large amount of PPI network data. The study of PPI networks helps in understanding the biological processes of the proteins. The comparative study of PPI networks helps in identifying the interactions conserved across species. This article presents a novel local PPI network aligner, 'GSLAlign', that consists of two stages. It first detects communities in the PPI networks by applying the GraphSAGE algorithm using gene expression data. In the second stage, the detected communities are aligned using a community aligner that is based on protein sequence similarity. The community detection algorithm produces more separable and biologically accurate communities than previous community detection algorithms. Moreover, the proposed community alignment algorithm achieves 3-8% better results in terms of semantic similarity than previous local aligners. The average connectivity and coverage of the proposed algorithm are also better than those of the existing aligners. Communicated by Ramaswamy H. Sarma.

4.
Article in English | MEDLINE | ID: mdl-38083556

ABSTRACT

Recent advances in Natural Language Processing (NLP) have produced state-of-the-art results on several sequence-to-sequence (seq2seq) tasks. Enhancements in embedders and their training methodologies have shown significant improvement on downstream tasks. Word vector models like Word2Vec, FastText, and GloVe were widely used over one-hot encoded vectors for years until the advent of deep contextualized embedders. Protein sequences consist of 20 naturally occurring amino acids that can be treated as the language of nature. These amino acids, in combination with each other, make up the biological functions. The choice of vector representation and architecture design for a biological task is highly dependent upon the nature of the task. We utilize unlabelled protein sequences to train a Convolution and Gated Recurrent Network (CGRN) embedder using the Masked Language Modeling (MLM) technique, which shows a significant performance boost in a resource-constrained setting on two downstream tasks, i.e., an F1-score (Q8) of 73.1% on Secondary Structure Prediction (SSP) and an F1-score of 84% on Intrinsically Disordered Region Prediction (IDRP). We also compare different architectures on downstream tasks to show the impact of the nature of the biological task on the performance of the model.


Subject(s)
Language , Natural Language Processing , Amino Acid Sequence , Unified Medical Language System (USA) , Amino Acids
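
The Masked Language Modeling objective mentioned in the abstract above can be illustrated on a protein sequence. This is only a sketch of the input-masking step under stated assumptions: the `<mask>` token, the 15% masking rate (a common convention from BERT-style training), the function name, and the example sequence are all hypothetical, and nothing here reproduces the CGRN embedder itself.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
MASK = "<mask>"  # hypothetical mask token


def mask_sequence(seq, mask_rate=0.15, seed=0):
    """Hide a random subset of residues and return (tokens, labels).

    tokens: the sequence as a list, with chosen residues replaced by MASK.
    labels: {position: original residue} that a model would be trained to predict.
    """
    rng = random.Random(seed)
    tokens = list(seq)
    # Mask at least one residue, but never more than the sequence holds.
    n_mask = min(len(tokens), max(1, round(mask_rate * len(tokens))))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    labels = {i: tokens[i] for i in positions}
    for i in positions:
        tokens[i] = MASK
    return tokens, labels


# Hypothetical 22-residue sequence, used purely for illustration.
tokens, labels = mask_sequence("MKTAYIAKQRQISFVKSHFSRQ")
print(tokens)
print(labels)
```

Training on unlabelled sequences works because the labels come from the sequence itself: the model sees `tokens` and is scored on how well it recovers the residues stored in `labels`.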
5.
J Biomol Struct Dyn ; : 1-10, 2023 Oct 03.
Article in English | MEDLINE | ID: mdl-37787617

ABSTRACT

Multidrug efflux is a well-established mechanism of drug resistance in bacterial pathogens like Salmonella Typhi. styMdtM (locus name; STY4874) is a multidrug efflux transporter of the major facilitator superfamily expressed in S. Typhi. Functional assays identified several residues important for its transport activity. Here, we used an AlphaFold model to identify additional residues for analysis by mutagenesis. Mutation of peripheral residue Cys185 had no effect on the structure or function of the transporter. However, substitution of channel-lining residues Tyr29 and Tyr231 completely abolished transport function. Finally, mutation of Gln294, which faces peripheral helices of the transporter, resulted in the loss of transport of some substrates. Crystallization studies yielded diffraction data for the wild-type protein at 4.5 Å resolution and allowed the unit cell parameters to be established as a = b = 64.3 Å, c = 245.4 Å, α = β = γ = 90°, in space group P4. Our studies represent a further stepping stone towards a mechanistic understanding of the clinically important multidrug transporter styMdtM. Communicated by Ramaswamy H. Sarma.

6.
Methods Mol Biol ; 2627: 321-328, 2023.
Article in English | MEDLINE | ID: mdl-36959455

ABSTRACT

β-barrel membrane proteins (βMPs), found in the outer membrane of gram-negative bacteria, mitochondria, and chloroplasts, play important roles in membrane anchoring, pore formation, and enzyme activities. However, it is often difficult to determine their structures experimentally, and the knowledge of their structures is currently limited. We have developed a method to predict the 3D architectures of βMPs. We can accurately construct transmembrane domains of βMPs by predicting their strand registers, from which full 3D atomic structures are derived. Using 3D Beta-barrel Membrane Protein Predictor (3D-BMPP), we can further accurately model the extended beta barrels and loops in non-TM regions with overall greater structure prediction coverage. 3D-BMPP is a general technique that can be applied to protein families with limited sequences as well as proteins with novel folds. 3D-BMPP can be broadly applied to genome-wide βMP structure prediction.


Subject(s)
Bacterial Outer Membrane Proteins , Membrane Proteins , Membrane Proteins/genetics , Membrane Proteins/chemistry , Protein Domains , Bacterial Outer Membrane Proteins/genetics , Bacterial Outer Membrane Proteins/chemistry
7.
Sci Rep ; 13(1): 806, 2023 01 16.
Article in English | MEDLINE | ID: mdl-36646775

ABSTRACT

Long non-coding RNAs (lncRNAs), which were once considered transcriptional noise, are now in the limelight of current research. LncRNAs play a major role in regulating various biological processes such as imprinting, cell differentiation, and splicing. Mutations of lncRNAs are involved in various complex diseases. Identifying lncRNA-disease associations has gained a lot of attention, as predicting them efficiently will lead to better disease treatment. In this study, we have developed a machine learning model that predicts disease-related lncRNAs by combining sequence- and structure-based features. SVM and Random Forest classifiers were trained on these features. We compared our method with the state of the art and obtained the highest F1 score of 76% with the SVM classifier. Moreover, this study has overcome two serious limitations of the reported method, namely the lack of redundancy checking and the implementation of oversampling for balancing the positive and negative classes. Our method has achieved improved performance among machine learning models reported for lncRNA-disease associations. Combining multiple features, specifically lncRNA sequence mutations, contributes significantly to disease-related lncRNA prediction.


Subject(s)
Long Noncoding RNA , Long Noncoding RNA/genetics , Computational Biology/methods , Machine Learning , Random Forest , Cell Differentiation
8.
Front Mol Biosci ; 9: 928530, 2022.
Article in English | MEDLINE | ID: mdl-36032678

ABSTRACT

The linguistic rules of medical terminology assist in gaining acquaintance with rare/complex clinical and biomedical terms. The medical language follows a Greek- and Latin-inspired nomenclature. This nomenclature aids the stakeholders in simplifying the medical terms and gaining semantic familiarity. However, natural language processing models misrepresent rare and complex biomedical words. In this study, we present MedTCS, a lightweight post-processing module, to simplify hybridized or compound terms into regular words using medical nomenclature. MedTCS enabled the word-based embedding models to achieve 100% coverage and enabled the BioWordVec model to achieve high correlation scores (0.641 and 0.603 in the UMNSRS similarity and relatedness datasets, respectively) that significantly surpass the n-gram and sub-word approaches of FastText and BERT. In the downstream task of named entity recognition (NER), MedTCS enabled the latest clinical embedding model, FastText-OA-All-300d, to improve the F1-score from 0.45 to 0.80 on the BC5CDR corpus and from 0.59 to 0.81 on the NCBI-Disease corpus. Similarly, in the drug indication classification task, our model was able to increase the coverage by 9% and the F1-score by 1%. Our results indicate that incorporating a medical terminology-based module provides distinctive contextual clues to enhance vocabulary as a post-processing step on pre-trained embeddings. We demonstrate that the proposed module enables the word embedding models to generate vectors of out-of-vocabulary words effectively. We expect that our study can be a stepping stone for the use of biomedical knowledge-driven resources in NLP.

9.
Evol Bioinform Online ; 18: 11769343221110658, 2022.
Article in English | MEDLINE | ID: mdl-35898232

ABSTRACT

Motivation: The advancement of high-throughput PPI profiling techniques has generated a large amount of PPI data. The alignment of PPI networks uncovers relationships between species that can help in understanding biological systems. The comparative study reveals the conserved biological interactions of the proteins across species. It can also help in studying the biological pathways and signal networks of the cells. Although several network alignment algorithms have been developed to study and compare PPI data, developing an aligner that aligns PPI networks with high biological similarity and coverage is still challenging. Results: This paper presents a novel global network alignment algorithm, BioAlign, that incorporates a significant amount of biological information. Existing studies use global sequence and/or 3D-structure similarity to align the PPI networks. In contrast, BioAlign uses local sequence similarity, predicted secondary structure motifs, and remote homology in addition to global sequence and 3D-structure similarity. The extra sources of biological information help BioAlign to align proteins with high biological similarity. BioAlign produces significantly better results in terms of AFS and Coverage (6-32 and 7-34 with respect to MF and BP, respectively) than the existing algorithms. BioAlign aligns a much larger number of proteins that have high biological similarity than the existing aligners. BioAlign helps in studying functionally similar protein pairs across species.

10.
Sci Rep ; 12(1): 3818, 2022 03 09.
Article in English | MEDLINE | ID: mdl-35264663

ABSTRACT

The Gene Ontology (GO) is a controlled vocabulary that captures the semantics or context of an entity based on its functional role. Biomedical entities are frequently compared to each other to find similarities to help in data annotation and knowledge transfer. In this study, we propose GOntoSim, a novel method to determine the functional similarity between genes. GOntoSim quantifies the similarity between pairs of GO terms by taking the graph structure and the information content of nodes into consideration. Our measure quantifies the similarity between the ancestors of the GO terms accurately. It also takes into account the common children of the GO terms. GOntoSim is evaluated using the entire Enzyme Dataset containing 10,890 proteins and 97,544 GO annotations. The enzymes are clustered and compared with the Gold Standard EC numbers. At level 1 of the EC Numbers for Molecular Function, GOntoSim achieves a purity score of 0.75, compared to 0.47 and 0.51 for GOGO and Wang, respectively. GOntoSim can handle the noisy IEA annotations. We achieve a purity score of 0.94, in contrast to 0.48 for both GOGO and Wang, at level 1 of the EC Numbers with IEA annotations. GOntoSim can be freely accessed at http://www.cbrlab.org/GOntoSim.html.


Subject(s)
Computational Biology , Semantics , Algorithms , Child , Computational Biology/methods , Gene Ontology , Humans , Molecular Sequence Annotation
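
The purity score used in the abstract above to compare clusters against EC numbers is commonly defined as the fraction of items that belong to the majority class of their own cluster. The sketch below assumes that standard definition; the paper's exact evaluation procedure may differ, and the cluster assignments and EC labels shown are made up for illustration.

```python
from collections import Counter


def purity(cluster_ids, class_labels):
    """Cluster purity: sum of each cluster's majority-class count, divided by N."""
    if len(cluster_ids) != len(class_labels):
        raise ValueError("cluster_ids and class_labels must have equal length")
    if not cluster_ids:
        return 0.0
    per_cluster = {}
    for cluster, label in zip(cluster_ids, class_labels):
        per_cluster.setdefault(cluster, Counter())[label] += 1
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(cluster_ids)


# Hypothetical toy data: 8 enzymes, 3 clusters, EC level-1 classes as strings.
clusters = [0, 0, 0, 1, 1, 1, 2, 2]
ec_level1 = ["1", "1", "2", "3", "3", "3", "2", "2"]
print(purity(clusters, ec_level1))
```

A purity of 1.0 means every cluster contains a single EC class; note that purity rises trivially as the number of clusters grows, so it is only comparable between methods at a fixed cluster count.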
11.
J Biomol Struct Dyn ; 40(3): 1205-1215, 2022 02.
Article in English | MEDLINE | ID: mdl-32964802

ABSTRACT

COVID-19, an outbreak of a novel coronavirus originating from Wuhan, China, in December 2019, has now spread across the entire world and has been declared a pandemic by the WHO. Angiotensin converting enzyme 2 (ACE2) is a host receptor protein that interacts with the viral spike glycoprotein to facilitate the entry of the coronavirus (SARS-CoV-2), hence causing the disease (COVID-19). Our experimental design is based on a bioinformatics approach that combines sequence-, structure-, and consensus-based tools to label a protein-coding single nucleotide polymorphism (SNP) as damaging/deleterious or neutral. The interactions of wild-type ACE2 and spike glycoprotein and of their variants were analyzed using docking studies. The mutations W461R, G405E and F588S in the ACE2 receptor protein and the population-specific mutations P391S, C12S and G1223A in the spike glycoprotein were predicted to be highly destabilizing to the structure of the bound complex. So far, no extensive in silico study has been reported that identifies the effect of SNPs on the spike glycoprotein-ACE2 interaction by exploring both sequence and structural features. To this end, this study conducted an in-depth analysis that helps identify the mutations that block the interaction of the two proteins, which can stop the virus from entering the host cell. Communicated by Ramaswamy H. Sarma.


Subject(s)
Angiotensin-Converting Enzyme 2 , COVID-19 , Single Nucleotide Polymorphism , Coronavirus Spike Glycoprotein , Humans , Molecular Docking Simulation , Protein Binding , Viral RNA , SARS-CoV-2 , Coronavirus Spike Glycoprotein/genetics , Coronavirus Spike Glycoprotein/metabolism , Virus Internalization
12.
Annu Int Conf IEEE Eng Med Biol Soc ; 2021: 2025-2029, 2021 11.
Article in English | MEDLINE | ID: mdl-34891685

ABSTRACT

Electroencephalogram (EEG) is a widely used technique to diagnose psychological disorders. Until now, most studies have focused on the diagnosis of one particular psychological disorder using EEG. We propose a generic approach to diagnose different types of psychological disorders with high accuracy. The proposed approach is tested on five different datasets and three psychological disorders. Electrodes with a higher signal-to-noise ratio are selected from the raw EEG signals. Multiple linear and non-linear features are then extracted from the selected electrodes. After feature selection, machine learning is used to diagnose the psychological disorders. We kept the same generic approach for all the datasets and diseases and achieved F1 scores of 93%, 85% and 80% on schizophrenia, epilepsy and Parkinson's disease, respectively.


Subject(s)
Epilepsy , Computer-Assisted Signal Processing , Algorithms , Electroencephalography , Epilepsy/diagnosis , Humans , Support Vector Machine
13.
Annu Int Conf IEEE Eng Med Biol Soc ; 2021: 2100-2103, 2021 11.
Article in English | MEDLINE | ID: mdl-34891703

ABSTRACT

Long non-coding RNAs (lncRNAs) have generated much scientific interest because of their functional significance in regulating various biological processes and because their dysfunction has been implicated in disease progression. LncRNAs usually bind with proteins to perform their function. The experimental approaches for identifying these interactions are time-consuming and expensive. Lately, numerous methods for predicting lncRNA-protein interactions have been reported, yet they all have some prevalent drawbacks that limit their prediction performance. In this research, we propose a computational method based on a similarity scheme that integrates features derived from sequence and structure similarities. When compared with the state of the art, the proposed method achieved the highest performance, with an accuracy and F1 measure of 98.6% and 98.7%, respectively, using XGBoost as the classifier. Our results show that by combining sequence- and structure-based features, lncRNA-protein interactions can be better predicted, and that such predictions can complement the experimental techniques for this task. Clinical Relevance: LncRNA-protein interactions play a significant role in regulating various biological processes. This can help in providing early diagnosis and better treatment for cancer-related diseases.


Subject(s)
Long Noncoding RNA , Computational Biology , Machine Learning , Long Noncoding RNA/genetics
14.
Annu Int Conf IEEE Eng Med Biol Soc ; 2021: 4139-4142, 2021 11.
Article in English | MEDLINE | ID: mdl-34892137

ABSTRACT

Notch signaling is responsible for creating contrasting states of differentiation among neighboring cells during an organism's early development. Various factors can affect this highly conserved intercellular signaling pathway in the formation of fine-grained patterns in cell tissues. As cells undergo dramatic structural changes during development, one of the factors that can influence cell-cell communication is cell morphology. In this study, we elucidate the role of cell morphology in mosaic pattern formation in a realistic epithelial layer cell model. We discovered that cell signaling strength is inversely related to the cell area, such that smaller cells have a higher probability of becoming signal-producing cells than larger cells during early embryonic days. In a nutshell, our work highlights the role of cell morphology in the stochastic cell fate decision process in the epithelial layer of multicellular organisms.


Subject(s)
Cell Communication , Signal Transduction , Cell Differentiation , Stochastic Processes
15.
Annu Int Conf IEEE Eng Med Biol Soc ; 2021: 4143-4146, 2021 11.
Article in English | MEDLINE | ID: mdl-34892138

ABSTRACT

Notch signaling (NS) determines the fate of adjacent cells during metazoan development. This intercellular signaling mechanism regulates diverse developmental processes such as cell differentiation, proliferation, and survival, and is considered responsible for maintaining cellular homeostasis. In this study, we elucidate the role of Notch heterogeneity (NH) in cell fate determination. We studied the role of NH at the intercellular level, at the intracellular level, and when Notch variation coexists at both levels simultaneously, in direct cell-cell signaling on an irregular cell mosaic. In addition, the effect of intracellular Notch receptor diffusion on an irregular cell lattice is also taken into account during the Delta-Notch lateral inhibition (LI) process. Through mathematical and computational models, we discovered that the classical checkerboard pattern formation can be reproduced with an accuracy of 70-81% by accounting for NH in a realistic epithelial layer of multicellular organisms.


Subject(s)
Membrane Proteins , Notch Receptors , Cell Communication , Cell Differentiation , Signal Transduction
16.
Protein Sci ; 30(9): 1935-1945, 2021 09.
Article in English | MEDLINE | ID: mdl-34118089

ABSTRACT

Enzymes are critical proteins in every organism. They speed up essential chemical reactions, help fight diseases, and are widely used in the pharmaceutical and manufacturing industries. Wet lab experiments to figure out an enzyme's function are time-consuming and expensive. Therefore, computational approaches to address this problem are becoming necessary. Usually, an enzyme is extremely specific in performing its function. However, there exist enzymes that can perform multiple functions. A multi-functional enzyme has vast potential as it reduces the need to discover/use different enzymes for different functions. We propose an approach to predict a multi-functional enzyme's function up to the most specific fourth level of the hierarchy of the Enzyme Commission (EC) number. Previous studies can only predict the function of the enzyme up to level 1. Using a dataset of 2,583 multi-functional enzymes, we achieved a hierarchical subset accuracy of 71.4% and a Macro F1 Score of 96.1% at the fourth level. The robustness of the network was further tested on a multi-functional isoforms dataset. Our method is broadly applicable and may be used to discover better enzymes. The web-server can be freely accessed at http://hecnet.cbrlab.org/.


Subject(s)
Deep Learning , Enzymes/chemistry , Enzymes/classification , Biocatalysis , Datasets as Topic , Enzymes/metabolism , Structure-Activity Relationship , Terminology as Topic
17.
Genomics Proteomics Bioinformatics ; 19(6): 986-997, 2021 12.
Article in English | MEDLINE | ID: mdl-33794377

ABSTRACT

Current FDA-approved kinase inhibitors cause diverse adverse effects, some of which are due to the mechanism-independent effects of these drugs. Identifying these mechanism-independent interactions could improve drug safety and support drug repurposing. Here, we develop iDTPnd (integrated Drug Target Predictor with negative dataset), a computational approach for large-scale discovery of novel targets for known drugs. For a given drug, we construct a positive structural signature as well as a negative structural signature that captures the weakly conserved structural features of drug-binding sites. To facilitate assessment of unintended targets, iDTPnd also provides a docking-based interaction score and its statistical significance. We confirm the interactions of sorafenib, imatinib, dasatinib, sunitinib, and pazopanib with their known targets at a sensitivity of 52% and a specificity of 55%. We also validate 10 predicted novel targets by using in vitro experiments. Our results suggest that proteins other than kinases, such as nuclear receptors, cytochrome P450, and MHC class I molecules, can also be physiologically relevant targets of kinase inhibitors. Our method is general and broadly applicable for the identification of protein-small molecule interactions, when sufficient drug-target 3D data are available. The code for constructing the structural signatures is available at https://sfb.kaust.edu.sa/Documents/iDTP.zip.


Subject(s)
Proteins , Proteins/metabolism
18.
Comput Med Imaging Graph ; 89: 101863, 2021 04.
Article in English | MEDLINE | ID: mdl-33578222

ABSTRACT

The mortality rate of breast cancer in women has increased in both the West and the East. Early detection is important to improve the survival rate of cancer patients. The manual detection and identification of cancer in whole slide images are critical and difficult tasks for pathologists. In this work, we introduce PMNet, a pipeline to detect regions with invasive characteristics in whole slide images. Our method employs scaled networks for detecting breast cancer in whole slide images. It classifies whole slide images at the patch level into normal, benign, in situ and invasive tumors. Our approach yielded an F1-score of 88.9(±1.7)%, which outperforms the benchmark F1-score of 81.2(±1.3)% at the patch level, and achieved an average Dice coefficient of 69.8% on 10 whole slide images, compared to the benchmark average Dice coefficient of 61.5%, on the BACH dataset. Similarly, on the Dryad test dataset, which comprises 173 whole slide images, we achieved an average Dice coefficient of 82.7%, compared to the previous state of the art of 76%, without fine-tuning on this dataset. We further propose a method to generate patch-level annotations for the image-level TCGA breast cancer database that will be useful for future deep learning methods.


Subject(s)
Breast Neoplasms , Breast Neoplasms/diagnostic imaging , Female , Humans , Probability
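
The Dice coefficient reported in the abstract above measures the overlap between a predicted segmentation mask and the ground-truth mask. A minimal sketch of the standard formula, 2|A∩B| / (|A| + |B|), on flattened binary masks follows; the masks are hypothetical toy data, and returning 1.0 for two empty masks is a convention chosen here, not something stated in the paper.

```python
def dice_coefficient(pred, truth):
    """Dice overlap between two equal-length binary masks (sequences of 0/1)."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    # Two empty masks agree perfectly; treat that as full overlap.
    return 2 * intersection / total if total else 1.0


# Hypothetical flattened masks: 1 = invasive region, 0 = background.
predicted = [1, 1, 0, 0, 1]
annotated = [1, 0, 0, 1, 1]
print(f"Dice = {dice_coefficient(predicted, annotated):.3f}")
```

Unlike patch-level F1, which scores each patch classification, Dice is computed over the pixels (or patches) of a whole region, which is why the abstract reports the two metrics separately.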
19.
BMC Bioinformatics ; 21(1): 500, 2020 Nov 04.
Article in English | MEDLINE | ID: mdl-33148180

ABSTRACT

BACKGROUND: High-throughput experiments have generated a significantly large amount of protein interaction data, which is being used to study protein networks. Studying complete protein networks can reveal more insight about healthy/disease states than studying proteins in isolation. Similarly, a comparative study of protein-protein interaction (PPI) networks of different species reveals important insights which may help in disease analysis and drug design. The study of PPI network alignment can also help in understanding the different biological systems of different species. It can also be used to transfer knowledge across different species. Different aligners have been introduced in the last decade, but developing an accurate and scalable global alignment algorithm that ensures biologically significant alignments is still challenging. RESULTS: This paper presents a novel global pairwise network alignment algorithm, SAlign, which uses topological and biological information in the alignment process. The proposed algorithm incorporates sequence and structural information for computing biological scores, whereas previous algorithms only use sequence information. The alignment based on the proposed technique shows that the combined effect of structure and sequence results in significantly better pairwise alignments. We have compared SAlign with state-of-the-art algorithms on the basis of the semantic similarity of the alignment and the number of aligned nodes on multiple PPI network pairs. The results of SAlign on the network pairs which have a high percentage of proteins with available structure are 3-63% semantically better than all existing techniques. Furthermore, it also aligns 5-14% more nodes of these network pairs than existing aligners. The results of SAlign on other PPI network pairs are comparable to or better than all existing techniques. We also introduce [Formula: see text], a Monte Carlo based alignment algorithm that produces multiple network alignments with similar semantic similarity. This helps the user to pick biologically meaningful alignments. CONCLUSION: The proposed algorithm is able to find alignments that are more biologically significant/relevant than the alignments of existing aligners. Furthermore, the proposed method is able to generate alternate alignments that help in studying different genes/proteins of the species.


Subject(s)
Algorithms , Protein Interaction Maps , Proteins/metabolism , Animals , Protein Databases , Humans , Mice , Monte Carlo Method , Proteins/chemistry , Yeasts/metabolism
20.
Annu Int Conf IEEE Eng Med Biol Soc ; 2020: 5842-5846, 2020 07.
Article in English | MEDLINE | ID: mdl-33019302

ABSTRACT

DNA sequencing of tumor cells has revealed thousands of genetic mutations. However, cancer is caused by only some of them. Distinguishing mutations that contribute to tumor growth from neutral ones is extremely challenging and is currently carried out manually. This manual annotation is very cumbersome and expensive in terms of time and money. In this study, we introduce a novel method, "NLP-SNPPred", to read scientific literature and learn the implicit features that cause certain variations to be pathogenic. Precisely, our method ingests the biomedical literature and produces its vector representation by exploiting state-of-the-art NLP methods like sent2vec, word2vec and tf-idf. These representations are then fed to machine learning predictors to distinguish pathogenic from neutral variations. Our best model (NLP-SNPPred), trained on OncoKB and evaluated on several publicly available benchmark datasets, outperformed state-of-the-art function prediction methods. Our results show that NLP can be used effectively in predicting the functional impact of protein-coding variations with minimal complementary biological features. Moreover, encoding biological knowledge into the right representations, combined with machine learning methods, can help in automating manual efforts. A free-to-use web-server is available at http://www.nlp-snppred.cbrlab.org.


Subject(s)
Natural Language Processing , Proteins , Machine Learning , Mutation , Virulence
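
Of the text representations named in the abstract above, tf-idf is the simplest to show in a few lines. The sketch below uses the plain textbook weighting tf × log(N/df) with whitespace tokenization; production libraries typically add smoothing and normalization, and neither the weighting variant nor the two example sentences come from the paper.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocab, vectors) with weights tf * log(N / df) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = sorted(df)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append([
            (tf[term] / len(tokens)) * math.log(n_docs / df[term]) if tokens else 0.0
            for term in vocab
        ])
    return vocab, vectors


# Hypothetical literature snippets about two variants (illustration only).
snippets = [
    "BRAF V600E activates signaling",
    "BRAF variant is benign",
]
vocab, vectors = tfidf(snippets)
print(vocab)
print(vectors[0])
```

Terms shared by every document (here "braf") get weight zero, so the vectors emphasize the words that distinguish one variant's literature from another's; such vectors could then be passed to any standard classifier.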