ABSTRACT
SUMMARY: Imagine if we could simultaneously predict spatial protein expression in tissues from their routine Hematoxylin and Eosin (H&E) stained images, and create tissue images given protein expression profiles thus enabling virtual simulations of how protein expression alterations impact histology in complex diseases like cancer. Such an approach could lead to more informed diagnostic and therapeutic decisions for precision medicine at lower costs and shorter turnaround times, more detailed insights into underlying disease pathology as well as improvement in predictive and generative performance. In this study, we investigate the intricate correlation between protein expressions obtained from Hyperion mass cytometry and histopathological microstructures in conventional H&E stained glioblastoma (GBM) samples, unveiling morphological patterns and cellular-level spatial alterations associated with protein expression changes. To model these complex relationships, we propose a novel generative-predictive framework called Ouroboros for producing H&E images from protein expressions and simultaneously predicting protein expressions from H&E images. Our comprehensive sample-independent validation over 9920 tissue spots from 4 GBM samples encompassing visual image analysis, quantitative analysis, subspace alignment and perturbation experiments shows that the proposed generative-predictive approach offers significant improvements in predicting protein expression from images in comparison to baseline methods as well as accurate generation of virtual GBM sample images. This proof of concept study can contribute to advancing our understanding of histological responses to protein expression perturbations and lays the foundations for further developments in this area. AVAILABILITY AND IMPLEMENTATION: Implementation and associated data for the proposed approach are available at the URL: https://github.com/Srijay/Ouroboros.
Subjects
Glioblastoma , Humans , Glioblastoma/metabolism , Glioblastoma/pathology , Glioblastoma/diagnostic imaging , Image Processing, Computer-Assisted/methods , Brain Neoplasms/metabolism , Brain Neoplasms/pathology , Computational Biology/methods
ABSTRACT
In recent years, artificial intelligence (AI) has demonstrated exceptional performance in mitosis identification and quantification. However, the implementation of AI in clinical practice needs to be evaluated against the existing methods. This study is aimed at assessing the optimal method of using AI-based mitotic figure scoring in breast cancer (BC). We utilized whole slide images from a large cohort of BC with extended follow-up comprising a discovery (n = 1715) and a validation (n = 859) set (Nottingham cohort). The Cancer Genome Atlas of breast invasive carcinoma (TCGA-BRCA) cohort (n = 757) was used as an external test set. Employing automated mitosis detection, the mitotic count was assessed using 3 different methods, the mitotic count per tumor area (MCT; calculated by dividing the number of mitotic figures by the total tumor area), the mitotic index (MI; defined as the average number of mitotic figures per 1000 malignant cells), and the mitotic activity index (MAI; defined as the number of mitotic figures in 3 mm² area within the mitotic hotspot). These automated metrics were evaluated and compared based on their correlation with the well-established visual scoring method of the Nottingham grading system and Ki67 score, clinicopathologic parameters, and patient outcomes. AI-based mitotic scores derived from the 3 methods (MCT, MI, and MAI) were significantly correlated with the clinicopathologic characteristics and patient survival (P < .001). However, the mitotic counts and the derived cutoffs varied significantly between the 3 methods. Only MAI and MCT were positively correlated with the gold standard visual scoring method used in Nottingham grading system (r = 0.8 and r = 0.7, respectively) and Ki67 scores (r = 0.69 and r = 0.55, respectively), and MAI was the only independent predictor of survival (P < .05) in multivariate Cox regression analysis. For clinical applications, the optimum method of scoring mitosis using AI needs to be considered.
MAI can provide reliable and reproducible results and can accurately quantify mitotic figures in BC.
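The three metrics defined in this abstract (MCT, MI, MAI) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the input format (mitosis coordinates in millimetres), and the choice of an axis-aligned square 3 mm² window found by brute-force search are all assumptions made for illustration.

```python
import math


def mitotic_count_per_tumor_area(n_mitoses, tumor_area_mm2):
    """MCT: number of mitotic figures divided by the total tumor area."""
    return n_mitoses / tumor_area_mm2


def mitotic_index(n_mitoses, n_malignant_cells):
    """MI: mitotic figures per 1000 malignant cells."""
    return 1000.0 * n_mitoses / n_malignant_cells


def mitotic_activity_index(mitoses_mm, hotspot_area_mm2=3.0):
    """MAI: mitotic figures inside the densest window of the given area.

    Assumption: the hotspot is an axis-aligned square window. Candidate
    windows have their lower-left corner at every combination of observed
    x and y coordinates, which is sufficient to find the best such square.
    """
    if not mitoses_mm:
        return 0
    side = math.sqrt(hotspot_area_mm2)
    xs = sorted({x for x, _ in mitoses_mm})
    ys = sorted({y for _, y in mitoses_mm})
    best = 0
    for x0 in xs:
        for y0 in ys:
            count = sum(
                1
                for x, y in mitoses_mm
                if x0 <= x <= x0 + side and y0 <= y <= y0 + side
            )
            best = max(best, count)
    return best


# Hypothetical coordinates (mm) of detected mitotic figures on one slide.
detections = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0), (10.0, 10.0)]
mct = mitotic_count_per_tumor_area(len(detections), tumor_area_mm2=8.0)
mi = mitotic_index(len(detections), n_malignant_cells=2000)
mai = mitotic_activity_index(detections)
```

The brute-force hotspot search is cubic in the number of detections, which is acceptable for a sketch but would need a spatial index or a density map on real whole slide images.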
Subjects
Breast Neoplasms , Humans , Female , Breast Neoplasms/pathology , Ki-67 Antigen , Artificial Intelligence , Mitosis , Mitotic Index
ABSTRACT
Computational pathology is currently witnessing a surge in the development of AI techniques, offering promise for achieving breakthroughs and significantly impacting the practices of pathology and oncology. These AI methods bring with them the potential to revolutionize diagnostic pipelines as well as treatment planning and overall patient care. Numerous peer-reviewed studies reporting remarkable performance across diverse tasks serve as a testimony to the potential of AI in the field. However, widespread adoption of these methods in clinical and pre-clinical settings still remains a challenge. In this review article, we present a detailed analysis of the major obstacles encountered during the development of effective models and their deployment in practice. We aim to provide readers with an overview of the latest developments, assist them with insights into identifying some specific challenges that may require resolution, and suggest recommendations and potential future research directions. © 2023 The Authors. The Journal of Pathology published by John Wiley & Sons Ltd on behalf of The Pathological Society of Great Britain and Ireland.
Subjects
Artificial Intelligence , Humans , United Kingdom
ABSTRACT
Triple-negative breast cancer (TNBC) is known to have a relatively poor outcome with variable prognoses, raising the need for more informative risk stratification. We investigated a set of digital, artificial intelligence (AI)-based spatial tumour microenvironment (sTME) features and explored their prognostic value in TNBC. After performing tissue classification on digitised haematoxylin and eosin (H&E) slides of TNBC cases, we employed a deep learning-based algorithm to segment tissue regions into tumour, stroma, and lymphocytes in order to compute quantitative features concerning the spatial relationship of tumour with lymphocytes and stroma. The prognostic value of the digital features was explored using survival analysis with Cox proportional hazard models in a cross-validation setting on two independent international multi-centric TNBC cohorts: The Australian Breast Cancer Tissue Bank (AUBC) cohort (n = 318) and The Cancer Genome Atlas Breast Cancer (TCGA) cohort (n = 111). The proposed digital stromal tumour-infiltrating lymphocytes (Digi-sTILs) score and the digital tumour-associated stroma (Digi-TAS) score were found to carry strong prognostic value for disease-specific survival, with the Digi-sTILs and Digi-TAS scores giving C-index values of 0.65 (p = 0.0189) and 0.60 (p = 0.0437), respectively, on the TCGA cohort as a validation set. Combining the Digi-sTILs feature with the patient's positivity status for axillary lymph nodes yielded a C-index of 0.76 on unseen validation cohorts. We surmise that the proposed digital features could potentially be used for better risk stratification and management of TNBC patients. © 2023 The Authors. The Journal of Pathology published by John Wiley & Sons Ltd on behalf of The Pathological Society of Great Britain and Ireland.
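The C-index values reported in this abstract measure how well a risk score orders patients by survival time. The following is a minimal sketch of Harrell's concordance index, assuming right-censored data and ignoring pairs with tied survival times; the function name and the toy data are hypothetical and do not come from the study.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored survival data.

    A pair (i, j) is comparable when patient i had an observed event and
    a strictly shorter time than patient j. The pair is concordant when
    the patient with the shorter survival has the higher risk score.
    Ties in risk score count as half-concordant.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored patients cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical follow-up times (years), event flags, and risk scores.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk = [0.9, 0.7, 0.8, 0.1]
c_index = concordance_index(times, events, risk)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which gives a scale for reading the reported 0.60 to 0.76 range.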
Subjects
Triple Negative Breast Neoplasms , Humans , Triple Negative Breast Neoplasms/genetics , Triple Negative Breast Neoplasms/pathology , Lymphocytes, Tumor-Infiltrating/pathology , Artificial Intelligence , Australia , Prognosis , Tumor Microenvironment
ABSTRACT
OBJECTIVE: To develop an interpretable artificial intelligence algorithm to rule out normal large bowel endoscopic biopsies, saving pathologist resources and helping with early diagnosis. DESIGN: A graph neural network was developed incorporating pathologist domain knowledge to classify 6591 whole-slide images (WSIs) of endoscopic large bowel biopsies from 3291 patients (approximately 54% female, 46% male) as normal or abnormal (non-neoplastic and neoplastic) using clinically driven interpretable features. One UK National Health Service (NHS) site was used for model training and internal validation. External validation was conducted on data from two other NHS sites and one Portuguese site. RESULTS: Model training and internal validation were performed on 5054 WSIs of 2080 patients resulting in an area under the curve-receiver operating characteristic (AUC-ROC) of 0.98 (SD=0.004) and AUC-precision-recall (PR) of 0.98 (SD=0.003). The performance of the model, named Interpretable Gland-Graphs using a Neural Aggregator (IGUANA), was consistent in testing over 1537 WSIs of 1211 patients from three independent external datasets with mean AUC-ROC=0.97 (SD=0.007) and AUC-PR=0.97 (SD=0.005). At a high sensitivity threshold of 99%, the proposed model can reduce the number of normal slides to be reviewed by a pathologist by approximately 55%. IGUANA also provides an explainable output highlighting potential abnormalities in a WSI in the form of a heatmap as well as numerical values associating the model prediction with various histological features. CONCLUSION: The model achieved consistently high accuracy showing its potential in optimising increasingly scarce pathologist resources. Explainable predictions can guide pathologists in their diagnostic decision-making and help boost their confidence in the algorithm, paving the way for its future clinical adoption.
Subjects
Artificial Intelligence , State Medicine , Humans , Male , Female , Retrospective Studies , Algorithms , Biopsy
ABSTRACT
BACKGROUND: Tumour infiltrating lymphocytes (TILs) are a prognostic parameter in triple-negative and human epidermal growth factor receptor 2 (HER2)-positive breast cancer (BC). However, their role in luminal (oestrogen receptor positive and HER2 negative (ER+/HER2-)) BC remains unclear. In this study, we used artificial intelligence (AI) to assess the prognostic significance of TILs in a large well-characterised cohort of luminal BC. METHODS: Supervised deep learning model analysis of Haematoxylin and Eosin (H&E)-stained whole slide images (WSI) was applied to a cohort of 2231 luminal early-stage BC patients with long-term follow-up. Stromal TILs (sTILs) and intratumoural TILs (tTILs) were quantified, and their spatial distribution within tumour tissue, as well as the proportion of stroma involved by sTILs, were assessed. The association of TILs with clinicopathological parameters and patient outcome was determined. RESULTS: A strong positive linear correlation was observed between sTILs and tTILs. High sTILs and tTILs counts, as well as their proximity to stromal and tumour cells (co-occurrence), were associated with poor clinical outcomes and unfavourable clinicopathological parameters including high tumour grade, lymph node metastasis, large tumour size, and young age. AI-based assessment of the proportion of stroma composed of sTILs (as assessed visually in routine practice) was not predictive of patient outcome. The tTILs count was an independent predictor of worse patient outcome in multivariate Cox regression analysis. CONCLUSION: AI-based detection of TIL counts and their spatial distribution provides prognostic value in luminal early-stage BC patients. The utilisation of AI algorithms could provide a comprehensive assessment of TILs as a morphological variable in WSIs beyond eyeballing assessment.
Subjects
Breast Neoplasms , Triple Negative Breast Neoplasms , Humans , Female , Breast Neoplasms/pathology , Lymphocytes, Tumor-Infiltrating/pathology , Artificial Intelligence , Prognosis , Triple Negative Breast Neoplasms/pathology , Biomarkers, Tumor/metabolism
ABSTRACT
As digital pathology replaces conventional glass slide microscopy as a means of reporting cellular pathology samples, the annotation of digital pathology whole slide images is rapidly becoming part of a pathologist's regular practice. Currently, there is no recognizable organization of these annotations, and as a result, pathologists adopt an arbitrary approach to defining regions of interest, leading to irregularity and inconsistency and limiting the downstream efficient use of this valuable effort. In this study, we propose a Standardized Annotation Reporting Style for digital whole slide images. We formed a list of 167 commonly annotated entities (under 12 specialty subcategories) based on review of Royal College of Pathologists and College of American Pathologists documents, feedback from reporting pathologists in our NHS department, and experience in developing annotation dictionaries for PathLAKE research projects. Each entity was assigned a suitable annotation shape, SNOMED CT (SNOMED International) code, and unique color. Additionally, as an example of how the approach could be expanded to specific tumor types, all lung tumors in the fifth edition of the World Health Organization classification of thoracic tumors (2021) were included. The proposed standardization of annotations increases their utility, making them identifiable at low power and searchable across and between cases. This would aid pathologists reporting and reviewing cases and enable annotations to be used for research. This structured approach could serve as the basis for an industry standard and be easily adopted to ensure maximum functionality and efficiency in the use of annotations made during routine clinical examination of digital slides.
Subjects
Pathology, Clinical , Pathology, Surgical , Thoracic Neoplasms , Humans , Pathology, Clinical/methods , Pathology, Surgical/methods , Pathologists , Microscopy/methods
ABSTRACT
Tumor-associated stroma in breast cancer (BC) is complex and exhibits a high degree of heterogeneity. To date, no standardized assessment method has been established. Artificial intelligence (AI) could provide an objective morphologic assessment of tumors and stroma, with the potential to identify new features not discernible by visual microscopy. In this study, we used AI to assess the clinical significance of (1) stroma-to-tumor ratio (S:TR) and (2) the spatial arrangement of stromal cells, tumor cell density, and tumor burden in BC. Whole-slide images of a large cohort (n = 1968) of well-characterized luminal BC cases were examined. Region and cell-level annotation was performed, and supervised deep learning models were applied for automated quantification of tumor and stromal features. S:TR was calculated in terms of surface area and cell count ratio, and the S:TR heterogeneity and spatial distribution were also assessed. Tumor cell density and tumor size were used to estimate tumor burden. Cases were divided into discovery (n = 1027) and test (n = 941) sets for validation of the findings. In the whole cohort, the stroma-to-tumor mean surface area ratio was 0.74, and stromal cell density heterogeneity score was high (0.7/1). BC with high S:TR showed features characteristic of good prognosis and longer patient survival in both the discovery and test sets. Heterogeneous spatial distribution of S:TR areas was predictive of worse outcome. Higher tumor burden was associated with aggressive tumor behavior and shorter survival and was an independent predictor of worse outcome (BC-specific survival; hazard ratio: 1.7, P = .03, 95% CI, 1.04-2.83 and distant metastasis-free survival; hazard ratio: 1.64, P = .04, 95% CI, 1.01-2.62) superior to absolute tumor size. The study concludes that AI provides a tool to assess major and subtle morphologic stromal features in BC with prognostic implications. Tumor burden is more prognostically informative than tumor size.
ABSTRACT
MOTIVATION: Digitization of pathology laboratories through digital slide scanners and advances in deep learning approaches for objective histological assessment have resulted in rapid progress in the field of computational pathology (CPath) with wide-ranging applications in medical and pharmaceutical research as well as clinical workflows. However, the estimation of robustness of CPath models to variations in input images is an open problem with a significant impact on the downstream practical applicability, deployment and acceptability of these approaches. Furthermore, development of domain-specific strategies for enhancement of robustness of such models is of prime importance as well. RESULTS: In this work, we propose the first domain-specific Robustness Evaluation and Enhancement Toolbox (REET) for computational pathology applications. It provides a suite of algorithmic strategies for enabling robustness assessment of predictive models with respect to specialized image transformations such as staining, compression, focusing, blurring, changes in spatial resolution, brightness variations, geometric changes as well as pixel-level adversarial perturbations. Furthermore, REET also enables efficient and robust training of deep learning pipelines in computational pathology. Python implementation of REET is available at https://github.com/alexjfoote/reetoolbox. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Computational Biology , Software
ABSTRACT
MOTIVATION: Machine-learning-based prediction of compound-protein interactions (CPIs) is important for drug design, screening and repurposing. Despite numerous recent publications with increasing methodological sophistication claiming consistent improvements in predictive accuracy, we have observed a number of fundamental issues in experiment design that produce overoptimistic estimates of model performance. RESULTS: We systematically analyze the impact of several factors affecting generalization performance of CPI predictors that are overlooked in existing work: (i) similarity between training and test examples in cross-validation; (ii) synthesizing negative examples in absence of experimentally verified negative examples and (iii) alignment of evaluation protocol and performance metrics with real-world use of CPI predictors in screening large compound libraries. Using both state-of-the-art approaches by other researchers as well as a simple kernel-based baseline, we have found that effective assessment of generalization performance of CPI predictors requires careful control over similarity between training and test examples. We show that, under stringent performance assessment protocols, a simple kernel-based approach can exceed the predictive performance of existing state-of-the-art methods. We also show that random pairing for generating synthetic negative examples for training and performance evaluation results in models with better generalization in comparison to more sophisticated strategies used in existing studies. Our analyses indicate that using proposed experiment design strategies can offer significant improvements for CPI prediction leading to effective target compound screening for drug repurposing and discovery of putative chemical ligands of SARS-CoV-2-Spike and Human-ACE2 proteins. AVAILABILITY AND IMPLEMENTATION: Code and supplementary material available at https://github.com/adibayaseen/HKRCPI.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
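The random-pairing strategy for synthetic negative examples mentioned in this abstract can be sketched as follows. This is an illustrative guess at the general idea, not the code from the linked repository: the function name, the pair representation, and the decision to exclude known positive pairs from the sampled negatives are assumptions.

```python
import random


def sample_random_negatives(positives, n_negatives, seed=0):
    """Draw synthetic negative (compound, protein) pairs by random pairing.

    Compounds and proteins are taken from the known positive pairs and
    paired uniformly at random. Pairs already known to interact are
    skipped (an assumption of this sketch).
    """
    rng = random.Random(seed)
    known = set(positives)
    compounds = sorted({c for c, _ in known})
    proteins = sorted({p for _, p in known})
    max_possible = len(compounds) * len(proteins) - len(known)
    if n_negatives > max_possible:
        raise ValueError("not enough unlabeled pairs to sample from")
    negatives = set()
    while len(negatives) < n_negatives:
        pair = (rng.choice(compounds), rng.choice(proteins))
        if pair not in known:
            negatives.add(pair)
    return sorted(negatives)


# Hypothetical identifiers, for illustration only.
positives = [("c1", "p1"), ("c2", "p2")]
negatives = sample_random_negatives(positives, n_negatives=2)
```

Rejection sampling is simple but slows down when almost every possible pair is already a known positive; enumerating the unlabeled pairs and sampling from that list would avoid the issue.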
Subjects
Angiotensin-Converting Enzyme 2 , Machine Learning , Humans , Ligands , SARS-CoV-2
ABSTRACT
BACKGROUND: Mitotic count in breast cancer is an important prognostic marker. Unfortunately, substantial inter- and intraobserver variation exists when pathologists manually count mitotic figures. To alleviate this problem, we developed a new technique incorporating both haematoxylin and eosin (H&E) and phosphorylated histone H3 (PHH3), a marker highly specific to mitotic figures, and compared it to visual scoring of mitotic figures using H&E only. METHODS: Two full-face sections from 97 cases were cut, one stained with H&E only, and the other was stained with PHH3 and counterstained with H&E (PHH3-H&E). Counting mitoses using PHH3-H&E was compared to traditional mitoses scoring using H&E in terms of reproducibility, scoring time, and the ability to detect mitosis hotspots. We assessed the agreement between manual and image analysis-assisted scoring of mitotic figures using H&E and PHH3-H&E-stained cells. The diagnostic performance of PHH3 in detecting mitotic figures in terms of sensitivity and specificity was measured. Finally, PHH3 replaced the mitosis score in a multivariate analysis to assess its significance. RESULTS: Pathologists detected significantly higher mitotic figures using the PHH3-H&E (median ± SD, 20 ± 33) compared with H&E alone (median ± SD, 16 ± 25), P < 0.001. The concordance between pathologists in identifying mitotic figures was highest when using the dual PHH3-H&E technique; in addition, it highlighted mitotic figures at low power, allowing better agreement on choosing the hotspot area (k = 0.842) in comparison with standard H&E (k = 0.625). A better agreement between image analysis-assisted software and the human eye was observed for PHH3-stained mitotic figures. 
When the mitosis score was replaced with PHH3 in a Cox regression model with other grade components, PHH3 was an independent predictor of survival (hazard ratio [HR] 5.66, 95% confidence interval [CI] 1.92-16.69; P = 0.002), and even showed a more significant association with breast cancer-specific survival (BCSS) than mitosis (HR 3.63, 95% CI 1.49-8.86; P = 0.005) and Ki67 (P = 0.27). CONCLUSION: Using PHH3-H&E-stained slides can reliably be used in routine scoring of mitotic figures and integrating both techniques will compensate for each other's limitations and improve diagnostic accuracy, quality, and precision.
Subjects
Breast Neoplasms , Humans , Female , Eosine Yellowish-(YS) , Mitotic Index/methods , Breast Neoplasms/diagnosis , Hematoxylin , Reproducibility of Results , Biomarkers, Tumor/analysis , Immunohistochemistry , Mitosis , Antibodies , Phosphorylation
ABSTRACT
CRISPR-Cas is an anti-viral mechanism of prokaryotes that has been widely adopted for genome editing. To make CRISPR-Cas genome editing more controllable and safer to use, anti-CRISPR proteins have been recently exploited to prevent excessive/prolonged Cas nuclease cleavage. Anti-CRISPR (Acr) proteins are encoded by (pro)phages/(pro)viruses, and have the ability to inhibit their host's CRISPR-Cas systems. We have built an online database AcrDB (http://bcb.unl.edu/AcrDB) by scanning ~19 000 genomes of prokaryotes and viruses with AcrFinder, a recently developed Acr-Aca (Acr-associated regulator) operon prediction program. Proteins in Acr-Aca operons were further processed by two machine learning-based programs (AcRanker and PaCRISPR) to obtain numerical scores/ranks. Compared to other anti-CRISPR databases, AcrDB has the following unique features: (i) It is a genome-scale database with the largest collection of data (39 799 Acr-Aca operons containing Aca or Acr homologs); (ii) It offers a user-friendly web interface with various functions for browsing, graphically viewing, searching, and batch downloading Acr-Aca operons; (iii) It focuses on the genomic context of Acr and Aca candidates instead of individual Acr protein family and (iv) It collects data with three independent programs each having a unique data mining algorithm for cross validation. AcrDB will be a valuable resource to the anti-CRISPR research community.
Subjects
CRISPR-Cas Systems/genetics , Databases, Genetic , Operon/genetics , Prokaryotic Cells/metabolism , Viruses/metabolism , Internet
ABSTRACT
The increasing use of CRISPR-Cas9 in medicine, agriculture, and synthetic biology has accelerated the drive to discover new CRISPR-Cas inhibitors as potential mechanisms of control for gene editing applications. Many anti-CRISPRs have been found that inhibit the CRISPR-Cas adaptive immune system. However, comparing all currently known anti-CRISPRs does not reveal a shared set of properties for facile bioinformatic identification of new anti-CRISPR families. Here, we describe AcRanker, a machine learning-based method to aid direct identification of new potential anti-CRISPRs using only protein sequence information. Using a training set of known anti-CRISPRs, we built a model based on XGBoost ranking. We then applied AcRanker to predict candidate anti-CRISPRs from predicted prophage regions within self-targeting bacterial genomes and discovered two previously unknown anti-CRISPRs: AcrIIA20 (ML1) and AcrIIA21 (ML8). We show that AcrIIA20 strongly inhibits Streptococcus iniae Cas9 (SinCas9) and weakly inhibits Streptococcus pyogenes Cas9 (SpyCas9). We also show that AcrIIA21 inhibits SpyCas9, Staphylococcus aureus Cas9 (SauCas9) and SinCas9 with low potency. The addition of AcRanker to the anti-CRISPR discovery toolkit allows researchers to directly rank potential anti-CRISPR candidate genes for increased speed in testing and validation of new anti-CRISPRs. A web server implementation for AcRanker is available online at http://acranker.pythonanywhere.com/.
Subjects
Bacterial Proteins/genetics , CRISPR-Associated Protein 9/antagonists & inhibitors , Machine Learning , Bacterial Proteins/chemistry , Prophages/genetics , Proteome , Sequence Analysis, Protein , Streptococcus/enzymology , Streptococcus/genetics
ABSTRACT
Urine cytology is a test for the detection of high-grade bladder cancer. In clinical practice, the pathologist manually scans the sample under the microscope to locate atypical and malignant cells and assesses the morphology of these cells to make a diagnosis. Accurate identification of atypical and malignant cells in urine cytology is a challenging task and is an essential part of identifying different diagnoses with low-risk and high-risk malignancy. Computer-assisted identification of malignancy in urine cytology can complement the work of clinicians in treatment management and in providing advice on further tests. In this study, we present a method for identifying atypical and malignant cells followed by their profiling to predict the risk of diagnosis automatically. For cell detection and classification, we employed two different deep learning-based approaches. Based on the best performing network predictions at the cell level, we identified low-risk and high-risk cases using the count of atypical cells and the total count of atypical and malignant cells. The area under the receiver operating characteristic (ROC) curve shows that the total count of atypical and malignant cells is comparatively better at diagnosis than the count of malignant cells only. We obtained areas under the ROC curve of 0.81 with the count of malignant cells and 0.83 with the total count of atypical and malignant cells. Our experiments also demonstrate that the digital risk could be a better predictor of the final histopathology-based diagnosis. We also analyzed the variability in annotations at both the cell and whole slide image levels and explored the possible inherent rationales behind this variability.
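The comparison of two count-based risk scores by area under the ROC curve can be sketched as follows. This is a minimal illustration using the rank (Mann-Whitney) formulation of AUC; the function name and the per-case counts are hypothetical and are not data from the study.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive case outscores a negative one.

    Every (positive, negative) pair is compared; ties count as half.
    This is equivalent to the area under the empirical ROC curve.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical per-case cell counts; label 1 marks a high-risk case.
high_risk = [0, 0, 1, 1]
malignant_counts = [1, 3, 2, 5]
atypical_counts = [0, 1, 4, 2]
total_counts = [a + m for a, m in zip(atypical_counts, malignant_counts)]

auc_malignant = roc_auc(high_risk, malignant_counts)
auc_total = roc_auc(high_risk, total_counts)
```

Using the raw count as the score avoids choosing a cutoff in advance; a threshold on the count can then be read off the ROC curve at the desired sensitivity.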
Subjects
Deep Learning , Cytodiagnosis , ROC Curve , Risk Assessment
ABSTRACT
AIMS: Tumour genotype and phenotype are related and can predict outcome. In this study, we hypothesised that the visual assessment of breast cancer (BC) morphological features can provide valuable insight into underlying molecular profiles. METHODS AND RESULTS: The Cancer Genome Atlas (TCGA) BC cohort was used (n = 743) and morphological features, including Nottingham grade and its components and nucleolar prominence, were assessed utilising whole-slide images (WSIs). Two independent scores were assigned, and discordant cases were utilised to represent cases with intermediate morphological features. Differentially expressed genes (DEGs) were identified for each feature, compared among concordant/discordant cases and tested for specific pathways. Concordant grading was observed in 467 of 743 (63%) of cases. Among concordant case groups, eight common DEGs (UGT8, DDC, RGR, RLBP1, SPRR1B, CXorf49B, PSAPL1 and SPRR2G) were associated with overall tumour grade and its components. These genes are related mainly to cellular proliferation, differentiation and metabolism. The number of DEGs in cases with discordant grading was larger than those identified in concordant cases. The largest number of DEGs was observed in discordant grade 1:3 cases (n = 1185). DEGs were identified for each discordant component. Some DEGs were uniquely associated with well-defined specific morphological features, whereas expression/co-expression of other genes was identified across multiple features and underlined intermediate morphological features. CONCLUSION: Morphological features are probably related to distinct underlying molecular profiles that drive both morphology and behaviour. This study provides further evidence to support the use of image-based analysis of WSIs, including artificial intelligence algorithms, to predict tumour molecular profiles and outcome.
Subjects
Breast Neoplasms/genetics , Breast Neoplasms/pathology , Cytodiagnosis/methods , Female , Gene Expression Profiling/methods , Humans , Transcriptome
ABSTRACT
BACKGROUND: Determining protein-protein interactions and their binding affinity are important in understanding cellular biological processes, discovery and design of novel therapeutics, protein engineering, and mutagenesis studies. Due to the time and effort required in wet lab experiments, computational prediction of binding affinity from sequence or structure is an important area of research. Structure-based methods, though more accurate than sequence-based techniques, are limited in their applicability due to limited availability of protein structure data. RESULTS: In this study, we propose a novel machine learning method for predicting binding affinity that uses protein 3D structure as privileged information at training time while expecting only protein sequence information during testing. Using the method, which is based on the framework of learning using privileged information (LUPI), we have achieved improved performance over corresponding sequence-based binding affinity prediction methods that do not have access to privileged information during training. Our experiments show that with the proposed framework which uses structure only during training, it is possible to achieve classification performance comparable to that which is obtained using structure-based features. Evaluation on an independent test set shows improved performance over the PPA-Pred2 method as well. CONCLUSIONS: The proposed method outperforms several baseline learners and a state-of-the-art binding affinity predictor not only in cross-validation, but also on an additional validation dataset, demonstrating the utility of the LUPI framework for problems that would benefit from classification using structure-based features. The implementation of LUPI developed for this work is expected to be useful in other areas of bioinformatics as well.
Subjects
Algorithms , Computational Biology/methods , Machine Learning , Proteins/metabolism , Amino Acid Sequence , Ligands , Protein Binding , Proteins/chemistry , ROC Curve , Reproducibility of Results , Support Vector Machine
ABSTRACT
Many prion-forming proteins contain glutamine/asparagine (Q/N) rich domains, and there are conflicting opinions as to the role of primary sequence in their conversion to the prion form: is this phenomenon driven primarily by amino acid composition, or, as a recent computational analysis suggested, dependent on the presence of short sequence elements with high amyloid-forming potential. The argument for the importance of short sequence elements hinged on the relatively-high accuracy obtained using a method that utilizes a collection of length-six sequence elements with known amyloid-forming potential. We weigh in on this question and demonstrate that when those sequence elements are permuted, even higher accuracy is obtained; we also propose a novel multiple-instance machine learning method that uses sequence composition alone, and achieves better accuracy than all existing prion prediction approaches. While we expect there to be elements of primary sequence that affect the process, our experiments suggest that sequence composition alone is sufficient for predicting protein sequences that are likely to form prions. A web-server for the proposed method is available at http://faculty.pieas.edu.pk/fayyaz/prank.html, and the code for reproducing our experiments is available at http://doi.org/10.5281/zenodo.167136.
Subjects
Amino Acid Sequence , Asparagine/chemistry , Computational Biology/methods , Glutamine/chemistry , Machine Learning , Prions/chemistry , Amyloid/chemistry , Humans , Prions/metabolism , Yeasts
ABSTRACT
Due to Ca2+-dependent binding and the sequence diversity of Calmodulin (CaM) binding proteins, identifying CaM interactions and binding sites in the wet lab is tedious and costly. Therefore, computational methods for this purpose are crucial to the design of such wet-lab experiments. We present an algorithm suite called CaMELS (CalModulin intEraction Learning System) for predicting proteins that interact with CaM as well as their binding sites using sequence information alone. CaMELS offers state-of-the-art accuracy for both CaM interaction and binding site prediction and can aid biologists in studying CaM binding proteins. For CaM interaction prediction, CaMELS uses protein sequence features coupled with a large-margin classifier. CaMELS models the binding site prediction problem using multiple instance machine learning with a custom optimization algorithm, which allows more effective learning over imprecisely annotated CaM-binding sites during training. CaMELS has been extensively benchmarked using a variety of data sets, mutagenic studies, proteome-wide Gene Ontology enrichment analyses and protein structures. Our experiments indicate that CaMELS outperforms simple motif-based search and other existing methods for interaction and binding site prediction. We have also found that the whole sequence of a protein, rather than just its binding site, is important for predicting its interaction with CaM. Using the machine learning model in CaMELS, we have identified important features of protein sequences for CaM interaction prediction as well as characteristic amino acid sub-sequences and their relative position for identifying CaM binding sites. Python code for training and evaluating CaMELS together with a webserver implementation is available at the URL: http://faculty.pieas.edu.pk/fayyaz/software.html#camels.
Subjects
Calmodulin-Binding Proteins/chemistry , Calmodulin/chemistry , Proteome/genetics , Software , Algorithms , Amino Acid Sequence , Binding Sites , Calmodulin-Binding Proteins/genetics , Computer Simulation , Protein Binding , Proteome/chemistry
ABSTRACT
Nuclei detection in histology images is an essential part of computer-aided diagnosis of cancers and tumors. It is a challenging task due to the diverse and complicated structures of cells. In this work, we present an automated technique for detection of cellular nuclei in hematoxylin and eosin stained histopathology images. Our proposed approach is based on kernelized correlation filters. Correlation filters have been widely used in object detection and tracking applications, but their strength has not been explored in the medical imaging domain until now. Our experimental results show that the proposed scheme gives state-of-the-art accuracy and can learn complex nuclear morphologies. Like deep learning approaches, the proposed filters do not require engineering of image features as they can operate directly on histopathology images without significant preprocessing. However, unlike deep learning methods, the large-margin correlation filters developed in this work are interpretable, computationally efficient and do not require specialized or expensive computing hardware. AVAILABILITY: A cloud-based webserver of the proposed method and its Python implementation can be accessed at the following URL: http://faculty.pieas.edu.pk/fayyaz/software.html#corehist.
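As a rough illustration of how a correlation filter detects objects, the sketch below trains a plain linear, ridge-regularized correlation filter in the Fourier domain. This is a deliberately simplified stand-in: it is neither kernelized nor large-margin, and it is not the method described above. The function names, the Gaussian target, and the regularization value are assumptions made for this example.

```python
import numpy as np


def gaussian_target(shape, center, sigma=2.0):
    """Desired filter response: a Gaussian peak at `center` (row, col)."""
    rows, cols = np.indices(shape)
    d2 = (rows - center[0]) ** 2 + (cols - center[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))


def train_filter(patch, target, lam=1e-3):
    """Closed-form ridge solution for the filter in the Fourier domain.

    Returns H* such that F * H* approximates G element-wise, where F and G
    are the 2D FFTs of the training patch and the desired response.
    """
    F = np.fft.fft2(patch)
    G = np.fft.fft2(target)
    return (G * np.conj(F)) / (F * np.conj(F) + lam)


def respond(filter_conj, image):
    """Filter response map for `image`; peaks mark candidate detections."""
    return np.real(np.fft.ifft2(np.fft.fft2(image) * filter_conj))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    patch = rng.standard_normal((32, 32))  # synthetic stand-in for an image patch
    h = train_filter(patch, gaussian_target(patch.shape, (5, 7)))
    peak = np.unravel_index(np.argmax(respond(h, patch)), patch.shape)
    print(peak)
```

Training and evaluation both reduce to element-wise operations on FFTs, which is why filters of this family are cheap to compute. A circular shift of the input shifts the response peak by the same offset.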
Subjects
Cell Nucleus/pathology , Image Interpretation, Computer-Assisted/methods , Machine Learning , Fourier Analysis , Humans
ABSTRACT
We present a novel partner-specific protein-protein interaction site prediction method called PAIRpred. Unlike most existing machine learning binding site prediction methods, PAIRpred uses information from both proteins in a protein complex to predict pairs of interacting residues from the two proteins. PAIRpred captures sequence and structure information about residue pairs through pairwise kernels that are used for training a support vector machine classifier. As a result, PAIRpred presents a more detailed model of protein binding and offers state-of-the-art accuracy in predicting binding sites at the protein level as well as inter-protein residue contacts at the complex level. We demonstrate PAIRpred's performance on Docking Benchmark 4.0 and recent CAPRI targets. We present a detailed performance analysis outlining the contribution of different sequence and structure features, together with a comparison to a variety of existing interface prediction techniques. We have also studied the impact of binding-associated conformational change on prediction accuracy and found PAIRpred to be more robust to such structural changes than existing schemes. As an illustration of the potential applications of PAIRpred, we provide a case study in which PAIRpred is used to analyze the nature and specificity of the interface in the interaction of human ISG15 protein with NS1 protein from influenza A virus. Python code for PAIRpred is available at http://combi.cs.colostate.edu/supplements/pairpred/.
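The idea of a pairwise kernel can be sketched as follows: given a base kernel between the feature vectors of individual residues, a kernel between two residue pairs is built from products of base-kernel values. The symmetric tensor-product form shown here is one common construction and is not necessarily the one used in PAIRpred; the RBF base kernel, its `gamma` value, and all function names are assumptions for illustration.

```python
import math


def rbf(x, y, gamma=0.5):
    """Base kernel between two residue feature vectors (hypothetical choice)."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-gamma * sq_dist)


def pairwise_kernel(pair1, pair2, base=rbf):
    """Symmetric tensor-product kernel between residue pairs (a, b) and (c, d).

    K((a, b), (c, d)) = k(a, c) * k(b, d) + k(a, d) * k(b, c)
    The second term makes the value independent of the order in which the
    two residues of a pair are listed.
    """
    (a, b), (c, d) = pair1, pair2
    return base(a, c) * base(b, d) + base(a, d) * base(b, c)


def gram_matrix(pairs, base=rbf):
    """Kernel matrix over a list of residue pairs."""
    return [[pairwise_kernel(p, q, base) for q in pairs] for p in pairs]
```

A Gram matrix computed this way could be passed to a support vector machine implementation that accepts precomputed kernels, such as scikit-learn's `SVC(kernel="precomputed")`.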