ABSTRACT
Improving access to ever-increasing amounts of biomedical data provides entirely new opportunities for advanced patient stratification and disease subtyping strategies. This requires computational tools that produce uniformly robust results across highly heterogeneous molecular data. Unsupervised machine learning methodologies are able to discover de novo patterns in such data. Biclustering is especially well suited to this task because it simultaneously identifies sample groups and corresponding feature sets across heterogeneous omics data. However, the performance of available biclustering algorithms depends heavily on individual parameterization and varies with their application. Here, we developed MoSBi (molecular signature identification using biclustering), an automated multialgorithm ensemble approach that integrates results utilizing an error model-supported similarity network. We systematically evaluated the performance of 11 available and established biclustering algorithms together with MoSBi. For this, we used transcriptomics, proteomics, and metabolomics data, as well as synthetic datasets covering various data properties. Profiting from multialgorithm integration, MoSBi identified robust group and disease-specific signatures across all scenarios, overcoming single algorithm specificities. Furthermore, we developed a scalable network-based visualization of bicluster communities that supports biological hypothesis generation. MoSBi is available as an R package and web service to make automated biclustering analysis accessible for application in molecular sample stratification.
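The idea of integrating biclusters from several algorithms in a similarity network can be sketched as follows. This is only an illustration: it assumes a Jaccard index over (sample, feature) cells as the similarity measure and a hand-picked cutoff, neither of which is MoSBi's actual error model, and all function names and data are hypothetical.

```python
from itertools import combinations


def bicluster_cells(samples, features):
    """Represent a bicluster as the set of (sample, feature) cells it covers."""
    return {(s, f) for s in samples for f in features}


def jaccard(a, b):
    """Jaccard index of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_network(biclusters, cutoff=0.3):
    """Return edges (i, j, similarity) between biclusters whose overlap
    reaches the cutoff. The cutoff is an assumed, hand-picked value."""
    cells = [bicluster_cells(s, f) for s, f in biclusters]
    edges = []
    for i, j in combinations(range(len(cells)), 2):
        sim = jaccard(cells[i], cells[j])
        if sim >= cutoff:
            edges.append((i, j, sim))
    return edges


# Hypothetical biclusters, as if reported by three different algorithms.
biclusters = [
    ({"p1", "p2", "p3"}, {"g1", "g2"}),
    ({"p1", "p2"}, {"g1", "g2", "g3"}),
    ({"p7", "p8"}, {"g9"}),
]
edges = similarity_network(biclusters)
```

In this toy input the first two biclusters share four of eight covered cells and are therefore linked, while the third stays isolated; communities in such a network would correspond to signatures supported by more than one algorithm.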
Subjects
Disease , Gene Expression Profiling , Metabolomics , Patients , Proteomics , Software , Algorithms , Cluster Analysis , Disease/classification , Humans , Patients/classification
ABSTRACT
BACKGROUND: Artificial intelligence models constitute specific uses of analysis results and, therefore, necessitate evaluation of analytical performance specifications (APS) for this context specifically. The Model for End-Stage Liver Disease (MELD) is a clinical prediction model based on measurements of bilirubin, creatinine, and the international normalized ratio (INR). This study evaluates the propagation of error through the MELD, to inform choice of APS for the MELD input variables. METHODS: A total of 6093 consecutive MELD scores and underlying analysis results were retrospectively collected. "Desirable analytical variation" based on biological variation as well as current local analytical variation was simulated onto the data set as well as onto a constructed data set, representing a worst-case scenario. Resulting changes in MELD score and risk classification were calculated. RESULTS: Biological variation-based APS in the worst-case scenario resulted in 3.26% of scores changing by ≥1 MELD point. In the patient-derived data set, the same variation resulted in 0.92% of samples changing by ≥1 MELD point, and 5.5% of samples changing risk category. Local analytical performance resulted in lower reclassification rates. CONCLUSIONS: Error propagation through MELD is complex and includes population-dependent mechanisms. Biological variation-derived APS were acceptable for all uses of the MELD score. Other combinations of APS can yield equally acceptable results. This analysis exemplifies how error propagation through artificial intelligence models can become highly complex. This complexity will necessitate that both model suppliers and clinical laboratories address analytical performance specifications for the specific use case, as these may differ from performance specifications for traditional use of the analyses.
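The kind of error propagation described above can be sketched with a small simulation. The sketch assumes the commonly cited classic MELD formulation (inputs below 1.0 set to 1.0, creatinine capped at 4.0 mg/dL, result rounded); the abstract does not state which MELD variant or which noise model the study used, so the multiplicative Gaussian error, the function names, and the coefficient-of-variation values are illustrative assumptions.

```python
import math
import random


def meld(bilirubin_mg_dl, creatinine_mg_dl, inr):
    """Classic MELD score (assumed variant): inputs below 1.0 are set to 1.0,
    creatinine is capped at 4.0 mg/dL, and the result is rounded."""
    bili = max(bilirubin_mg_dl, 1.0)
    crea = min(max(creatinine_mg_dl, 1.0), 4.0)
    inr = max(inr, 1.0)
    score = (3.78 * math.log(bili) + 9.57 * math.log(crea)
             + 11.2 * math.log(inr) + 6.43)
    return round(score)


def perturb(value, cv, rng):
    """Apply multiplicative Gaussian analytical error with coefficient of variation cv."""
    return value * (1.0 + rng.gauss(0.0, cv))


def fraction_changed(patients, cvs, n_rep=200, seed=0):
    """Fraction of simulated re-measurements whose MELD differs by >= 1 point."""
    rng = random.Random(seed)
    changed = total = 0
    for bili, crea, inr in patients:
        base = meld(bili, crea, inr)
        for _ in range(n_rep):
            noisy = meld(perturb(bili, cvs["bili"], rng),
                         perturb(crea, cvs["crea"], rng),
                         perturb(inr, cvs["inr"], rng))
            changed += abs(noisy - base) >= 1
            total += 1
    return changed / total


# Hypothetical patients (bilirubin mg/dL, creatinine mg/dL, INR) and hypothetical CVs.
patients = [(0.5, 0.5, 0.5), (2.4, 1.3, 1.6), (8.0, 3.9, 2.2)]
cvs = {"bili": 0.10, "crea": 0.03, "inr": 0.03}
rate = fraction_changed(patients, cvs)
```

The clamping and capping steps make the propagation population dependent: a patient whose inputs all lie well below 1.0 never changes score under this formulation, whereas values near a bound or a rounding boundary are sensitive to small errors.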
Subjects
End Stage Liver Disease , Humans , Retrospective Studies , Artificial Intelligence , Models, Statistical , Prognosis , Severity of Illness Index , Creatinine
ABSTRACT
MOTIVATION: Liquid-chromatography mass-spectrometry (LC-MS) is the established standard for analyzing the proteome in biological samples by identification and quantification of thousands of proteins. Machine learning (ML) promises to considerably improve the analysis of the resulting data; however, no tool yet mediates the path from raw data to modern ML applications. More specifically, ML applications are currently hampered by three major limitations: (i) absence of balanced training data with large sample size; (ii) unclear definition of sufficiently information-rich data representations for e.g. peptide identification; (iii) lack of benchmarking of ML methods on specific LC-MS problems. RESULTS: We created the MS2AI pipeline that automates the process of gathering vast quantities of MS data for large-scale ML applications. The software retrieves raw data from either in-house sources or from the proteomics identifications database, PRIDE. Subsequently, the raw data are stored in a standardized format amenable to ML, encompassing MS1/MS2 spectra and peptide identifications. This tool bridges the gap between MS and AI, and to this effect we also present an ML application in the form of a convolutional neural network for the identification of oxidized peptides. AVAILABILITY AND IMPLEMENTATION: An open-source implementation of the software can be found at https://gitlab.com/roettgerlab/ms2ai. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Peptides , Tandem Mass Spectrometry , Chromatography, Liquid/methods , Tandem Mass Spectrometry/methods , Peptides/analysis , Software , Proteome/chemistry
ABSTRACT
OBJECTIVES: To compare plasma levels of 92 cardiovascular- and inflammation-related proteins (CIRPs) and to analyse for associations with anti-cyclic citrullinated peptide (anti-CCP) status and disease activity in early and treatment-naive rheumatoid arthritis (RA). METHODS: The Olink CVD-III panel was used to measure 92 CIRP plasma levels in 180 early, treatment-naive, and highly inflamed RA patients from the OPERA trial. CIRP plasma levels as well as the correlation between CIRP plasma levels and RA disease activity were compared between anti-CCP groups. CIRP level-based hierarchical cluster analysis was performed in each anti-CCP group separately. RESULTS: The study included 117 anti-CCP-positive and 63 anti-CCP-negative RA patients. Among the 92 CIRPs measured, the levels of chitotriosidase-1 (CHIT1) and tyrosine-protein-phosphatase non-receptor-type substrate-1 (SHPS-1) were increased and those of metalloproteinase inhibitor-4 (TIMP-4) decreased in the anti-CCP-negative group compared to the anti-CCP-positive group. The strongest associations with RA disease activity were found for interleukin-2 receptor-subunit-alpha (IL2-RA) and E-selectin levels in the anti-CCP-negative group and for C-C-motif chemokine-16 (CCL16) levels in the anti-CCP-positive group. None of the differences passed the Hochberg sequential multiplicity test; however, the CIRPs were interacting, and thus the prerequisites of the Hochberg procedure were not fulfilled. CIRP level-based cluster analysis identified two patient clusters in both anti-CCP groups. Demographic and clinical characteristics were similar in the two clusters for each anti-CCP group. CONCLUSIONS: In active and early RA, the findings regarding CHIT1, SHPS-1, TIMP-4, IL2-RA, E-selectin, and CCL16 differed between the two anti-CCP groups. In addition, we identified two patient clusters that were independent of the anti-CCP status.
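For readers unfamiliar with the Hochberg sequential multiplicity test mentioned above, here is a minimal sketch of the step-up procedure. It is a generic textbook implementation, not the authors' analysis code, and the p-values in the usage lines are made up.

```python
def hochberg(p_values, alpha=0.05):
    """Hochberg step-up procedure.

    Returns a list of booleans (True = hypothesis rejected) in the
    original order of p_values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, ascending p
    n_reject = 0
    # Scan from the largest p-value down; the first one that passes its
    # threshold alpha / (m - rank + 1) rejects itself and all smaller ones.
    for rank in range(m, 0, -1):
        if p_values[order[rank - 1]] <= alpha / (m - rank + 1):
            n_reject = rank
            break
    rejected = [False] * m
    for idx in order[:n_reject]:
        rejected[idx] = True
    return rejected


# Made-up p-values for illustration.
all_pass = hochberg([0.01, 0.04, 0.03])
one_pass = hochberg([0.01, 0.06, 0.03])
```

The procedure controls the family-wise error rate under independence (or certain forms of positive dependence) of the test statistics, which is the prerequisite the abstract refers to.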
Subjects
Arthritis, Rheumatoid , E-Selectin , Humans , Anti-Citrullinated Protein Antibodies , Interleukin-2 , Autoantibodies , Arthritis, Rheumatoid/diagnosis , Inflammation , Peptides, Cyclic
ABSTRACT
BACKGROUND: Machine learning and artificial intelligence have shown promising results in many areas and are driven by the increasing amount of available data. However, these data are often distributed across different institutions and cannot be easily shared owing to strict privacy regulations. Federated learning (FL) allows the training of distributed machine learning models without sharing sensitive data. However, the implementation is time-consuming and requires advanced programming skills and complex technical infrastructures. OBJECTIVE: Various tools and frameworks have been developed to simplify the development of FL algorithms and provide the necessary technical infrastructure. Although there are many high-quality frameworks, most focus only on a single application case or method. To our knowledge, there are no generic frameworks, meaning that the existing solutions are restricted to a particular type of algorithm or application field. Furthermore, most of these frameworks provide an application programming interface that requires programming knowledge. There is no collection of ready-to-use FL algorithms that are extendable and allow users (eg, researchers) without programming knowledge to apply FL. A central FL platform for both FL algorithm developers and users does not exist. This study aimed to address this gap and make FL available to everyone by developing FeatureCloud, an all-in-one platform for FL in biomedicine and beyond. METHODS: The FeatureCloud platform consists of 3 main components: a global frontend, a global backend, and a local controller. Our platform uses Docker to separate the locally acting components of the platform from the sensitive data systems. We evaluated our platform using 4 different algorithms on 5 data sets for both accuracy and runtime.
RESULTS: FeatureCloud removes the complexity of distributed systems for developers and end users by providing a comprehensive platform for executing multi-institutional FL analyses and implementing FL algorithms. Through its integrated artificial intelligence store, federated algorithms can easily be published and reused by the community. To secure sensitive raw data, FeatureCloud supports privacy-enhancing technologies to secure the shared local models and assures high standards in data privacy to comply with the strict General Data Protection Regulation. Our evaluation shows that applications developed in FeatureCloud can produce highly similar results compared with centralized approaches and scale well for an increasing number of participating sites. CONCLUSIONS: FeatureCloud provides a ready-to-use platform that integrates the development and execution of FL algorithms while reducing the complexity to a minimum and removing the hurdles of federated infrastructure. Thus, we believe that it has the potential to greatly increase the accessibility of privacy-preserving and distributed data analyses in biomedicine and beyond.
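The basic federated pattern described above, in which only aggregates leave each site, can be sketched for a very simple statistic. This is a generic illustration and not the FeatureCloud API: the function names are hypothetical, the "model" is just a mean and variance, and no privacy-enhancing technology is applied to the shared values.

```python
def local_update(values):
    """Runs at each site: only aggregate statistics leave the site."""
    return {
        "n": len(values),
        "sum": sum(values),
        "sum_sq": sum(v * v for v in values),
    }


def aggregate(updates):
    """Runs at the coordinator: combine local statistics into the
    global mean and (population) variance."""
    n = sum(u["n"] for u in updates)
    mean = sum(u["sum"] for u in updates) / n
    var = sum(u["sum_sq"] for u in updates) / n - mean ** 2
    return mean, var


# Two hypothetical sites holding private measurements.
site_a = [1.0, 2.0, 3.0]
site_b = [4.0, 5.0]
fed_mean, fed_var = aggregate([local_update(site_a), local_update(site_b)])

# Centralized reference computed on the pooled data, for comparison only.
pooled = site_a + site_b
central_mean = sum(pooled) / len(pooled)
```

For statistics like these the federated result equals the centralized one exactly; iterative models trained by federated averaging typically match only approximately, which is in line with the "highly similar results" wording of the abstract.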
Subjects
Algorithms , Artificial Intelligence , Humans , Health Occupations , Software , Computer Communication Networks , Privacy
ABSTRACT
BACKGROUND: Human endogenous retrovirus (HERV) expression in multiple sclerosis (MS) brain lesions may contribute to chronic inflammation, but the genome-wide expression of HERVs in different MS lesions is unknown. OBJECTIVE: We examined the HERV expression landscape in different MS lesions compared to control brains. METHODS: Transcripts from 71 MS brain samples and 25 control white matter (WM) samples were obtained by next-generation RNA sequencing and mapped against HERV transcripts across the human genome. Differential expression of mapped HERV-W and HERV-H reads between MS lesion types and controls was analysed. RESULTS: Out of 6.38 billion high-quality paired-end reads, 174 million reads (2.73%) mapped to HERV transcripts. There was no difference in HERV expression levels between MS and control brains, but HERV-W transcripts were significantly reduced in chronic active lesions. Of the four HERV-W transcripts exclusively present in MS, ERV3633503, located on chromosome 7q21.13 close to the MS genetic risk locus, had the highest number of reads. In the HERV-H family, 75% of transcripts located in the nearby 7q21-22 region were overrepresented in MS, and ERV3643914 was expressed more than 16-fold higher in MS than in control brains. CONCLUSION: Novel HERV-W and HERV-H transcripts located at chromosome 7 regions were uniquely expressed in MS lesions, indicating their potential role in brain lesion evolution.
Subjects
Endogenous Retroviruses , Multiple Sclerosis , Brain , Endogenous Retroviruses/genetics , Genome, Human , High-Throughput Nucleotide Sequencing , Humans , Multiple Sclerosis/genetics
ABSTRACT
Gene regulatory networks (GRNs) and gene expression data form a core element of systems biology-based phenotyping. Changes in the expression of transcription factors are commonly believed to have a causal effect on the expression of their targets. Here we evaluated, in the best-researched model organism, Escherichia coli, the consistency between a GRN and a large gene expression compendium. Surprisingly, only a modest correlation was observed between the expression of transcription factors and their targets and, most noteworthy, both activating and repressing interactions were associated with positive correlation. When evaluated using a sign consistency model, we found that the regulatory network was not more consistent with measured expression than random network models. We conclude that, at least in E. coli, one cannot expect a causal relationship between the expression of transcription factors and their targets, and that the current static GRN does not adequately explain transcriptional regulation. The implications of this are profound, as they question what we consider established knowledge of the systemic biology of cells and point to methodological limitations with respect to single omics analysis, static networks and temporality.
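A simplified per-edge version of such a consistency check can be sketched as follows. It assumes that an activating interaction predicts positive and a repressing interaction negative Pearson correlation between regulator and target; the sign consistency model used in the study may be defined differently, and the gene names and expression values below are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def sign_consistency(expression, interactions):
    """Fraction of interactions whose correlation sign matches the annotation.

    Each interaction is (regulator, target, effect) with effect +1 for
    activation and -1 for repression.
    """
    consistent = 0
    for tf, target, effect in interactions:
        r = pearson(expression[tf], expression[target])
        consistent += r * effect > 0
    return consistent / len(interactions)


# Invented expression profiles across four conditions.
expression = {
    "tfA": [1.0, 2.0, 3.0, 4.0],
    "g1": [2.0, 4.0, 6.0, 8.0],
    "g2": [1.0, 2.0, 3.0, 5.0],
}
interactions = [("tfA", "g1", +1), ("tfA", "g2", -1)]
score = sign_consistency(expression, interactions)
```

To judge whether a network explains the data, the same score would be computed for randomized networks (for example with shuffled targets or effects) and compared with the score of the real network.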
Subjects
Escherichia coli/genetics , Gene Regulatory Networks/genetics , Models, Theoretical , Algorithms , Gene Expression Regulation, Bacterial/genetics , Systems Biology/trends
ABSTRACT
[This corrects the article DOI: 10.2196/28253.].
ABSTRACT
BACKGROUND: Before the advent of an effective vaccine, nonpharmaceutical interventions, such as mask-wearing, social distancing, and lockdowns, have been the primary measures to combat the COVID-19 pandemic. Such measures are highly effective when there is high population-wide adherence, which requires information on current risks posed by the pandemic alongside a clear exposition of the rules and guidelines in place. OBJECTIVE: Here we analyzed online news media coverage of COVID-19. We quantified the total volume of COVID-19 articles, their sentiment polarization, and leading subtopics to act as a reference to inform future communication strategies. METHODS: We collected 26 million news articles from the front pages of 172 major online news sources in 11 countries (available online at SciRide). Using topic detection, we identified COVID-19-related content to quantify the proportion of total coverage the pandemic received in 2020. The sentiment analysis tool Vader was employed to stratify the emotional polarity of COVID-19 reporting. Further topic detection and sentiment analysis were performed on COVID-19 coverage to reveal the leading themes in pandemic reporting and their respective emotional polarizations. RESULTS: We found that COVID-19 coverage accounted for approximately 25.3% of all front-page online news articles between January and October 2020. Sentiment analysis of English-language sources revealed that overall COVID-19 coverage was not exclusively negatively polarized, suggesting widely heterogeneous reporting of the pandemic. Within this heterogeneous coverage, 16% of COVID-19 news articles (or 4% of all English-language articles) can be classified as highly negatively polarized, citing issues such as death, fear, or crisis. CONCLUSIONS: The goal of COVID-19 public health communication is to increase understanding of distancing rules and to maximize the impact of governmental policy.
The extent to which the quantity and quality of information from different communication channels (eg, social media, government pages, and news) influence public understanding of public health measures remains to be established. Here we conclude that a quarter of all reporting in 2020 covered COVID-19, which is indicative of information overload. In this capacity, our data and analysis form a quantitative basis for informing health communication strategies along traditional news media channels to minimize the risks of COVID-19 while vaccination is rolled out.
Subjects
COVID-19/epidemiology , Data Mining/methods , Mass Media/statistics & numerical data , Public Health/methods , Social Media/statistics & numerical data , Health Resources , Humans , Pandemics , SARS-CoV-2/isolation & purification
ABSTRACT
Identifying groups of similar objects is a popular first step in biomedical data analysis, but it is error-prone and impossible to perform manually. Many computational methods have been developed to tackle this problem. Here we assessed 13 well-known methods using 24 data sets ranging from gene expression to protein domains. Performance was judged on the basis of 13 common cluster validity indices. We developed a clustering analysis platform, ClustEval (http://clusteval.mpi-inf.mpg.de), to promote streamlined evaluation, comparison and reproducibility of clustering results in the future. This allowed us to objectively evaluate the performance of all tools on all data sets with up to 1,000 different parameter sets each, resulting in a total of more than 4 million calculated cluster validity indices. We observed that there was no universal best performer, but on the basis of this wide-ranging comparison we were able to develop a short guideline for biomedical clustering tasks. ClustEval allows biomedical researchers to pick the appropriate tool for their data type and allows method developers to compare their tool to the state of the art.
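The scale of this comparison follows directly from the numbers given, and a single cluster validity index is easy to illustrate. The sketch below uses the Rand index purely as an example of an external validity index (the abstract does not list which 13 indices were used), and the label vectors are invented.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of object pairs on which two clusterings agree
    (both place the pair together, or both place it apart)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Invented gold standard and predicted clustering of four objects.
gold = [0, 0, 1, 1]
predicted = [0, 0, 1, 2]
ri = rand_index(gold, predicted)

# Upper bound on the number of index evaluations implied by the abstract:
# 13 methods x 24 data sets x up to 1,000 parameter sets x 13 indices.
n_methods, n_datasets, max_param_sets, n_indices = 13, 24, 1000, 13
upper_bound = n_methods * n_datasets * max_param_sets * n_indices
```

The product is 4,056,000, which is consistent with the "more than 4 million" calculated indices reported above.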
Subjects
Cluster Analysis , Computational Biology/methods , Gene Expression Profiling/methods , Pattern Recognition, Automated/methods , Algorithms , Animals , Automation , Gene Expression Regulation , Humans , Oligonucleotide Array Sequence Analysis/methods , Protein Structure, Tertiary , Quality Control , Reproducibility of Results , Software
ABSTRACT
Motivation: Epigenome-wide association studies (EWAS) generate big epidemiological datasets. They aim to detect differentially methylated DNA regions that are likely to influence transcriptional gene activity and, thus, the regulation of metabolic processes. By far the most widely used technology is the Illumina Methylation BeadChip, which measures the methylation levels of 450 (850) thousand cytosines in the CpG dinucleotide context in a set of patients compared to a control group. Many bioinformatics tools exist for raw data analysis. However, most of them require some knowledge of the programming language R, have no user interface, and do not offer all necessary steps to guide users from raw data all the way down to statistically significant differentially methylated regions (DMRs) and the associated genes. Results: Here, we present DiMmeR (Discovery of Multiple Differentially Methylated Regions), the first free standalone software that interactively guides scientists the whole way through EWAS data analysis with a user-friendly graphical user interface (GUI). It offers parallelized statistical methods for efficiently identifying DMRs in both Illumina 450K and 850K EPIC chip data. DiMmeR computes empirical P-values through randomization tests, even for big datasets of hundreds of patients and thousands of permutations, within a few minutes on a standard desktop PC. It is independent of any third-party libraries, computes regression coefficients, P-values and empirical P-values, and it corrects for multiple testing. Availability and Implementation: DiMmeR is publicly available at http://dimmer.compbio.sdu.dk. Contact: diogoma@bmb.sdu.dk. Supplementary information: Supplementary data are available at Bioinformatics online.
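An empirical P-value from a randomization test can be sketched for a single CpG site as follows. This is a simplified stand-in: it uses the absolute difference in mean methylation between groups as the test statistic, whereas the abstract mentions regression coefficients, and the function name and beta values are hypothetical.

```python
import random


def empirical_p_value(cases, controls, n_perm=1000, seed=0):
    """Two-sided permutation test on the absolute difference in mean
    methylation between cases and controls at one CpG site."""
    rng = random.Random(seed)
    observed = abs(sum(cases) / len(cases) - sum(controls) / len(controls))
    pooled = list(cases) + list(controls)
    n_cases = len(cases)
    n_controls = len(pooled) - n_cases
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabelling of the samples
        diff = abs(sum(pooled[:n_cases]) / n_cases
                   - sum(pooled[n_cases:]) / n_controls)
        hits += diff >= observed
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Hypothetical methylation beta values for one site.
p_separated = empirical_p_value([0.90, 0.80, 0.85, 0.95],
                                [0.10, 0.20, 0.15, 0.05])
p_identical = empirical_p_value([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
```

In a genome-wide setting the same permutations would be applied to every site, and the resulting P-values would then be corrected for multiple testing.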
Subjects
CpG Islands , DNA Methylation , Epigenomics/methods , Oligonucleotide Array Sequence Analysis/methods , Software , Humans
ABSTRACT
Poor nutrition during critical growth phases may alter the structural and physiologic development of vital organs thus "programming" the susceptibility to adult-onset diseases and disease-related health conditions. Epigenome-wide association studies have been performed in birth-weight discordant twin pairs to find evidence for such "programming" effects, but no significant results emerged. We further investigated this issue using a new computational approach: Instead of probing single genomic sites for significant alterations in epigenetic marks, we scan for differentially methylated genomic regions. Whole genome DNA methylation levels were measured in whole blood from 150 pairs of adult identical twins discordant for birth-weight. Intrapair differential DNA methylation was associated with qualitative (large or small) and quantitative (percentage) birth-weight discordance at each genomic site using regression models adjusting for age and sex. Based on the regression results, genomic regions with consistent alteration patterns of DNA methylation were located and tested for significant robustness using computational permutation tests. This yielded an interesting genomic region on chromosome 1, which is significantly differentially methylated for quantitative birth-weight discordance. The region covers two genes (TYW3 and CRYZ) both reportedly associated with metabolism. We conclude that prenatal conditions for birth-weight discordance may result in persistent epigenetic modifications potentially affecting even adult health.
Subjects
Birth Weight , DNA Methylation , Epigenesis, Genetic , Adult , Aged , Female , Genome, Human , Genomics , Humans , Linear Models , Male , Middle Aged , Twins, Monozygotic
ABSTRACT
The explosion of biological data has dramatically reformed today's biological research. The need to integrate and analyze high-dimensional biological data on a large scale is driving the development of novel bioinformatics approaches. Biclustering, also known as 'simultaneous clustering' or 'co-clustering', has been successfully utilized to discover local patterns in gene expression data and similar biomedical data types. Here, we contribute a new heuristic: 'Bi-Force'. It is based on the weighted bicluster editing model and performs biclustering on arbitrary sets of biological entities, given any kind of pairwise similarities. We first evaluated the power of Bi-Force to solve dedicated bicluster editing problems by comparing Bi-Force with two existing algorithms in the BiCluE software package. We then followed a biclustering evaluation protocol in a recent review paper from Eren et al. (2013) (A comparative analysis of biclustering algorithms for gene expression data. Brief. Bioinform., 14:279-292.) and compared Bi-Force against eight existing tools: FABIA, QUBIC, Cheng and Church, Plaid, BiMax, Spectral, xMOTIFs and ISA. To this end, a suite of synthetic datasets as well as nine large gene expression datasets from Gene Expression Omnibus were analyzed. All resulting biclusters were subsequently investigated by Gene Ontology enrichment analysis to evaluate their biological relevance. The distinct theoretical foundation of Bi-Force (bicluster editing) is more powerful than strict biclustering. Bi-Force thus outperformed existing tools, at least when following the evaluation protocols from Eren et al. Bi-Force is implemented in Java and integrated into the open source software package of BiCluE. The software as well as all used datasets are publicly available at http://biclue.mpi-inf.mpg.de.
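The objective behind weighted bicluster editing can be illustrated with a small cost function. The sketch assumes a common formulation in which a pair with similarity above a user-given threshold counts as an edge and the cost of adding or removing an edge is its distance to the threshold; it evaluates a given solution only and is not the Bi-Force heuristic itself. All names and values are hypothetical.

```python
def editing_cost(similarity, threshold, row_cluster, col_cluster):
    """Cost of editing the thresholded bipartite graph into the given biclusters.

    similarity maps (row, column) pairs to a similarity value; a pair is an
    edge if its similarity exceeds the threshold. Pairs inside one bicluster
    must be edges and pairs across biclusters must not be; every violation
    costs |similarity - threshold|.
    """
    cost = 0.0
    for (r, c), s in similarity.items():
        same_bicluster = row_cluster[r] == col_cluster[c]
        has_edge = s > threshold
        if same_bicluster != has_edge:
            cost += abs(s - threshold)
    return cost


# Invented gene-by-condition similarities.
similarity = {
    ("g1", "c1"): 0.9, ("g1", "c2"): 0.2,
    ("g2", "c1"): 0.4, ("g2", "c2"): 0.8,
}
# Solution A: two biclusters, {g1, c1} and {g2, c2}.
cost_two = editing_cost(similarity, 0.5,
                        {"g1": 0, "g2": 1}, {"c1": 0, "c2": 1})
# Solution B: everything in one bicluster.
cost_one = editing_cost(similarity, 0.5,
                        {"g1": 0, "g2": 0}, {"c1": 0, "c2": 0})
```

A heuristic for this model searches for the assignment with the lowest total cost; in the toy example the two-bicluster solution needs no edits, while merging everything requires inserting two weak edges.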
Subjects
Gene Expression Profiling , Algorithms , Animals , Cluster Analysis , Computer Simulation , Databases, Genetic , Gene Ontology , Humans , Models, Genetic , Oligonucleotide Array Sequence Analysis , Principal Component Analysis , Software
ABSTRACT
BACKGROUND: Organisms utilize a multitude of mechanisms for responding to changing environmental conditions, maintaining their functional homeostasis, and overcoming stress situations. One of the most important mechanisms is transcriptional gene regulation. In-depth study of the transcriptional gene regulatory network can lead to various practical applications, creating a greater understanding of how organisms control their cellular behavior. DESCRIPTION: In this work, we present a new database, CMRegNet, for the gene regulatory networks of Corynebacterium glutamicum ATCC 13032 and Mycobacterium tuberculosis H37Rv. We furthermore transferred the known networks of these model organisms to 18 other non-model but phylogenetically close species (target organisms) of the CMNR group. In comparison to other network transfers, for the first time we utilized two model organisms, resulting in a more diverse and complete network of the target organisms. CONCLUSION: CMRegNet provides easy access to a total of 3,103 known regulations in C. glutamicum ATCC 13032 and M. tuberculosis H37Rv and to 38,940 evolutionarily conserved interactions for 18 non-model species of the CMNR group. This makes CMRegNet to date the most comprehensive database of regulatory interactions of CMNR bacteria. The content of CMRegNet is publicly available online via a web interface found at http://lgcm.icb.ufmg.br/cmregnet.
Subjects
Corynebacterium glutamicum/genetics , Databases, Genetic , Gene Regulatory Networks , Mycobacterium tuberculosis/genetics , Computational Biology , Corynebacterium glutamicum/classification , Gene Expression Regulation, Bacterial , Genes, Bacterial , Internet , Mycobacterium tuberculosis/classification , Phylogeny
ABSTRACT
MOTIVATION: Homology detection is a long-standing challenge in computational biology. To tackle this problem, typically all-versus-all BLAST results are coupled with data partitioning approaches resulting in clusters of putative homologous proteins. One of the main problems, however, has been widely neglected: all clustering tools need a density parameter that adjusts the number and size of the clusters. This parameter is crucial but hard to estimate without gold standard data at hand. Developing a gold standard, however, is a difficult and time-consuming task. Having a reliable method for detecting clusters of homologous proteins across a huge set of species would open opportunities for better understanding the genetic repertoire of bacteria with different lifestyles. RESULTS: Our main contribution is a method for identifying a suitable and robust density parameter for protein homology detection without a given gold standard. To this end, we study the core genome of 89 actinobacteria. This allows us to incorporate background knowledge, i.e. the assumption that a set of evolutionarily closely related species should share a comparably high number of evolutionarily conserved proteins (emerging from phylum-specific housekeeping genes). We apply our strategy to find genes/proteins that are specific for certain actinobacterial lifestyles, i.e. different types of pathogenicity. The whole study was performed with transitivity clustering, as it only requires a single intuitive density parameter and has been shown to be well applicable to the task of protein sequence clustering. Note, however, that the presented strategy generally does not depend on our clustering method but can easily be adapted to other clustering approaches. AVAILABILITY: All results are publicly available at http://transclust.mmci.uni-saarland.de/actino_core/ or as Supplementary Material of this article.
CONTACT: roettger@mpi-inf.mpg.de SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Actinobacteria/classification , Bacterial Proteins/chemistry , Sequence Homology, Amino Acid , Actinobacteria/genetics , Actinobacteria/pathogenicity , Algorithms , Bacterial Proteins/genetics , Cluster Analysis , Genome, Bacterial , Models, Genetic , Phylogeny , Sequence Alignment
ABSTRACT
Post-genomic analysis techniques such as next-generation sequencing have produced vast amounts of data about microorganisms, including genetic sequences, their functional annotations and gene regulatory interactions. The latter are genetic mechanisms that control a cell's characteristics, for instance, pathogenicity as well as survival and reproduction strategies. CoryneRegNet is the reference database and analysis platform for corynebacterial gene regulatory networks. In this article we introduce the updated version 6.0 of CoryneRegNet and describe the updated database content, which includes 6352 corynebacterial regulatory interactions compared with 4928 interactions in release 5.0 and 3235 regulations in release 4.0. We also demonstrate how we support the community by integrating analysis and visualization features for transiently imported custom data, such as gene regulatory interactions. Furthermore, with release 6.0, we provide easy-to-use functions that allow the user to submit data for persistent storage in the CoryneRegNet database. Thus, it offers important options to its users in terms of community demands. CoryneRegNet is publicly available at http://www.coryneregnet.de.
Subjects
Corynebacterium/genetics , Databases, Genetic , Gene Regulatory Networks , Computer Graphics , Gene Expression Regulation, Bacterial , Molecular Sequence Annotation
ABSTRACT
OBJECTIVE: The objective of this scoping review is to describe the scope and nature of research on the monitoring of clinical artificial intelligence (AI) systems. The review will identify the various methodologies used to monitor clinical AI, while also mapping the factors that influence the selection of monitoring approaches. INTRODUCTION: AI is being used in clinical decision-making at an increasing rate. While much attention has been directed toward the development and validation of AI for clinical applications, the practical implementation aspects, notably the establishment of rational monitoring/quality assurance systems, have received comparatively limited scientific interest. Given the scarcity of evidence and the heterogeneity of methodologies used in this domain, there is a compelling rationale for conducting a scoping review on this subject. INCLUSION CRITERIA: This scoping review will include any publications that describe systematic, continuous, or repeated initiatives that evaluate or predict clinical performance of AI models with direct implications for the management of patients in any segment of the health care system. METHODS: Publications will be identified through searches of the MEDLINE (Ovid), Embase (Ovid), and Scopus databases. Additionally, backward and forward citation searches, as well as a thorough investigation of gray literature, will be conducted. Title and abstract screening, full-text evaluation, and data extraction will be performed by 2 or more independent reviewers. Data will be extracted using a tool developed by the authors. The results will be presented graphically and narratively. REVIEW REGISTRATION: Open Science Framework https://osf.io/afkrn.
Subjects
Artificial Intelligence , Review Literature as Topic , Humans
ABSTRACT
Motivation: Nanobodies are a subclass of immunoglobulins whose binding site consists of only one peptide chain, bestowing favorable biophysical properties. Recently, the first nanobody therapy was approved, paving the way for further clinical applications of this antibody format. Further development of nanobody-based therapeutics could be streamlined by computational methods. One such method is infilling, the positional prediction of biologically feasible mutations in nanobodies. Being able to identify possible positional substitutions based on sequence context facilitates the functional design of such molecules. Results: Here we present nanoBERT, a nanobody-specific transformer to predict amino acids in a given position in a query sequence. We demonstrate the need to develop such a machine learning-based protocol as opposed to gene-specific positional statistics, since an appropriate genetic reference is not available. We benchmark nanoBERT with respect to human-based language models and ESM-2, demonstrating the benefit of domain-specific language models. We also demonstrate the benefit of employing nanobody-specific predictions for fine-tuning on an experimentally measured thermostability dataset. We hope that nanoBERT will help engineers in a range of predictive tasks for designing therapeutic nanobodies. Availability and implementation: https://huggingface.co/NaturalAntibody/.