Abstract
With omics data growing and evolving rapidly, the demand for streamlined and adaptable tools to handle bioinformatics analysis continues to grow. In response to this need, Automated Bioinformatics Analysis (AutoBA) is introduced, an autonomous AI agent designed explicitly for fully automated multi-omic analyses based on large language models (LLMs). AutoBA simplifies the analytical process by requiring minimal user input while delivering detailed step-by-step plans for various bioinformatics tasks. AutoBA's capacity to self-design analysis processes based on input data variations further underscores its versatility. Compared with online bioinformatic services, AutoBA offers multiple LLM backends, with options for both online and local usage, prioritizing data security and user privacy. In comparison to ChatGPT and open-source LLMs, an automated code repair (ACR) mechanism in AutoBA is designed to improve its stability in automated end-to-end bioinformatics analysis tasks. Moreover, unlike predefined pipelines, AutoBA can adapt as new bioinformatics tools emerge. Overall, AutoBA represents an advanced and convenient tool, offering robustness and adaptability for conventional multi-omic analyses.
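The automated code repair (ACR) idea mentioned in this abstract can be pictured as a run, check, and repair loop. The following is a minimal sketch under stated assumptions, not AutoBA's actual implementation: `try_run`, `run_with_repair`, and `max_attempts` are hypothetical names, and `repair_fn` is a placeholder for whatever LLM call proposes a corrected script.

```python
import traceback
from typing import Callable, Tuple


def try_run(code: str) -> Tuple[bool, str]:
    """Execute a code string in a fresh namespace; return (ok, error text)."""
    try:
        exec(code, {"__name__": "__sketch__"})
        return True, ""
    except Exception:
        return False, traceback.format_exc()


def run_with_repair(
    code: str,
    repair_fn: Callable[[str, str], str],
    max_attempts: int = 3,
) -> Tuple[bool, str]:
    """Run code; on failure, hand the code and its error to repair_fn and retry.

    repair_fn is a hypothetical stand-in for an LLM call that returns a
    revised version of the code. Returns (succeeded, last code that was run).
    """
    for attempt in range(max_attempts):
        ok, error_text = try_run(code)
        if ok:
            return True, code
        # Only request a repair if another attempt remains to test it.
        if attempt < max_attempts - 1:
            code = repair_fn(code, error_text)
    return False, code
```

Passing the error text back to the repair step is the key design choice here: the retry is informed by the failure rather than being a blind regeneration. Running generated code in-process with `exec` is only suitable for a sketch; a real agent would isolate execution.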
Abstract
Large language models (LLMs) have recently shown tremendous potential for advancing medical diagnosis, particularly dermatological diagnosis, an important task because skin and subcutaneous diseases rank high among the leading contributors to the global burden of nonfatal diseases. Here we present SkinGPT-4, an interactive dermatology diagnostic system based on multimodal large language models. We aligned a pre-trained vision transformer with an LLM named Llama-2-13b-chat by assembling an extensive collection of skin disease images (comprising 52,929 publicly available and proprietary images) along with clinical concepts and doctors' notes, and by designing a two-step training strategy. We quantitatively evaluated SkinGPT-4 on 150 real-life cases with board-certified dermatologists. With SkinGPT-4, users can upload their own skin photos for diagnosis, and the system can autonomously evaluate the images, identify the characteristics and categories of the skin conditions, perform in-depth analysis, and provide interactive treatment recommendations.
Subject(s)
Dermatology, Mobile Applications, Skin Diseases, Skin Diseases/diagnosis, Biological Models, Computer Simulation, Machine Learning, Dermatology/methods
Abstract
MOTIVATION: Macrocyclic peptides hold great promise as therapeutics targeting intracellular proteins. This stems from their remarkable ability to bind flat protein surfaces with high affinity and specificity while potentially traversing the cell membrane. Research has already explored their use in developing inhibitors for intracellular proteins, such as KRAS, a well-known driver in various cancers. However, computational approaches for de novo macrocyclic peptide design remain largely unexplored. RESULTS: Here, we introduce HELM-GPT, a novel method that combines the strength of the hierarchical editing language for macromolecules (HELM) representation and generative pre-trained transformer (GPT) for de novo macrocyclic peptide design. Through reinforcement learning (RL), our experiments demonstrate that HELM-GPT has the ability to generate valid macrocyclic peptides and optimize their properties. Furthermore, we introduce a contrastive preference loss during the RL process, which further enhances the optimization performance. Finally, to co-optimize peptide permeability and KRAS binding affinity, we propose a step-by-step optimization strategy, demonstrating its effectiveness in generating molecules fulfilling both criteria. In conclusion, the HELM-GPT method can be used to identify novel macrocyclic peptides to target intracellular proteins. AVAILABILITY AND IMPLEMENTATION: The code and data of HELM-GPT are freely available on GitHub (https://github.com/charlesxu90/helm-gpt).
Subject(s)
Cyclic Peptides, Cyclic Peptides/chemistry, Computational Biology/methods, Drug Design, Peptides/chemistry, Humans, Algorithms, Software
Abstract
Automated karyotyping is of great importance for cytogenetic research, as it speeds up the process for cytogeneticists by incorporating AI-driven automated segmentation and classification techniques. Existing frameworks confront two primary issues: first, the necessity for instance-level data annotation with either detection bounding boxes or semantic masks for training, and second, poor robustness, particularly when confronted with domain shifts. In this work, we first propose an accurate segmentation framework, namely KaryoXpert. This framework leverages the strengths of both morphology algorithms and deep learning models, allowing for efficient training that removes the need to acquire manually labeled ground-truth mask annotations. Additionally, we present an accurate classification model based on metric learning, designed to overcome the challenges posed by inter-class similarity and batch effects. Our framework exhibits state-of-the-art performance with exceptional robustness in both chromosome segmentation and classification. The proposed KaryoXpert framework showcases its capacity for instance-level chromosome segmentation even in the absence of annotated data, offering novel insights into research on automated chromosome segmentation. The proposed method has been successfully deployed to support clinical karyotype diagnosis.
Subject(s)
Karyotyping, Humans, Karyotyping/methods, Metaphase, Algorithms, Human Chromosomes/genetics, Computer-Assisted Image Processing/methods, Deep Learning
Abstract
Artificial intelligence (AI) in omics analysis raises privacy threats to patients. Here, we briefly discuss risk factors to patient privacy in data sharing, model training, and release, as well as methods to safeguard and evaluate patient privacy in AI-driven omics methods.
Subject(s)
Artificial Intelligence, Genomics, Humans, Genomics/methods, Privacy, Information Dissemination
Abstract
Modern machine learning models applied to omic data analysis pose a threat of privacy leakage for the patients in those datasets. Here, we proposed a secure and privacy-preserving machine learning method (PPML-Omics) by designing a decentralized, differentially private federated learning algorithm. We applied PPML-Omics to analyze data from three sequencing technologies and addressed the privacy concern in three major tasks of omic data under three representative deep learning models. We examined privacy breaches in depth through privacy attack experiments and demonstrated that PPML-Omics could protect patients' privacy. In each of these applications, PPML-Omics was able to outperform the comparison methods under the same level of privacy guarantee, demonstrating the versatility of the method in simultaneously balancing privacy-preserving capability and utility in omic data analysis. Furthermore, we gave a theoretical proof of the privacy-preserving capability of PPML-Omics, suggesting the first mathematically guaranteed method with robust and generalizable empirical performance in protecting patients' privacy in omic data.
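To make the combination of differential privacy and federated learning concrete, here is a minimal sketch of the generic clip-and-noise (Gaussian mechanism) pattern followed by averaging. It is an illustration under stated assumptions, not the PPML-Omics algorithm: the abstract does not give the decentralized protocol, and the function names, `max_norm`, and `noise_std` are hypothetical.

```python
import math
import random
from typing import List, Sequence


def clip_l2(update: Sequence[float], max_norm: float) -> List[float]:
    """Scale an update down so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in update))
    scale = 1.0 if norm <= max_norm else max_norm / norm
    return [v * scale for v in update]


def privatize(
    update: Sequence[float],
    max_norm: float,
    noise_std: float,
    rng: random.Random,
) -> List[float]:
    """Clip a client's update, then add independent Gaussian noise per coordinate."""
    clipped = clip_l2(update, max_norm)
    return [v + rng.gauss(0.0, noise_std) for v in clipped]


def federated_average(updates: Sequence[Sequence[float]]) -> List[float]:
    """Average the (already privatized) client updates coordinate by coordinate."""
    n_clients = len(updates)
    return [sum(column) / n_clients for column in zip(*updates)]
```

Clipping bounds how much any single client can shift the average, which is what lets a fixed noise scale translate into a privacy guarantee. Each client would call `privatize` locally so that only noisy updates leave the site.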
Subject(s)
Algorithms, Privacy, Humans, Data Analysis, Machine Learning, Technology
Abstract
Heterogeneous data is endemic in medical imaging because hospitals use diverse device models and settings. However, there are few open-source frameworks for federated heterogeneous medical image analysis that offer personalization and privacy protection without requiring changes to existing model structures or the sharing of any private data. Here, we proposed PPPML-HMI, a novel open-source learning paradigm for personalized and privacy-preserving federated heterogeneous medical image analysis. To the best of our knowledge, personalization and privacy protection were discussed simultaneously for the first time under the federated scenario by integrating the PerFedAvg algorithm and designing a novel cyclic secure aggregation with the homomorphic encryption algorithm. To show the utility of PPPML-HMI, we applied it to a simulated classification task, namely the classification of healthy people and patients from the RAD-ChestCT Dataset, and one real-world segmentation task, namely the segmentation of lung infections from COVID-19 CT scans. Meanwhile, we applied the improved deep leakage from gradients to simulate adversarial attacks and showed the strong privacy-preserving capability of PPPML-HMI. By applying PPPML-HMI to both tasks with different neural networks, a varied number of users, and sample sizes, we demonstrated the strong generalizability of PPPML-HMI in privacy-preserving federated learning on heterogeneous medical images.
Subject(s)
COVID-19, Privacy, Humans, Algorithms, Hospitals, Learning
Abstract
Revoking personal private data is one of the basic human rights. However, this right is often overlooked or infringed upon due to the increasing collection and use of patient data for model training. In order to secure patients' right to be forgotten, we proposed a solution by using auditing to guide the forgetting process, where auditing means determining whether a dataset has been used to train the model and forgetting requires the information of a query dataset to be forgotten from the target model. We unified these two tasks by introducing an approach called knowledge purification. To implement our solution, we developed an audit-to-forget software (AFS), which is able to evaluate and revoke patients' private data from pre-trained deep learning models. Here, we show the usability of AFS and its application potential in real-world intelligent healthcare to enhance privacy protection and data revocation rights.
Subject(s)
Computer Security, Privacy, Humans, Confidentiality, Software, Delivery of Health Care
Abstract
Repetitive DNA sequences play critical roles in driving evolution, inducing variation, and regulating gene expression. In this review, we summarized the definition, arrangement, and structural characteristics of repeats. In addition, we introduced diverse biological functions of repeats and reviewed existing methods for automatic repeat detection, classification, and masking. Finally, we analyzed the type, structure, and regulation of repeats in the human genome and their role in the induction of complex diseases. We believe that this review will facilitate a comprehensive understanding of repeats and provide guidance for repeat annotation and in-depth exploration of their association with human diseases.
Subject(s)
Human Genome, Humans, Base Sequence
Abstract
Antibody leads must fulfill multiple desirable properties to be clinical candidates. Primarily due to the low throughput of the experimental procedure, the need for such multi-property optimization causes a bottleneck in preclinical antibody discovery and development, because addressing one issue usually causes another. We developed a reinforcement learning (RL) method, named AB-Gen, for antibody library design using a generative pre-trained transformer (GPT) as the policy network of the RL agent. We showed that this model can learn the antibody space of heavy chain complementarity determining region 3 (CDRH3) and generate sequences with similar property distributions. In addition, when using human epidermal growth factor receptor-2 (HER2) as the target, the agent model of AB-Gen was able to generate novel CDRH3 sequences that fulfill multi-property constraints. In total, 509 generated sequences were able to pass all property filters, and three highly conserved residues were identified. The importance of these residues was further demonstrated by molecular dynamics simulations, confirming that the agent model was capable of grasping important information in this complex optimization task. Overall, the AB-Gen method is able to design novel antibody sequences with a higher success rate than the traditional propose-then-filter approach. It has the potential to be used in practical antibody design, thus empowering the antibody discovery and development process. The source code of AB-Gen is freely available at Zenodo (https://doi.org/10.5281/zenodo.7657016) and BioCode (https://ngdc.cncb.ac.cn/biocode/tools/BT007341).
Subject(s)
Antibodies, Molecular Dynamics Simulation, Humans, Gene Library, Software
Abstract
The relentless evolution of SARS-CoV-2 poses a significant threat to public health, as it adapts to immune pressure from vaccines and natural infections. Gaining insights into potential antigenic changes is critical but challenging due to the vast sequence space. Here, we introduce the Machine Learning-guided Antigenic Evolution Prediction (MLAEP), which combines structure modeling, multi-task learning, and genetic algorithms to predict the viral fitness landscape and explore antigenic evolution via in silico directed evolution. By analyzing existing SARS-CoV-2 variants, MLAEP accurately infers variant order along antigenic evolutionary trajectories, correlating with corresponding sampling time. Our approach identified novel mutations in immunocompromised COVID-19 patients and emerging variants like XBB.1.5. Additionally, MLAEP predictions were validated through in vitro neutralizing antibody binding assays, demonstrating that the predicted variants exhibited enhanced immune evasion. By profiling existing variants and predicting potential antigenic changes, MLAEP aids in vaccine development and enhances preparedness against future SARS-CoV-2 variants.
Subject(s)
COVID-19, Deep Learning, Humans, SARS-CoV-2/genetics, Neutralizing Antibodies
Abstract
Alternative polyadenylation (APA) enables a gene to generate multiple transcripts with different 3' ends, a process that is dynamic across different cell types or conditions. Many computational methods have been developed to characterize sample-specific APA using the corresponding RNA-seq data, but they suffer from high error rates in both polyadenylation site (PAS) identification and quantification of PAS usage (PAU), and from bias toward 3' untranslated regions. Here we developed a tool for APA identification and quantification (APAIQ) from RNA-seq data, which can accurately identify PAS and quantify PAU in a transcriptome-wide manner. Using 3' end-seq data as the benchmark, we showed that APAIQ outperforms current methods on PAS identification and PAU quantification, including DaPars2, Aptardi, mountainClimber, SANPolyA, and QAPA. Finally, applying APAIQ to 421 RNA-seq samples from liver cancer patients, we identified >540 tumor-associated APA events and experimentally validated two intronic polyadenylation candidates, demonstrating its capacity to unveil cancer-related APA with a large-scale RNA-seq data set.
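As a worked illustration of what quantifying PAS usage can mean, the sketch below treats the usage of each site as its share of the reads supporting any site of the same gene. This definition is an assumption made for illustration; the abstract does not state the formula APAIQ uses, and `pas_usage` and the site labels are hypothetical.

```python
from typing import Dict


def pas_usage(read_counts: Dict[str, float]) -> Dict[str, float]:
    """Return each site's share of the gene's total PAS-supporting reads.

    read_counts maps a polyadenylation site label to the number of reads
    assigned to that site within one gene.
    """
    total = sum(read_counts.values())
    if total == 0:
        # No supporting reads: usage is undefined, reported here as zero.
        return {site: 0.0 for site in read_counts}
    return {site: count / total for site, count in read_counts.items()}
```

For a gene with 30 reads at a proximal site and 70 at a distal site, this gives usages of 0.3 and 0.7, so a shift toward the proximal site between conditions shows up as a change in these fractions.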
Subject(s)
Neoplasms, Transcriptome, Humans, Polyadenylation, RNA-Seq, RNA Sequence Analysis/methods, Neoplasms/genetics, 3' Untranslated Regions
Abstract
Spatial transcriptomics technologies are used to profile transcriptomes while preserving spatial information, which enables high-resolution characterization of transcriptional patterns and reconstruction of tissue architecture. Due to the existence of low-resolution spots in recent spatial transcriptomics technologies, uncovering cellular heterogeneity is crucial for disentangling the spatial patterns of cell types, and many related methods have been proposed. Here, we benchmark 18 existing methods for the cellular deconvolution task with 50 real-world and simulated datasets by evaluating the accuracy, robustness, and usability of the methods. We compare these methods comprehensively using different metrics, resolutions, spatial transcriptomics technologies, spot numbers, and gene numbers. In terms of performance, CARD, Cell2location, and Tangram are the best methods for conducting the cellular deconvolution task. To refine our comparative results, we provide decision-tree-style guidelines and recommendations for method selection and their additional features, which will help users easily choose the best method for their needs.
Subject(s)
Benchmarking, Transcriptome, Transcriptome/genetics, Gene Expression Profiling, Technology
Abstract
Gene regulation is a central topic in cell biology. Advances in omics technologies and the accumulation of omics data have provided better opportunities for gene regulation studies than ever before. For this reason, deep learning, as a data-driven predictive modeling approach, has been successfully applied to this field during the past decade. In this article, we aim to give a brief yet comprehensive overview of representative deep-learning methods for gene regulation. Specifically, we discuss and compare the design principles and datasets used by each method, creating a reference for researchers who wish to replicate or improve existing methods. We also discuss the common problems of existing approaches and introduce emerging deep-learning paradigms that could alleviate them. We hope that this article will provide a rich and up-to-date resource and shed light on future research directions in this area.
Subject(s)
Deep Learning, Computational Biology/methods
Abstract
Background: The key challenge in drug discovery is to discover novel compounds with desirable properties. Among the properties, binding affinity to a target is one of the prerequisites and is usually evaluated by molecular docking or quantitative structure activity relationship (QSAR) models. Methods: In this study, we developed SGPT-RL, which uses a generative pre-trained transformer (GPT) as the policy network of the reinforcement learning (RL) agent to optimize the binding affinity to a target. SGPT-RL was evaluated on the Moses distribution learning benchmark and two goal-directed generation tasks, with Dopamine Receptor D2 (DRD2) and Angiotensin-Converting Enzyme 2 (ACE2) as the targets. Both a QSAR model and molecular docking were implemented as the optimization goals in the tasks. The popular Reinvent method was used as the baseline for comparison. Results: The results on the Moses benchmark showed that SGPT-RL learned good property distributions and generated molecules with high validity and novelty. On the two goal-directed generation tasks, both SGPT-RL and Reinvent were able to generate valid molecules with improved target scores. The SGPT-RL method achieved better results than Reinvent on the ACE2 task, where molecular docking was used as the optimization goal. Further analysis shows that SGPT-RL learned conserved scaffold patterns during exploration. Conclusions: The superior performance of SGPT-RL in the ACE2 task indicates that it can be applied to the virtual screening process where molecular docking is widely used as the criterion. In addition, the scaffold patterns learned by SGPT-RL during the exploration process can help chemists better design and discover novel lead candidates.
Subject(s)
Angiotensin-Converting Enzyme 2, Learning, Alanine Transaminase, Molecular Docking Simulation, Benchmarking
Abstract
The accurate annotation of transcription start sites (TSSs) and their usage are critical for the mechanistic understanding of gene regulation in different biological contexts. To fulfill this, specific high-throughput experimental technologies have been developed to capture TSSs in a genome-wide manner, and various computational tools have also been developed for in silico prediction of TSSs solely based on genomic sequences. Most of these computational tools cast the problem as a binary classification task on a balanced dataset, thus resulting in drastic false positive predictions when applied on the genome scale. Here, we present DeeReCT-TSS, a deep learning-based method that is capable of identifying TSSs across the whole genome based on both DNA sequence and conventional RNA sequencing data. We show that by effectively incorporating these two sources of information, DeeReCT-TSS significantly outperforms other solely sequence-based methods on the precise annotation of TSSs used in different cell types. Furthermore, we develop a meta-learning-based extension for simultaneous TSS annotations on 10 cell types, which enables the identification of cell type-specific TSSs. Finally, we demonstrate the high precision of DeeReCT-TSS on two independent datasets by correlating our predicted TSSs with experimentally defined TSS chromatin states. The source code for DeeReCT-TSS is available at https://github.com/JoshuaChou2018/DeeReCT-TSS_release and https://ngdc.cncb.ac.cn/biocode/tools/BT007316.
Subject(s)
Genomics, RNA-Seq, Base Sequence, Transcription Initiation Site, RNA Sequence Analysis/methods
Abstract
MOTIVATION: Unveiling the heterogeneity in tissues is crucial for exploring cell-cell interactions and cellular targets of human diseases. Spatial transcriptomics (ST) provides spatial gene expression profiles, which have revolutionized our biological understanding, but variations in the cell-type proportions of each spot, which contains dozens of cells, would confound downstream analysis. Therefore, deconvolution of ST has been an indispensable step and a technical challenge toward a higher-resolution panorama of tissues. RESULTS: Here, we propose a novel ST deconvolution method called SD2, which integrates the spatial information of ST data and embraces an important characteristic, dropout, that is traditionally considered an obstruction in single-cell RNA sequencing (scRNA-seq) data analysis. First, we extract the dropout-based genes as informative features from ST and scRNA-seq data by fitting a Michaelis-Menten function. After synthesizing pseudo-ST spots by randomly composing cells from scRNA-seq data, an auto-encoder is applied to discover low-dimensional and non-linear representations of the real- and pseudo-ST spots. Next, we create a graph containing embedded profiles as nodes, and edges determined by transcriptional similarity and spatial relationship. Given the graph, a graph convolutional neural network is used to predict the cell-type compositions for real-ST spots. We benchmark the performance of SD2 on the simulated seqFISH+ dataset with different resolutions and measurements, where it shows superior performance compared with the state-of-the-art methods. SD2 is further validated on three real-world datasets with different ST technologies and demonstrates the capability to localize cell-type composition accurately with quantitative evidence. Finally, an ablation study is conducted to verify the contribution of the different modules proposed in SD2.
AVAILABILITY AND IMPLEMENTATION: SD2 is freely available on GitHub (https://github.com/leihouyeung/SD2) and Zenodo (https://doi.org/10.5281/zenodo.7024684). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
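The pseudo-ST spot synthesis step described in this abstract can be sketched as follows: draw cells at random from scRNA-seq data, sum their expression profiles, and record the cell-type proportions as the known label. This is a minimal illustration and not the SD2 code; sampling with replacement, the function name, and the plain-list data layout are assumptions.

```python
import random
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def make_pseudo_spot(
    expression: Sequence[Sequence[float]],
    cell_types: Sequence[str],
    n_cells: int,
    rng: random.Random,
) -> Tuple[List[float], Dict[str, float]]:
    """Compose one pseudo-spot from randomly chosen single cells.

    expression[i] is the gene expression vector of cell i and cell_types[i]
    is its annotated type. Returns the summed expression profile and the
    cell-type proportions of the chosen cells.
    """
    # Assumption: cells are drawn uniformly with replacement.
    chosen = [rng.randrange(len(expression)) for _ in range(n_cells)]
    n_genes = len(expression[0])
    spot = [sum(expression[i][g] for i in chosen) for g in range(n_genes)]
    counts = Counter(cell_types[i] for i in chosen)
    proportions = {cell_type: c / n_cells for cell_type, c in counts.items()}
    return spot, proportions
```

Because the composition of every pseudo-spot is known by construction, such spots can serve as labeled training examples for a model that must then predict compositions of real spots.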
Subject(s)
Single-Cell Analysis, Transcriptome, Humans, RNA Sequence Analysis, Gene Expression Profiling, Software
Abstract
Alternative polyadenylation (APA) is a crucial step in post-transcriptional regulation. Previous bioinformatic studies have mainly focused on the recognition of polyadenylation sites (PASs) in a given genomic sequence, which is a binary classification problem. Recently, computational methods for predicting the usage level of alternative PASs in the same gene have been proposed. However, all of them cast the problem as a non-quantitative pairwise comparison task and do not take the competition among multiple PASs into account. To address this, here we propose a deep learning architecture, Deep Regulatory Code and Tools for Alternative Polyadenylation (DeeReCT-APA), to quantitatively predict the usage of all alternative PASs of a given gene. To accommodate different genes with potentially different numbers of PASs, DeeReCT-APA treats the problem as a regression task with a variable-length target. Based on a convolutional neural network-long short-term memory (CNN-LSTM) architecture, DeeReCT-APA extracts sequence features with CNN layers, uses bidirectional LSTM to explicitly model the interactions among competing PASs, and outputs percentage scores representing the usage levels of all PASs of a gene. In addition to the fact that only our method can quantitatively predict the usage of all the PASs within a gene, we show that our method consistently outperforms other existing methods on three different tasks for which they are trained: pairwise comparison task, highest usage prediction task, and ranking task. Finally, we demonstrate that our method can be used to predict the effect of genetic variations on APA patterns and shed light on future mechanistic understanding in APA regulation. Our code and data are available at https://github.com/lzx325/DeeReCT-APA-repo.
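The idea of a variable-length target whose entries are percentage scores can be illustrated with a normalization over however many PASs a gene has. The sketch below uses a softmax, which is one common way to turn per-site scores into shares that sum to 100; whether DeeReCT-APA uses exactly this normalization is an assumption, and `usage_percentages` is a hypothetical name.

```python
import math
from typing import List, Sequence


def usage_percentages(scores: Sequence[float]) -> List[float]:
    """Map any number of per-site scores to percentages that sum to 100."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [100.0 * e / total for e in exps]
```

Because the normalization runs over all sites of a gene at once, raising the score of one site necessarily lowers the predicted usage of the others, which matches the competition among PASs that the abstract emphasizes.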
Subject(s)
Deep Learning, Polyadenylation, Gene Expression Regulation, Computer Neural Networks, Computational Biology/methods, 3' Untranslated Regions
Abstract
Periodontitis is a prevalent and irreversible chronic inflammatory disease in both developed and developing countries, and affects about 20-50% of the global population. A tool for automatically diagnosing periodontitis is in high demand for screening at-risk people, and early detection could prevent the onset of tooth loss, especially in local communities and health care settings with limited dental professionals. In the medical field, doctors need to understand and trust the decisions made by computational models, so developing interpretable models is crucial for disease diagnosis. Based on these considerations, we propose an interpretable method called Deetal-Perio to predict the severity degree of periodontitis in dental panoramic radiographs. In our method, alveolar bone loss (ABL), the clinical hallmark for periodontitis diagnosis, could be interpreted as the key feature. To calculate ABL, we also propose a method for teeth numbering and segmentation. First, Deetal-Perio segments and indexes each individual tooth via Mask R-CNN combined with a novel calibration method. Next, Deetal-Perio segments the contour of the alveolar bone and calculates a ratio for each tooth to represent ABL. Finally, Deetal-Perio predicts the severity degree of periodontitis given the ratios of all the teeth. The Macro F1-score and accuracy of the periodontitis prediction task in our method reach 0.894 and 0.896, respectively, on the Suzhou data set, and 0.820 and 0.824, respectively, on the Zhongshan data set. The entire architecture not only outperforms state-of-the-art methods and shows robustness on two data sets in both the periodontitis prediction task and the teeth numbering and segmentation tasks, but is also interpretable, helping doctors understand why Deetal-Perio works so well.
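Since this abstract reports a Macro F1-score alongside accuracy, a short worked definition may help: the F1 score is computed separately for each class and the per-class scores are then averaged with equal weight. The sketch below is a plain implementation of that standard definition; the severity labels in the example are hypothetical and do not come from the paper's data.

```python
from typing import Hashable, List, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over classes seen in y_true or y_pred."""
    labels = set(y_true) | set(y_pred)
    f1_scores: List[float] = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical example with two severity classes.
truth = ["mild", "mild", "severe", "severe"]
predicted = ["mild", "severe", "severe", "severe"]
score = macro_f1(truth, predicted)  # mean of 2/3 (mild) and 4/5 (severe)
```

Unlike accuracy, the macro average gives a rare severity class the same weight as a common one, which is why the two numbers are reported together.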
Abstract
COVID-19 has caused a global pandemic and become the most urgent threat to the entire world. Tremendous efforts and resources have been invested in developing diagnosis, prognosis and treatment strategies to combat the disease. Although nucleic acid detection has been mainly used as the gold standard to confirm this RNA virus-based disease, it has been shown that such a strategy has a high false negative rate, especially for patients in the early stage, and thus CT imaging has been applied as a major diagnostic modality in confirming positive COVID-19. Despite the various, urgent advances in developing artificial intelligence (AI)-based computer-aided systems for CT-based COVID-19 diagnosis, most of the existing methods can only perform classification, whereas the state-of-the-art segmentation method requires a high level of human intervention. In this paper, we propose a fully-automatic, rapid, accurate, and machine-agnostic method that can segment and quantify the infection regions on CT scans from different sources. Our method is founded upon two innovations: 1) the first CT scan simulator for COVID-19, by fitting the dynamic change of real patients' data measured at different time points, which greatly alleviates the data scarcity issue; and 2) a novel deep learning algorithm to solve the large-scene-small-object problem, which decomposes the 3D segmentation problem into three 2D ones, and thus reduces the model complexity by an order of magnitude and, at the same time, significantly improves the segmentation accuracy. Comprehensive experimental results over multi-country, multi-hospital, and multi-machine datasets demonstrate the superior performance of our method over the existing ones and suggest its important application value in combating the disease.
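The decomposition of a 3D segmentation problem into three 2D ones can be pictured as slicing the CT volume along its three orthogonal axes, so that a 2D model can process each stack of slices. The sketch below shows only the slicing step on a nested-list volume; how the three sets of 2D predictions are combined is not stated in the abstract and is left out. The function name and the `[z][y][x]` index order are assumptions.

```python
from typing import List

Slice2D = List[List[float]]
Volume = List[Slice2D]  # indexed as volume[z][y][x]


def slices_along(volume: Volume, axis: int) -> List[Slice2D]:
    """Return the stack of 2D slices perpendicular to the given axis.

    axis 0 fixes z (each slice indexed [y][x]), axis 1 fixes y ([z][x]),
    and axis 2 fixes x ([z][y]).
    """
    nz, ny, nx = len(volume), len(volume[0]), len(volume[0][0])
    if axis == 0:
        return [[row[:] for row in volume[z]] for z in range(nz)]
    if axis == 1:
        return [[volume[z][y][:] for z in range(nz)] for y in range(ny)]
    if axis == 2:
        return [
            [[volume[z][y][x] for y in range(ny)] for z in range(nz)]
            for x in range(nx)
        ]
    raise ValueError("axis must be 0, 1, or 2")
```

Each 2D model then sees inputs with one fewer dimension than the full volume, which is the source of the reduction in model complexity that the abstract describes.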