ABSTRACT
Understanding how genetic variants impact molecular phenotypes is a key goal of functional genomics, currently hindered by reliance on a single haploid reference genome. Here, we present the EN-TEx resource of 1,635 open-access datasets from four donors (~30 tissues × ~15 assays). The datasets are mapped to matched, diploid genomes with long-read phasing and structural variants, instantiating a catalog of >1 million allele-specific loci. These loci exhibit coordinated activity along haplotypes and are less conserved than corresponding non-allele-specific ones. Surprisingly, a deep-learning transformer model can predict the allele-specific activity based only on local nucleotide-sequence context, highlighting the importance of transcription-factor-binding motifs particularly sensitive to variants. Furthermore, combining EN-TEx with existing genome annotations reveals strong associations between allele-specific and GWAS loci. It also enables models for transferring known eQTLs to difficult-to-profile tissues (e.g., from skin to heart). Overall, EN-TEx provides rich data and generalizable models for more accurate personal functional genomics.
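As a rough, generic illustration of the sequence-only prediction task mentioned above (not the model used in the study), the sketch below classifies a short nucleotide window centered on a variant as allele-specific or not; the window size, model dimensions, tokenization, and mean-pooling are arbitrary choices.

```python
import torch
import torch.nn as nn

class AllelicContextClassifier(nn.Module):
    def __init__(self, window=201, d_model=64, nhead=4, nlayers=2):
        super().__init__()
        self.embed = nn.Embedding(5, d_model)          # A, C, G, T, N
        self.pos = nn.Parameter(torch.zeros(1, window, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, nlayers)
        self.head = nn.Linear(d_model, 1)              # P(locus is allele-specific)

    def forward(self, tokens):                         # tokens: (batch, window) ints in [0, 4]
        h = self.encoder(self.embed(tokens) + self.pos)
        return torch.sigmoid(self.head(h.mean(dim=1))) # pool over positions

seq = torch.randint(0, 5, (8, 201))                    # 8 dummy 201-bp windows
print(AllelicContextClassifier()(seq).shape)           # torch.Size([8, 1])
```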
Subjects
Epigenome, Quantitative Trait Loci, Genome-Wide Association Study, Genomics, Phenotype, Single Nucleotide Polymorphism
ABSTRACT
Machine learning has been proposed as an alternative to theoretical modeling when dealing with complex problems in biological physics. However, in this perspective, we argue that a more successful approach is a proper combination of these two methodologies. We discuss how ideas coming from the physical modeling of neuronal processing led to early formulations of computational neural networks, e.g., Hopfield networks. We then show how modern learning approaches like Potts models, Boltzmann machines, and the transformer architecture are related to each other, specifically through a shared energy representation. We summarize recent efforts to establish these connections and provide examples of how each of these formulations, integrating physical modeling and machine learning, has been successful in tackling recent problems in biomolecular structure, dynamics, function, evolution, and design. Instances include protein structure prediction; improvements in the computational complexity and accuracy of molecular dynamics simulations; better inference of the effects of mutations in proteins, leading to improved evolutionary modeling; and, finally, how machine learning is revolutionizing protein engineering and design. Going beyond naturally existing protein sequences, a connection to protein design is discussed in which synthetic sequences are able to fold into naturally occurring motifs, driven by a model rooted in physical principles. We show that this model is "learnable" and propose its future use in the generation of unique sequences that can fold into a target structure.
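To make the stated Hopfield-transformer connection concrete, here is a minimal numerical illustration (ours, not from the article): the update rule of a modern continuous Hopfield network is a softmax-weighted average of stored patterns, which is exactly the attention operation; the pattern dimensions and inverse temperature below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 16))               # 5 stored patterns of dimension 16
query = X[2] + 0.3 * rng.standard_normal(16)   # noisy version of pattern 2
beta = 4.0                                     # inverse temperature

def hopfield_update(xi, patterns, beta):
    scores = beta * patterns @ xi              # similarity to each stored pattern
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax, as in attention
    return patterns.T @ weights                # retrieved (denoised) pattern

retrieved = hopfield_update(query, X, beta)
print(np.argmax(X @ retrieved))                # index of the best-matching stored pattern (here: 2)
```

Reading the stored patterns as both keys and values and the state as the query makes this one step of transformer attention, which is the shared energy view referred to above.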
Subjects
Machine Learning, Neural Networks (Computer), Proteins, Proteins/chemistry, Proteins/metabolism, Protein Engineering/methods, Molecular Dynamics Simulation
ABSTRACT
Aging in an individual refers to the temporal change, mostly decline, in the body's ability to meet physiological demands. Biological age (BA) is a biomarker of chronological aging and can be used to stratify populations to predict certain age-related chronic diseases. BA can be predicted from biomedical features such as brain MRI, retinal, or facial images, but the inherent heterogeneity in the aging process limits the usefulness of BA predicted from individual body systems. In this paper, we developed a multimodal Transformer-based architecture with cross-attention which was able to combine facial, tongue, and retinal images to estimate BA. We trained our model using facial, tongue, and retinal images from 11,223 healthy subjects and demonstrated that using a fusion of the three image modalities achieved the most accurate BA predictions. We validated our approach on a test population of 2,840 individuals with six chronic diseases and obtained a significantly larger difference between chronological age and BA (AgeDiff) than in healthy subjects. We showed that AgeDiff has the potential to be utilized as a standalone biomarker or in conjunction with other known factors for risk stratification and progression prediction of chronic diseases. Our results therefore highlight the feasibility of using multimodal images to estimate and interrogate the aging process.
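A hedged sketch of the kind of cross-attention fusion described above; the token counts, embedding size, and the choice to run joint attention over concatenated modality tokens are illustrative assumptions, not the authors' architecture, and the image encoders producing the tokens are omitted.

```python
import torch
import torch.nn as nn

class CrossModalAgeRegressor(nn.Module):
    def __init__(self, d=128, nhead=4):
        super().__init__()
        self.cross = nn.MultiheadAttention(d, nhead, batch_first=True)
        self.head = nn.Sequential(nn.LayerNorm(d), nn.Linear(d, 1))

    def forward(self, face, tongue, retina):    # each: (batch, tokens, d) from an image encoder
        fused = torch.cat([face, tongue, retina], dim=1)
        # every modality token queries all modalities (cross- plus self-attention)
        attended, _ = self.cross(fused, fused, fused)
        return self.head(attended.mean(dim=1))  # predicted biological age

b, t, d = 2, 8, 128
out = CrossModalAgeRegressor()(torch.randn(b, t, d), torch.randn(b, t, d), torch.randn(b, t, d))
print(out.shape)                                # torch.Size([2, 1])
```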
Subjects
Aging, Electric Power Supplies, Humans, Face, Biomarkers, Chronic Disease
ABSTRACT
N4-acetylcytidine (ac4C) is a disease-associated modification of ribonucleic acid (RNA). Expensive and labor-intensive methods have hindered the exploration of ac4C mechanisms and the development of specific anti-ac4C drugs. Therefore, an advanced prediction model for ac4C in RNA is urgently needed. Despite the construction of various prediction models, several limitations exist: (1) insufficient base-level resolution for ac4C sites; (2) lack of information on species other than Homo sapiens; (3) lack of information on RNA types other than mRNA; and (4) lack of interpretation for each prediction. In light of these limitations, we have reconstructed the previous benchmark dataset and introduced a new dataset including balanced RNA sequences from multiple species and RNA types, while also providing base-level resolution for ac4C sites. Additionally, we have proposed a novel transformer-based architecture and pipeline for predicting ac4C sites, allowing for highly accurate predictions, visually interpretable results, and no restrictions on the length of input RNA sequences. Statistically, our work has improved the accuracy of predicting specific ac4C sites in multiple species from less than 40% to around 85%, achieving a high AUC > 0.9. These results significantly surpass the performance of all existing models.
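For illustration only (not the published pipeline), a per-base tagging model of this kind can be sketched as follows: a transformer encoder scores every position of an arbitrary-length RNA sequence, and the cytidine positions are read out as candidate ac4C sites; the vocabulary, dimensions, and omitted positional encoding are simplifications.

```python
import torch
import torch.nn as nn

VOCAB = {"A": 0, "C": 1, "G": 2, "U": 3, "N": 4}

class Ac4cSiteTagger(nn.Module):
    def __init__(self, d_model=64, nhead=4, nlayers=2):
        super().__init__()
        self.embed = nn.Embedding(len(VOCAB), d_model)     # positional encoding omitted for brevity
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, nlayers)
        self.per_base = nn.Linear(d_model, 1)              # one logit per nucleotide

    def forward(self, tokens):                             # (batch, length), any length
        return self.per_base(self.encoder(self.embed(tokens))).squeeze(-1)

seq = "AUGCCUACGCUAC"
tokens = torch.tensor([[VOCAB[c] for c in seq]])
logits = Ac4cSiteTagger()(tokens)                          # (1, len(seq)) per-base scores
c_positions = [i for i, c in enumerate(seq) if c == "C"]
print(torch.sigmoid(logits)[0, c_positions])               # ac4C probabilities at cytidine positions
```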
Subjects
Cytidine, Cytidine/analogs & derivatives, RNA, Cytidine/genetics, RNA/genetics, RNA/chemistry, Humans, Computational Biology/methods, Animals, Software, Algorithms
ABSTRACT
Protein acetylation is one of the most extensively studied post-translational modifications (PTMs) due to its significant roles across a myriad of biological processes. Although many computational tools for acetylation site identification have been developed, there is a lack of benchmark datasets and bespoke predictors for non-histone acetylation site prediction. To address these problems, we have contributed to both dataset creation and predictor benchmarking in this study. First, we construct a non-histone acetylation site benchmark dataset, namely NHAC, which includes 11 subsets according to sequence length, ranging from 11 to 61 amino acids. Each sequence length comprises 886 positive samples and 4707 negative samples. Secondly, we propose TransPTM, a transformer-based neural network model for non-histone acetylation site prediction. During the data representation phase, per-residue contextualized embeddings are extracted using ProtT5 (an existing pre-trained protein language model). This is followed by a graph neural network framework, which consists of three TransformerConv layers for feature extraction and a multilayer perceptron module for classification. The benchmark results show that TransPTM achieves competitive performance for non-histone acetylation site prediction relative to three state-of-the-art tools. It improves our comprehension of the PTM mechanism and provides a theoretical basis for developing drug targets for diseases. Moreover, the created PTM dataset fills the gap in non-histone acetylation site datasets and is beneficial to the related communities. The source code and data used by TransPTM are accessible at https://www.github.com/TransPTM/TransPTM.
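A hedged sketch of the described pipeline, assuming PyTorch Geometric is available and that per-residue ProtT5 embeddings (1024-dimensional) serve as node features on a simple chain graph over the peptide window; the graph construction, hidden sizes, and pooling are our own placeholders, not the published implementation.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import TransformerConv

class TransPTMSketch(nn.Module):
    def __init__(self, in_dim=1024, hidden=128, heads=4):
        super().__init__()
        self.conv1 = TransformerConv(in_dim, hidden, heads=heads, concat=False)
        self.conv2 = TransformerConv(hidden, hidden, heads=heads, concat=False)
        self.conv3 = TransformerConv(hidden, hidden, heads=heads, concat=False)
        self.mlp = nn.Sequential(nn.Linear(hidden, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x, edge_index):                   # x: (n_residues, 1024) ProtT5-style features
        for conv in (self.conv1, self.conv2, self.conv3):
            x = torch.relu(conv(x, edge_index))
        return torch.sigmoid(self.mlp(x.mean(dim=0)))   # acetylation probability for the window

n = 31                                                  # e.g. a 31-residue window centred on a lysine
x = torch.randn(n, 1024)                                # placeholder for ProtT5 embeddings
edges = torch.tensor([[i, i + 1] for i in range(n - 1)]).t()
edge_index = torch.cat([edges, edges.flip(0)], dim=1)   # simple bidirectional chain graph
print(TransPTMSketch()(x, edge_index))
```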
Subjects
Neural Networks (Computer), Post-Translational Protein Processing, Acetylation, Computational Biology/methods, Protein Databases, Software, Algorithms, Humans, Proteins/chemistry, Proteins/metabolism
ABSTRACT
Spatial transcriptomics technologies have shed light on the complexities of tissue structures by accurately mapping spatial microenvironments. Nonetheless, many methods, especially those used on platforms like Visium, sacrifice spatial detail owing to intrinsic resolution limitations. In response, we introduce TransformerST, an innovative, unsupervised model anchored in the Transformer architecture, which operates independently of references, thereby ensuring cost-efficiency by circumventing the need for single-cell RNA sequencing. TransformerST not only elevates Visium data from a multicellular level to single-cell granularity but also shows adaptability across diverse spatial transcriptomics platforms. By employing a vision transformer-based encoder, it discerns latent image-gene expression co-representations and is further enhanced by spatial correlations derived from an adaptive graph Transformer module. The cross-scale graph network used in super-resolution significantly boosts the model's accuracy, unveiling complex structure-function relationships within histology images. Empirical evaluations validate its ability to reveal tissue subtleties at the single-cell scale. Crucially, TransformerST navigates image-gene co-representation to maximize the synergistic utility of gene expression and histology images, thereby emerging as a pioneering tool in spatial transcriptomics. It not only enhances resolution to the single-cell level but also introduces a novel approach that optimally utilizes histology images alongside gene expression, providing a refined lens for investigating spatial transcriptomics.
Subjects
Gene Expression Profiling, Gene Expression
ABSTRACT
Accurate prediction of molecular properties is fundamental in drug discovery and development, providing crucial guidance for effective drug design. A critical factor in achieving accurate molecular property prediction lies in the appropriate representation of molecular structures. Presently, prevalent deep learning-based molecular representations rely on 2D structure information as the primary molecular representation, often overlooking essential three-dimensional (3D) conformational information due to the inherent limitations of 2D structures in conveying atomic spatial relationships. In this study, we propose employing the Gram matrix as a condensed representation of 3D molecular structure and as an efficient pretraining objective. Subsequently, we leverage this matrix to construct a novel molecular representation model, Pre-GTM, which inherently encapsulates 3D information. The model accurately predicts the 3D structure of a molecule by estimating the Gram matrix. Our findings demonstrate that the Pre-GTM model outperforms the baseline Graphormer model and other pretrained models on the QM9 and MoleculeNet quantitative property prediction tasks. The integration of the Gram matrix into the Pre-GTM model as a condensed representation of 3D molecular structure opens up promising avenues for its potential application across various domains of molecular research, including drug design, materials science, and chemical engineering.
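The Gram-matrix idea can be made concrete in a few lines (a generic sketch, not the authors' code): center the 3D coordinates, form G = X Xᵀ as a rotation-invariant target, use the squared error to a predicted Gram matrix as a pretraining-style loss, and recover coordinates up to rotation/reflection by eigendecomposition.

```python
import torch

def gram_matrix(coords):
    """coords: (n_atoms, 3) -> rotation-invariant (n_atoms, n_atoms) Gram matrix."""
    centered = coords - coords.mean(dim=0, keepdim=True)
    return centered @ centered.T

def gram_loss(predicted_gram, true_coords):
    """Pretraining-style objective: match a predicted Gram matrix to the true one."""
    return torch.mean((predicted_gram - gram_matrix(true_coords)) ** 2)

coords = torch.randn(10, 3)                 # dummy conformer with 10 atoms
G = gram_matrix(coords)

# 3D structure is recoverable from G up to rotation/reflection via eigendecomposition
eigval, eigvec = torch.linalg.eigh(G)
recovered = eigvec[:, -3:] * eigval[-3:].clamp(min=0).sqrt()
print(torch.allclose(gram_matrix(recovered), G, atol=1e-4))   # True
```

Because the Gram matrix depends only on pairwise inner products of centered coordinates, it is invariant to rigid rotations, which is what makes it a convenient condensed target for 3D-aware pretraining.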
Subjects
Molecular Conformation, Molecular Models, Drug Design, Deep Learning, Drug Discovery, Algorithms
ABSTRACT
With the exponential growth of digital data, there is a pressing need for innovative storage media and techniques. DNA molecules, due to their stability, storage capacity, and density, offer a promising solution for information storage. However, DNA storage also faces numerous challenges, such as complex biochemical constraints and encoding efficiency. This paper presents Explorer, a high-efficiency DNA coding algorithm based on the De Bruijn graph, which leverages its capability to characterize local sequences. Explorer enables coding under various biochemical constraints, such as homopolymers, GC content, and undesired motifs. This paper also introduces Codeformer, a fast decoding algorithm based on the transformer architecture, to further enhance decoding efficiency. Numerical experiments indicate that, compared with other advanced algorithms, Explorer not only achieves stable encoding and decoding under various biochemical constraints but also increases the encoding efficiency and bit rate by ~10%. Additionally, Codeformer demonstrates the ability to efficiently decode large quantities of DNA sequences. Under different parameter settings, its decoding efficiency exceeds that of traditional algorithms by more than two-fold. When Codeformer is combined with Reed-Solomon code, its decoding accuracy exceeds 99%, making it a good choice for high-speed decoding applications. These advancements are expected to contribute to the development of DNA-based data storage systems and the broader exploration of DNA as a novel information storage medium.
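The biochemical constraints named above are easy to state programmatically; the checker below (a generic sketch with arbitrary thresholds, unrelated to Explorer's actual encoder) is the kind of per-sequence test that any codeword, e.g. one produced by a De Bruijn-graph walk, would have to satisfy.

```python
def max_homopolymer(seq):
    """Length of the longest run of identical bases."""
    longest, run = 1, 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest

def satisfies_constraints(seq, max_run=3, gc_low=0.4, gc_high=0.6,
                          undesired_motifs=("GAATTC",)):   # e.g. an EcoRI restriction site
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    return (max_homopolymer(seq) <= max_run
            and gc_low <= gc <= gc_high
            and not any(m in seq for m in undesired_motifs))

print(satisfies_constraints("ACGTACGGTCAAGTCT"))   # True: GC = 0.5, longest run = 2
print(satisfies_constraints("AAAAGCGCGCGCGCGC"))   # False: homopolymer run of 4, GC too high
```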
Subjects
Algorithms, DNA, DNA/genetics, DNA/chemistry, Software, DNA Sequence Analysis/methods, Computational Biology/methods
ABSTRACT
In recent years, the advent of spatial transcriptomics (ST) technology has unlocked unprecedented opportunities for delving into the complexities of gene expression patterns within intricate biological systems. Despite its transformative potential, the prohibitive cost of ST technology remains a significant barrier to its widespread adoption in large-scale studies. An alternative, more cost-effective strategy involves employing artificial intelligence to predict gene expression levels using readily accessible whole-slide images stained with Hematoxylin and Eosin (H&E). However, existing methods have yet to fully capitalize on the multimodal information provided by H&E images and ST data with spatial location. In this paper, we propose mclSTExp, a multimodal contrastive learning framework with a Transformer and DenseNet-121 encoder for spatial transcriptomics expression prediction. We conceptualize each spot as a "word", integrating its intrinsic features with spatial context through the self-attention mechanism of a Transformer encoder. This integration is further enriched by incorporating image features via contrastive learning, thereby enhancing the predictive capability of our model. We conducted an extensive evaluation of highly variable genes in two breast cancer datasets and a skin squamous cell carcinoma dataset, and the results demonstrate that mclSTExp exhibits superior performance in predicting spatial gene expression. Moreover, mclSTExp has shown promise in interpreting cancer-specific overexpressed genes, elucidating immune-related genes, and identifying specialized spatial domains annotated by pathologists. Our source code is available at https://github.com/shizhiceng/mclSTExp.
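The image-expression contrastive step can be sketched generically as a symmetric InfoNCE loss that pulls each spot's Transformer-encoded expression embedding toward its own image-patch embedding (e.g., from DenseNet-121) and away from other spots; the embedding sizes and temperature below are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(spot_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE between matched spot-expression and image-patch embeddings."""
    spot_emb = F.normalize(spot_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = spot_emb @ image_emb.T / temperature        # (n_spots, n_spots) similarity matrix
    targets = torch.arange(len(spot_emb))                # the i-th spot matches the i-th patch
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

spots = torch.randn(32, 256)     # e.g. Transformer-encoded spot features
patches = torch.randn(32, 256)   # e.g. image-patch features projected to the same dimension
print(contrastive_loss(spots, patches))
```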
Subjects
Gene Expression Profiling, Humans, Gene Expression Profiling/methods, Breast Neoplasms/genetics, Breast Neoplasms/pathology, Breast Neoplasms/metabolism, Algorithms, Computer-Assisted Image Processing/methods, Computational Biology/methods, Transcriptome, Female, Machine Learning, Neoplastic Gene Expression Regulation
ABSTRACT
Target enrichment sequencing techniques are gaining widespread use in the field of genomics, prized for their economic efficiency and swift processing times. However, their success depends on the performance of the probes and the evenness of sequencing depth across probes. To accurately predict probe coverage depth, a model called Deqformer is proposed in this study. Deqformer utilizes the oligonucleotide sequence of each probe, drawing inspiration from Watson-Crick base pairing and incorporating two BERT encoders to capture the underlying information from the forward and reverse probe strands, respectively. The encoded data are combined with a feed-forward network to make precise predictions of sequencing depth. The performance of Deqformer is evaluated on four different datasets: an SNP panel with 38,200 probes, an lncRNA panel with 2,000 probes, a synthetic panel with 5,899 probes, and an HD-Marker panel for Yesso scallop with 11,000 probes. The SNP and synthetic panels achieve an impressive factor-3 accuracy (F3acc) of 96.24% and 99.66%, respectively, in 5-fold cross-validation. F3acc rates of over 87.33% and 72.56% are obtained when training on the SNP panel and evaluating performance on the lncRNA and HD-Marker datasets, respectively. Our analysis reveals that Deqformer effectively captures hybridization patterns, making it robust for accurate predictions in various scenarios. Deqformer offers a novel perspective on the probe design pipeline, aiming to enhance efficiency and effectiveness in probe design tasks.
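A hedged sketch of the two-strand idea described above (tokenization, encoder sizes, and pooling are our own placeholders, and plain transformer encoders stand in for BERT): the probe and its reverse complement are encoded separately, and the pooled representations feed a small feed-forward network that regresses sequencing depth.

```python
import torch
import torch.nn as nn

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}
COMP = str.maketrans("ACGT", "TGCA")

def tokenize(seq):
    return torch.tensor([[BASES[b] for b in seq]])

class DepthRegressor(nn.Module):
    def __init__(self, d_model=64, nhead=4, nlayers=2):
        super().__init__()
        self.embed = nn.Embedding(4, d_model)
        self.fwd_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), nlayers)
        self.rev_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), nlayers)
        self.ffn = nn.Sequential(nn.Linear(2 * d_model, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, fwd_tokens, rev_tokens):
        f = self.fwd_enc(self.embed(fwd_tokens)).mean(dim=1)   # forward-strand summary
        r = self.rev_enc(self.embed(rev_tokens)).mean(dim=1)   # reverse-strand summary
        return self.ffn(torch.cat([f, r], dim=-1))             # predicted capture depth

probe = "ACGTTGCAGGTACCGTAGCA"
rc = probe.translate(COMP)[::-1]                               # reverse-complement strand
print(DepthRegressor()(tokenize(probe), tokenize(rc)))
```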
Subjects
Deep Learning, Long Noncoding RNA, DNA Probes/genetics, Nucleic Acid Hybridization, Genomics
ABSTRACT
Histone modifications (HMs) are pivotal in various biological processes, including transcription, replication, and DNA repair, significantly impacting chromatin structure. These modifications underpin the molecular mechanisms of cell-type-specific gene expression and complex diseases. However, annotating HMs across different cell types solely using experimental approaches is impractical due to cost and time constraints. Herein, we present dHICA (deep histone imputation using chromatin accessibility), a novel deep learning framework that integrates DNA sequences and chromatin accessibility data to predict multiple HM tracks. Employing the transformer architecture alongside dilated convolutions, dHICA boasts an extensive receptive field and captures more cell-type-specific information. dHICA outperforms state-of-the-art baselines and achieves superior performance in cell-type-specific loci and gene elements, aligning with biological expectations. Furthermore, dHICA's imputations hold significant potential for downstream applications, including chromatin-state segmentation and elucidation of the functional implications of single nucleotide polymorphisms (SNPs). In conclusion, dHICA serves as a valuable tool for advancing the understanding of chromatin dynamics, offering enhanced predictive capabilities and interpretability.
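As a generic illustration of how dilated convolutions plus a transformer widen the receptive field for multi-track prediction (made-up dimensions, not dHICA itself): the input concatenates one-hot DNA with a chromatin-accessibility track, and the output has one channel per histone-modification track.

```python
import torch
import torch.nn as nn

class DilatedHMImputer(nn.Module):
    def __init__(self, in_channels=5, d_model=64, n_tracks=6, nhead=4):
        super().__init__()
        # exponentially dilated convolutions: receptive field grows roughly as 2^k
        self.convs = nn.Sequential(*[
            nn.Sequential(nn.Conv1d(in_channels if k == 0 else d_model, d_model,
                                    kernel_size=3, dilation=2 ** k, padding=2 ** k),
                          nn.ReLU())
            for k in range(4)
        ])
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, 2)
        self.head = nn.Conv1d(d_model, n_tracks, kernel_size=1)   # one output per HM track

    def forward(self, x):               # x: (batch, 5, length) = one-hot DNA (4) + accessibility (1)
        h = self.convs(x)               # (batch, d_model, length)
        h = self.encoder(h.transpose(1, 2)).transpose(1, 2)
        return self.head(h)             # (batch, n_tracks, length) imputed HM signal

x = torch.cat([torch.randn(2, 4, 512), torch.rand(2, 1, 512)], dim=1)
print(DilatedHMImputer()(x).shape)      # torch.Size([2, 6, 512])
```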
Subjects
Chromatin, Histones, Chromatin/metabolism, Chromatin/genetics, Histones/metabolism, Histones/genetics, Humans, Single Nucleotide Polymorphism, Deep Learning, Computational Biology/methods, Histone Code
ABSTRACT
The spatial reconstruction of single-cell RNA sequencing (scRNA-seq) data into spatial transcriptomics (ST) is a rapidly evolving field that addresses the significant challenge of aligning gene expression profiles to their spatial origins within tissues. This task is complicated by the inherent batch effects and the need for precise gene expression characterization to accurately reflect spatial information. To address these challenges, we developed SELF-Former, a transformer-based framework that utilizes multi-scale structures to learn gene representations, while designing spatial correlation constraints for the reconstruction of corresponding ST data. SELF-Former excels in recovering the spatial information of ST data and effectively mitigates batch effects between scRNA-seq and ST data. A novel aspect of SELF-Former is the introduction of a gene filtration module, which significantly enhances the spatial reconstruction task by selecting genes that are crucial for accurate spatial positioning and reconstruction. The superior performance and effectiveness of SELF-Former's modules have been validated across four benchmark datasets, establishing it as a robust and effective method for spatial reconstruction tasks. SELF-Former demonstrates its capability to extract meaningful gene expression information from scRNA-seq data and accurately map it to the spatial context of real ST data. Our method represents a significant advancement in the field, offering a reliable approach for spatial reconstruction.
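The gene filtration idea can be sketched as a learnable gate over genes that keeps the most informative ones before reconstruction; this is our own generic interpretation with arbitrary sizes, not the published module.

```python
import torch
import torch.nn as nn

class GeneFiltration(nn.Module):
    """Learnable per-gene gate; genes with low gate scores are suppressed."""
    def __init__(self, n_genes, keep=256):
        super().__init__()
        self.gate_logits = nn.Parameter(torch.zeros(n_genes))
        self.keep = keep

    def forward(self, expr):                          # expr: (cells, n_genes)
        gate = torch.sigmoid(self.gate_logits)
        filtered = expr * gate                        # soft, differentiable filtering
        top = torch.topk(gate, self.keep).indices     # indices of the retained genes
        return filtered, top

expr = torch.rand(100, 2000)                          # toy scRNA-seq expression matrix
filtered, kept_genes = GeneFiltration(2000)(expr)
print(filtered.shape, kept_genes.shape)               # torch.Size([100, 2000]) torch.Size([256])
```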
Subjects
Single-Cell Analysis, Single-Cell Analysis/methods, RNA Sequence Analysis/methods, Algorithms, Gene Expression Profiling/methods, Humans, Computational Biology/methods, Transcriptome, Software
ABSTRACT
T-cell receptor (TCR) recognition of antigens is fundamental to the adaptive immune response. With the expansion of experimental techniques, a substantial database of matched TCR-antigen pairs has emerged, presenting opportunities for computational prediction models. However, accurately forecasting the binding affinities of unseen antigen-TCR pairs remains a major challenge. Here, we present convolutional-self-attention TCR (CATCR), a novel framework tailored to enhance the prediction of epitope and TCR interactions. Our approach utilizes convolutional neural networks to extract peptide features from residue contact matrices, as generated by OpenFold, and a transformer to encode segment-based coded sequences. We introduce CATCR-D, a discriminator that can assess binding by analyzing the structural and sequence features of epitopes and CDR3-β regions. Additionally, the framework comprises CATCR-G, a generative module designed for CDR3-β sequences, which applies the pretrained encoder to deduce epitope characteristics and a transformer decoder for predicting matching CDR3-β sequences. CATCR-D achieved an AUROC of 0.89 on previously unseen epitope-TCR pairs and outperformed four benchmark models by a margin of 17.4%. CATCR-G has demonstrated high precision, recall, and F1 scores, surpassing 95% in BERT (bidirectional encoder representations from transformers) score assessments. Our results indicate that CATCR is an effective tool for predicting unseen epitope-TCR interactions. Incorporating structural insights significantly enhances our understanding of the general rules governing TCR-epitope recognition. The ability to predict TCRs for novel epitopes using structural and sequence information is promising, and broadening the repository of experimental TCR-epitope data could further improve the precision of epitope-TCR binding predictions.
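A hedged sketch of the discriminator-style design described above (the contact-map CNN, token vocabulary, sizes, and pooling are assumptions, not CATCR-D itself): a small CNN summarizes the epitope residue-contact matrix, a transformer encoder summarizes the tokenized CDR3-β sequence, and the two summaries are concatenated to score binding.

```python
import torch
import torch.nn as nn

class BindingDiscriminator(nn.Module):
    def __init__(self, vocab=21, d_model=64, nhead=4):
        super().__init__()
        self.cnn = nn.Sequential(                       # encodes the epitope contact matrix
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(), nn.Linear(16 * 16, d_model))
        self.embed = nn.Embedding(vocab, d_model)       # 20 amino acids + padding token
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.seq_enc = nn.TransformerEncoder(layer, 2)  # encodes the CDR3-beta sequence
        self.cls = nn.Sequential(nn.Linear(2 * d_model, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, contact_map, cdr3_tokens):
        structure = self.cnn(contact_map)                          # (batch, d_model)
        sequence = self.seq_enc(self.embed(cdr3_tokens)).mean(1)   # (batch, d_model)
        return torch.sigmoid(self.cls(torch.cat([structure, sequence], dim=-1)))

contacts = torch.rand(4, 1, 12, 12)           # e.g. structure-derived epitope contact matrices
cdr3 = torch.randint(0, 21, (4, 15))          # tokenized CDR3-beta sequences
print(BindingDiscriminator()(contacts, cdr3).shape)   # torch.Size([4, 1])
```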
Subjects
T-Cell Antigen Receptors, T-Cell Antigen Receptors/chemistry, T-Cell Antigen Receptors/immunology, T-Cell Antigen Receptors/metabolism, T-Cell Antigen Receptors/genetics, Humans, Epitopes/chemistry, Epitopes/immunology, Computational Biology/methods, Neural Networks (Computer), T-Lymphocyte Epitopes/immunology, T-Lymphocyte Epitopes/chemistry, Antigens/chemistry, Antigens/immunology, Amino Acid Sequence
ABSTRACT
Discovering effective anti-tumor drug combinations is crucial for advancing cancer therapy. Taking full account of intricate biological interactions is highly important in accurately predicting drug synergy. However, the extremely limited prior knowledge poses great challenges in developing current computational methods. To address this, we introduce SynergyX, a multi-modality mutual attention network to improve anti-tumor drug synergy prediction. It dynamically captures cross-modal interactions, allowing for the modeling of complex biological networks and drug interactions. A convolution-augmented attention structure is adopted to integrate multi-omic data in this framework effectively. Compared with other state-of-the-art models, SynergyX demonstrates superior predictive accuracy in the General Test, the Blind Test, and cross-dataset validation. By exhaustively screening combinations of approved drugs, SynergyX reveals its ability to identify promising drug combination candidates for potential lung cancer treatment. Another notable advantage lies in its multidimensional interpretability. Taking Sorafenib and Vorinostat as an example, SynergyX serves as a powerful tool for uncovering drug-gene interactions and deciphering cell selectivity mechanisms. In summary, SynergyX provides an illuminating and interpretable framework, poised to accelerate drug synergy discovery and deepen our comprehension of rational combination therapy.
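The mutual-attention idea can be illustrated generically: drug tokens attend to cell-line omics tokens and vice versa, and the two attended summaries are combined into a synergy score; dimensions, token layouts, and the absence of the convolution-augmented blocks make this a sketch rather than SynergyX's architecture.

```python
import torch
import torch.nn as nn

class MutualAttentionSynergy(nn.Module):
    def __init__(self, d=128, nhead=4):
        super().__init__()
        self.drug_to_cell = nn.MultiheadAttention(d, nhead, batch_first=True)
        self.cell_to_drug = nn.MultiheadAttention(d, nhead, batch_first=True)
        self.head = nn.Sequential(nn.Linear(2 * d, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, drug_tokens, cell_tokens):
        # each modality queries the other ("mutual" cross-attention)
        d_attn, _ = self.drug_to_cell(drug_tokens, cell_tokens, cell_tokens)
        c_attn, _ = self.cell_to_drug(cell_tokens, drug_tokens, drug_tokens)
        pooled = torch.cat([d_attn.mean(1), c_attn.mean(1)], dim=-1)
        return self.head(pooled)                      # predicted synergy score

drug_pair = torch.randn(8, 2, 128)                    # two drug embeddings per sample
omics = torch.randn(8, 3, 128)                        # e.g. expression / mutation / CNV tokens
print(MutualAttentionSynergy()(drug_pair, omics).shape)   # torch.Size([8, 1])
```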
Subjects
Drug Discovery, Lung Neoplasms, Humans, Catalysis, Combined Modality Therapy, Research Design
ABSTRACT
Functional peptides play crucial roles in various biological processes and hold significant potential in many fields such as drug discovery and biotechnology. Accurately predicting the functions of peptides is essential for understanding their diverse effects and designing peptide-based therapeutics. Here, we propose CELA-MFP, a deep learning framework that incorporates feature Contrastive Enhancement and Label Adaptation for predicting Multi-Functional therapeutic Peptides. CELA-MFP utilizes a protein language model (pLM) to extract features from peptide sequences, which are then fed into a Transformer decoder for function prediction, effectively modeling correlations between different functions. To enhance the representation of each peptide sequence, contrastive learning is employed during training. Experimental results demonstrate that CELA-MFP outperforms state-of-the-art methods on most evaluation metrics for two widely used datasets, MFBP and MFTP. The interpretability of CELA-MFP is demonstrated by visualizing attention patterns in the pLM and the Transformer decoder. Finally, a user-friendly online server implementing the proposed CELA-MFP is freely accessible at http://dreamai.cmii.online/CELA-MFP.
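A hedged sketch of how a Transformer decoder can model label correlations on top of pLM features (our interpretation, with placeholder sizes and label count): learnable label queries cross-attend to the peptide's residue embeddings, and each query emits a probability for its function.

```python
import torch
import torch.nn as nn

class MultiFunctionHead(nn.Module):
    def __init__(self, n_labels=21, d_model=128, nhead=4):   # n_labels is a placeholder count
        super().__init__()
        self.label_queries = nn.Parameter(torch.randn(n_labels, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, 1)

    def forward(self, peptide_features):               # (batch, residues, d_model) from a pLM
        queries = self.label_queries.expand(peptide_features.size(0), -1, -1)
        decoded = self.decoder(tgt=queries, memory=peptide_features)
        return torch.sigmoid(self.out(decoded)).squeeze(-1)   # (batch, n_labels) probabilities

features = torch.randn(4, 30, 128)                     # e.g. projected pLM embeddings for 30 residues
print(MultiFunctionHead()(features).shape)             # torch.Size([4, 21])
```

The self-attention among label queries is what lets the head capture correlations between functions, while cross-attention ties each label to the peptide representation.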
Subjects
Deep Learning, Peptides, Peptides/chemistry, Computational Biology/methods, Software, Humans, Algorithms, Protein Databases
ABSTRACT
Recent advancements in spatial imaging technologies have revolutionized the acquisition of high-resolution multichannel images, gene expressions, and spatial locations at the single-cell level. Our study introduces xSiGra, an interpretable graph-based AI model designed to elucidate interpretable features of identified spatial cell types by harnessing multimodal features from spatial imaging technologies. By constructing a spatial cellular graph with immunohistology images and gene expression as node attributes, xSiGra employs hybrid graph transformer models to delineate spatial cell types. Additionally, xSiGra integrates a novel variant of gradient-weighted class activation mapping to uncover interpretable features, including pivotal genes and cells for various cell types, thereby facilitating deeper biological insights from spatial data. Through rigorous benchmarking against existing methods, xSiGra demonstrates superior performance across diverse spatial imaging datasets. Application of xSiGra to a lung tumor slice unveils cell importance scores, illustrating that cellular activity is not determined solely by the cell itself but is also impacted by neighboring cells. Moreover, leveraging the identified interpretable genes, xSiGra reveals an endothelial cell subset interacting with tumor cells, indicating heterogeneous underlying mechanisms within complex cellular interactions.
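The gradient-weighted class-activation idea carries over to graphs in a simple way; the toy example below (a linear encoder and one averaging message-passing step, not xSiGra's hybrid graph transformer) scores each cell by the gradient-weighted activation of its hidden features for a chosen cell-type class.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_cells, in_dim, hidden, n_types = 50, 32, 16, 5
features = torch.randn(n_cells, in_dim)               # multimodal node attributes (image + expression)
adj = (torch.rand(n_cells, n_cells) < 0.1).float()    # toy spatial neighbourhood graph

encoder = nn.Linear(in_dim, hidden)
classifier = nn.Linear(hidden, n_types)

# one message-passing step: average neighbour features, then encode
hidden_act = torch.relu(encoder((adj @ features) / adj.sum(1, keepdim=True).clamp(min=1)))
hidden_act.retain_grad()
logits = classifier(hidden_act)

target_class = 2
logits[:, target_class].sum().backward()              # gradients of the class score w.r.t. activations

# Grad-CAM-style node importance: channel weights = mean gradient, then weighted activation sum
weights = hidden_act.grad.mean(dim=0)                 # (hidden,)
importance = torch.relu((hidden_act * weights).sum(dim=1))
print(importance.topk(5).indices)                     # the five most influential cells for this cell type
```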
Subjects
Single-Cell Analysis, Single-Cell Analysis/methods, Humans, Algorithms, Lung Neoplasms/genetics, Lung Neoplasms/pathology, Lung Neoplasms/metabolism, Computational Biology/methods
ABSTRACT
The functional study of proteins is a critical task in modern biology, playing a pivotal role in understanding the mechanisms of pathogenesis, developing new drugs, and discovering novel drug targets. However, existing computational models for subcellular localization face significant challenges, such as reliance on known Gene Ontology (GO) annotation databases or overlooking the relationship between GO annotations and subcellular localization. To address these issues, we propose DeepMTC, an end-to-end deep learning-based multi-task collaborative training model. DeepMTC integrates the interrelationship between subcellular localization and the functional annotation of proteins, leveraging multi-task collaborative training to eliminate dependence on known GO databases. This strategy gives DeepMTC a distinct advantage in predicting newly discovered proteins without prior functional annotations. First, DeepMTC leverages a highly accurate pre-trained language model to obtain the 3D structure and sequence features of proteins. Additionally, it employs a graph transformer module to encode protein sequence features, addressing the problem of long-range dependencies in graph neural networks. Finally, DeepMTC uses a functional cross-attention mechanism to efficiently combine upstream learned functional features to perform the subcellular localization task. The experimental results demonstrate that DeepMTC outperforms state-of-the-art models in both protein function prediction and subcellular localization. Moreover, interpretability experiments revealed that DeepMTC can accurately identify the key residues and functional domains of proteins, confirming its superior performance. The code and dataset of DeepMTC are freely available at https://github.com/ghli16/DeepMTC.
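A minimal sketch of the multi-task coupling described above (our own simplification with placeholder dimensions and label counts): function logits come from pooled residue features, while a learnable localization query cross-attends to those same features so the localization head can reuse what the function branch has learned.

```python
import torch
import torch.nn as nn

class MultiTaskProteinHead(nn.Module):
    def __init__(self, d=128, n_go_terms=64, n_compartments=10, nhead=4):
        super().__init__()
        self.go_head = nn.Linear(d, n_go_terms)                    # function (GO) prediction branch
        self.loc_query = nn.Parameter(torch.randn(1, 1, d))
        self.cross = nn.MultiheadAttention(d, nhead, batch_first=True)
        self.loc_head = nn.Linear(d, n_compartments)               # subcellular localization branch

    def forward(self, residue_feats):                              # (batch, length, d) from an upstream encoder
        go_logits = self.go_head(residue_feats.mean(dim=1))
        query = self.loc_query.expand(residue_feats.size(0), -1, -1)
        attended, _ = self.cross(query, residue_feats, residue_feats)   # localization attends to functional features
        loc_logits = self.loc_head(attended.squeeze(1))
        return go_logits, loc_logits

feats = torch.randn(2, 120, 128)
go, loc = MultiTaskProteinHead()(feats)
print(go.shape, loc.shape)      # torch.Size([2, 64]) torch.Size([2, 10])
```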
Subjects
Deep Learning, Proteins, Proteins/metabolism, Computational Biology/methods, Protein Databases, Neural Networks (Computer), Humans, Gene Ontology
ABSTRACT
De novo peptide sequencing is a promising approach for novel peptide discovery, as highlighted by the performance improvements of state-of-the-art models. The quality of mass spectra often varies due to the unexpected absence of certain ions, presenting a significant challenge for de novo peptide sequencing. Here, we use a novel concept of complementary spectra to enhance the ion information of the experimental spectrum and demonstrate it through conceptual and practical analyses. Afterward, we design suitable encoders to encode the experimental spectrum and the corresponding complementary spectrum and propose a de novo sequencing model, π-HelixNovo, based on the Transformer architecture. We first demonstrated that π-HelixNovo outperforms other state-of-the-art models using a series of comparative experiments. Then, we utilized π-HelixNovo to de novo sequence gut metaproteome peptides for the first time. The results show that π-HelixNovo increases the identification coverage and accuracy of the gut metaproteome and enhances its taxonomic resolution. We finally trained a more powerful π-HelixNovo using a larger training dataset and, as expected, π-HelixNovo achieves unprecedented performance, even for peptide-spectrum matches with never-before-seen peptide sequences. We also used the powerful π-HelixNovo to identify antibody peptides and multi-enzyme-cleavage peptides, and π-HelixNovo is highly robust in these applications. Our results demonstrate the effectiveness of the complementary spectrum and take a significant step forward in de novo peptide sequencing.
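The complementary-spectrum concept has a concrete arithmetic core: for a precursor of known neutral mass, every observed fragment peak implies the mass of its complementary fragment, so missing b-ions can be inferred from observed y-ions and vice versa. The snippet below is a generic illustration assuming singly charged peaks and made-up masses, not the paper's encoder.

```python
PROTON = 1.00728  # Da, monoisotopic proton mass

def complementary_spectrum(peaks_mz, precursor_mass):
    """For singly charged fragments, a peak with neutral mass m implies a complementary
    fragment of neutral mass (precursor_mass - m); e.g. a y-ion implies the matching b-ion."""
    complementary = []
    for mz in peaks_mz:
        neutral = mz - PROTON                           # neutral fragment mass
        complementary.append(precursor_mass - neutral + PROTON)
    return sorted(complementary)

precursor_mass = 800.40                                 # neutral peptide mass (Da), made-up example
observed = [147.11, 276.16, 389.24, 502.33]             # observed singly charged fragment peaks
print(complementary_spectrum(observed, precursor_mass)) # peaks the spectrum "should" also contain
```

Concatenating or jointly encoding the observed and complementary peak lists is what restores ion information that the instrument failed to record.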
Subjects
Protein Sequence Analysis, Tandem Mass Spectrometry, Tandem Mass Spectrometry/methods, Protein Sequence Analysis/methods, Peptides, Amino Acid Sequence, Antibodies, Algorithms
ABSTRACT
Despite advanced diagnostics, 3%-5% of cases remain classified as cancer of unknown primary (CUP). DNA methylation, an important epigenetic feature, is essential for determining the origin of metastatic tumors. We present PathMethy, a novel Transformer model integrated with functional categories and crosstalk of pathways, to accurately trace the origin of tumors in CUP samples based on DNA methylation. PathMethy outperformed seven competing methods in F1-score across nine cancer datasets and accurately predicted the molecular subtypes within nine primary tumor types. It not only excelled at tracing the origins of both primary and metastatic tumors but also demonstrated a high degree of agreement with previously diagnosed sites in cases of CUP. PathMethy provided biological insights by highlighting key pathways, functional categories, and their interactions. Using functional categories of pathways, we gained a global understanding of biological processes. For broader access, a user-friendly web server for researchers and clinicians is available at https://cup.pathmethy.com.
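A generic sketch of pathway-level modeling (our own, with random gene-to-pathway assignments standing in for curated gene sets): per-gene methylation values are averaged into one token per pathway, and a transformer encoder over pathway tokens feeds a tissue-of-origin classifier.

```python
import torch
import torch.nn as nn

class PathwayTransformer(nn.Module):
    def __init__(self, n_genes=5000, n_pathways=300, d_model=64, n_classes=9, nhead=4):
        super().__init__()
        # binary gene-to-pathway membership (random here; in practice from curated gene sets)
        self.register_buffer("membership", (torch.rand(n_pathways, n_genes) < 0.02).float())
        self.proj = nn.Linear(1, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, 2)
        self.cls = nn.Linear(d_model, n_classes)

    def forward(self, methylation):                        # (batch, n_genes) beta values
        counts = self.membership.sum(dim=1, keepdim=True).clamp(min=1)
        pathway_scores = methylation @ self.membership.T / counts.T   # mean methylation per pathway
        tokens = self.proj(pathway_scores.unsqueeze(-1))   # (batch, n_pathways, d_model)
        pooled = self.encoder(tokens).mean(dim=1)
        return self.cls(pooled)                            # logits over candidate tissues of origin

meth = torch.rand(4, 5000)
print(PathwayTransformer()(meth).shape)                    # torch.Size([4, 9])
```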
Subjects
DNA Methylation, Neoplasms, Humans, Neoplasms/genetics, Software, Artificial Intelligence, Computational Biology/methods, Algorithms, Genetic Epigenesis
ABSTRACT
Increasing volumes of biomedical data are amassing in databases. Large-scale analyses of these data have wide-ranging applications in biology and medicine. Such analyses require tools to characterize and process entries at scale. However, existing tools, mainly centered on extracting predefined fields, often fail to comprehensively process database entries or correct evident errors, a task humans can easily perform. These tools also lack the ability to reason like domain experts, hindering their robustness and analytical depth. Recent advances with large language models (LLMs) provide a fundamentally new way to query databases. But while a tool such as ChatGPT is adept at answering questions about manually input records, challenges arise when scaling up this process. First, interactions with the LLM need to be automated. Second, limitations on input length may require a record pruning or summarization pre-processing step. Third, to behave reliably as desired, the LLM needs either well-designed, short, 'few-shot' examples, or fine-tuning based on a larger set of well-curated examples. Here, we report ChIP-GPT, which is based on fine-tuning of the generative pre-trained transformer (GPT) model Llama and on a program that prompts the model iteratively and handles its generation of answer text. This model is designed to extract metadata from the Sequence Read Archive, emphasizing the identification of chromatin immunoprecipitation (ChIP) targets and cell lines. When trained with 100 examples, ChIP-GPT demonstrates 90-94% accuracy. Notably, it can seamlessly extract data from records with typos or absent field labels. Our proposed method is easily adaptable to customized questions and different databases.
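A rough sketch of such an iterative prompting workflow using the Hugging Face transformers API; the checkpoint name, questions, prompt wording, truncation strategy, and example record are placeholders, not ChIP-GPT itself, and a real fine-tuned Llama checkpoint would be needed to run it.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "your-finetuned-llama-checkpoint"     # placeholder: a Llama model fine-tuned on curated examples
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

QUESTIONS = [
    "What protein was targeted by the ChIP experiment?",
    "What cell line or tissue was used?",
]

def ask(record_text, question, max_record_chars=4000):
    # crude pruning step so the record fits in the model's context window
    prompt = (f"SRA record:\n{record_text[:max_record_chars]}\n\n"
              f"Question: {question}\nAnswer:")
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=32, do_sample=False)
    answer = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    lines = answer.strip().splitlines()
    return lines[0] if lines else ""               # keep only the first answer line

# toy record text standing in for a real Sequence Read Archive entry
record = ("Title: H3K27ac ChIP-seq in HepG2 cells; Library strategy: ChIP-Seq; "
          "Sample: human hepatocellular carcinoma cell line")
metadata = {q: ask(record, q) for q in QUESTIONS}  # iterate the model over each question
print(metadata)
```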