Results 1 - 20 of 2,435
1.
Cell ; 186(7): 1493-1511.e40, 2023 03 30.
Article in English | MEDLINE | ID: mdl-37001506

ABSTRACT

Understanding how genetic variants impact molecular phenotypes is a key goal of functional genomics, currently hindered by reliance on a single haploid reference genome. Here, we present the EN-TEx resource of 1,635 open-access datasets from four donors (∼30 tissues × ∼15 assays). The datasets are mapped to matched, diploid genomes with long-read phasing and structural variants, instantiating a catalog of >1 million allele-specific loci. These loci exhibit coordinated activity along haplotypes and are less conserved than corresponding, non-allele-specific ones. Surprisingly, a deep-learning transformer model can predict the allele-specific activity based only on local nucleotide-sequence context, highlighting the importance of transcription-factor-binding motifs particularly sensitive to variants. Furthermore, combining EN-TEx with existing genome annotations reveals strong associations between allele-specific and GWAS loci. It also enables models for transferring known eQTLs to difficult-to-profile tissues (e.g., from skin to heart). Overall, EN-TEx provides rich data and generalizable models for more accurate personal functional genomics.


Subject(s)
Epigenome , Quantitative Trait Loci , Genome-Wide Association Study , Genomics , Phenotype , Polymorphism, Single Nucleotide
2.
Proc Natl Acad Sci U S A ; 121(27): e2311807121, 2024 Jul 02.
Article in English | MEDLINE | ID: mdl-38913893

ABSTRACT

Machine learning has been proposed as an alternative to theoretical modeling when dealing with complex problems in biological physics. However, in this perspective, we argue that a more successful approach is a proper combination of these two methodologies. We discuss how ideas coming from physical modeling of neuronal processing led to early formulations of computational neural networks, e.g., Hopfield networks. We then show how modern learning approaches like Potts models, Boltzmann machines, and the transformer architecture are related to each other, specifically, through a shared energy representation. We summarize recent efforts to establish these connections and provide examples of how each of these formulations integrating physical modeling and machine learning has been successful in tackling recent problems in biomolecular structure, dynamics, function, evolution, and design. Instances include protein structure prediction; improvement in computational complexity and accuracy of molecular dynamics simulations; better inference of the effects of mutations in proteins, leading to improved evolutionary modeling; and finally how machine learning is revolutionizing protein engineering and design. Going beyond naturally existing protein sequences, a connection to protein design is discussed where synthetic sequences are able to fold to naturally occurring motifs driven by a model rooted in physical principles. We show that this model is "learnable" and propose its future use in the generation of unique sequences that can fold into a target structure.


Subject(s)
Machine Learning , Neural Networks, Computer , Proteins , Proteins/chemistry , Proteins/metabolism , Protein Engineering/methods , Molecular Dynamics Simulation
3.
Proc Natl Acad Sci U S A ; 121(3): e2308812120, 2024 Jan 16.
Article in English | MEDLINE | ID: mdl-38190540

ABSTRACT

Aging in an individual refers to the temporal change, mostly decline, in the body's ability to meet physiological demands. Biological age (BA) is a biomarker of chronological aging and can be used to stratify populations to predict certain age-related chronic diseases. BA can be predicted from biomedical features such as brain MRI, retinal, or facial images, but the inherent heterogeneity in the aging process limits the usefulness of BA predicted from individual body systems. In this paper, we developed a multimodal Transformer-based architecture with cross-attention which was able to combine facial, tongue, and retinal images to estimate BA. We trained our model using facial, tongue, and retinal images from 11,223 healthy subjects and demonstrated that using a fusion of the three image modalities achieved the most accurate BA predictions. We validated our approach on a test population of 2,840 individuals with six chronic diseases and obtained a significantly larger difference between chronological age and BA (AgeDiff) than in healthy subjects. We showed that AgeDiff has the potential to be utilized as a standalone biomarker or in conjunction with other known factors for risk stratification and progression prediction of chronic diseases. Our results therefore highlight the feasibility of using multimodal images to estimate and interrogate the aging process.


Subject(s)
Aging , Electric Power Supplies , Humans , Face , Biomarkers , Chronic Disease
4.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38725156

ABSTRACT

Protein acetylation is one of the most extensively studied post-translational modifications (PTMs) due to its significant roles across a myriad of biological processes. Although many computational tools for acetylation site identification have been developed, there is a lack of benchmark datasets and bespoke predictors for non-histone acetylation site prediction. To address these problems, we have contributed to both dataset creation and predictor benchmarking in this study. First, we construct a non-histone acetylation site benchmark dataset, namely NHAC, which includes 11 subsets according to the sequence length ranging from 11 to 61 amino acids. In total, there are 886 positive samples and 4707 negative samples for each sequence length. Second, we propose TransPTM, a transformer-based neural network model for non-histone acetylation site prediction. During the data representation phase, per-residue contextualized embeddings are extracted using ProtT5 (an existing pre-trained protein language model). This is followed by the implementation of a graph neural network framework, which consists of three TransformerConv layers for feature extraction and a multilayer perceptron module for classification. The benchmark results show that TransPTM achieves competitive performance for non-histone acetylation site prediction relative to three state-of-the-art tools. It improves our understanding of the PTM mechanism and provides a theoretical basis for developing drug targets for diseases. Moreover, the created PTM dataset fills the gap in non-histone acetylation site datasets and is beneficial to the related communities. The related source code and data utilized by TransPTM are accessible at https://www.github.com/TransPTM/TransPTM.


Subject(s)
Neural Networks, Computer , Protein Processing, Post-Translational , Acetylation , Computational Biology/methods , Databases, Protein , Software , Algorithms , Humans , Proteins/chemistry , Proteins/metabolism
5.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38701415

ABSTRACT

N4-acetylcytidine (ac4C) is a modification found in ribonucleic acid (RNA) related to diseases. Expensive and labor-intensive methods hindered the exploration of ac4C mechanisms and the development of specific anti-ac4C drugs. Therefore, an advanced prediction model for ac4C in RNA is urgently needed. Despite the construction of various prediction models, several limitations exist: (1) insufficient resolution at base level for ac4C sites; (2) lack of information on species other than Homo sapiens; (3) lack of information on RNA other than mRNA; and (4) lack of interpretation for each prediction. In light of these limitations, we have reconstructed the previous benchmark dataset and introduced a new dataset including balanced RNA sequences from multiple species and RNA types, while also providing base-level resolution for ac4C sites. Additionally, we have proposed a novel transformer-based architecture and pipeline for predicting ac4C sites, allowing for highly accurate predictions, visually interpretable results and no restrictions on the length of input RNA sequences. Statistically, our work has improved the accuracy of predicting specific ac4C sites in multiple species from less than 40% to around 85%, achieving a high AUC > 0.9. These results significantly surpass the performance of all existing models.


Subject(s)
Cytidine , Cytidine/analogs & derivatives , RNA , Cytidine/genetics , RNA/genetics , RNA/chemistry , Humans , Computational Biology/methods , Animals , Software , Algorithms
6.
Brief Bioinform ; 25(4)2024 May 23.
Article in English | MEDLINE | ID: mdl-38990515

ABSTRACT

Accurate prediction of molecular properties is fundamental in drug discovery and development, providing crucial guidance for effective drug design. A critical factor in achieving accurate molecular property prediction lies in the appropriate representation of molecular structures. Presently, prevalent deep learning-based molecular representations rely on 2D structure information as the primary molecular representation, often overlooking essential three-dimensional (3D) conformational information due to the inherent limitations of 2D structures in conveying atomic spatial relationships. In this study, we propose employing the Gram matrix as a condensed representation of 3D molecular structures and for efficient pretraining objectives. Subsequently, we leverage this matrix to construct a novel molecular representation model, Pre-GTM, which inherently encapsulates 3D information. The model accurately predicts the 3D structure of a molecule by estimating the Gram matrix. Our findings demonstrate that Pre-GTM model outperforms the baseline Graphormer model and other pretrained models in the QM9 and MoleculeNet quantitative property prediction task. The integration of the Gram matrix as a condensed representation of 3D molecular structure, incorporated into the Pre-GTM model, opens up promising avenues for its potential application across various domains of molecular research, including drug design, materials science, and chemical engineering.
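
As a rough illustration of why a Gram matrix can serve as a condensed stand-in for 3D coordinates, the sketch below builds G = X Xᵀ from centered atom positions and recovers every pairwise squared distance from G alone, using d_ij² = G_ii + G_jj - 2 G_ij. This is not the Pre-GTM implementation: the function names and the three-atom geometry are hypothetical, chosen only for the example.

```python
from itertools import product
from math import dist, isclose


def center(coords):
    """Translate coordinates so that their centroid sits at the origin."""
    n = len(coords)
    centroid = [sum(p[k] for p in coords) / n for k in range(3)]
    return [[p[k] - centroid[k] for k in range(3)] for p in coords]


def gram_matrix(coords):
    """Return G with G[i][j] = <x_i, x_j> for centered atom positions x_i."""
    x = center(coords)
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in x] for xi in x]


def squared_distance_from_gram(g, i, j):
    """Recover |x_i - x_j|^2 from Gram entries alone."""
    return g[i][i] + g[j][j] - 2.0 * g[i][j]


# Hypothetical three-atom geometry (illustrative values only).
atoms = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]
G = gram_matrix(atoms)

# Every pairwise squared distance is recoverable from G.
for i, j in product(range(3), repeat=2):
    expected = dist(atoms[i], atoms[j]) ** 2
    assert isclose(squared_distance_from_gram(G, i, j), expected, abs_tol=1e-9)
```

Because inner products are unchanged by rotation, and centering removes translation, the same G describes every rigid placement of the molecule, which is one plausible reason such a matrix is convenient as a prediction target.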


Subject(s)
Molecular Conformation , Models, Molecular , Drug Design , Deep Learning , Drug Discovery , Algorithms
7.
Brief Bioinform ; 25(2)2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38436557

ABSTRACT

Spatial transcriptomics technologies have shed light on the complexities of tissue structures by accurately mapping spatial microenvironments. Nonetheless, a myriad of methods, especially those utilized in platforms like Visium, often relinquish spatial details owing to intrinsic resolution limitations. In response, we introduce TransformerST, an innovative, unsupervised model anchored in the Transformer architecture, which operates independently of references, thereby ensuring cost-efficiency by circumventing the need for single-cell RNA sequencing. TransformerST not only elevates Visium data from a multicellular level to a single-cell granularity but also showcases adaptability across diverse spatial transcriptomics platforms. By employing a vision transformer-based encoder, it discerns latent image-gene expression co-representations and is further enhanced by spatial correlations, derived from an adaptive graph Transformer module. The sophisticated cross-scale graph network, utilized in super-resolution, significantly boosts the model's accuracy, unveiling complex structure-functional relationships within histology images. Empirical evaluations validate its adeptness in revealing tissue subtleties at the single-cell scale. Crucially, TransformerST adeptly navigates through image-gene co-representation, maximizing the synergistic utility of gene expression and histology images, thereby emerging as a pioneering tool in spatial transcriptomics. It not only enhances resolution to a single-cell level but also introduces a novel approach that optimally utilizes histology images alongside gene expression, providing a refined lens for investigating spatial transcriptomics.


Subject(s)
Gene Expression Profiling , Gene Expression
8.
Brief Bioinform ; 25(5)2024 Jul 25.
Article in English | MEDLINE | ID: mdl-39073829

ABSTRACT

With the exponential growth of digital data, there is a pressing need for innovative storage media and techniques. DNA molecules, due to their stability, storage capacity, and density, offer a promising solution for information storage. However, DNA storage also faces numerous challenges, such as complex biochemical constraints and encoding efficiency. This paper presents Explorer, a high-efficiency DNA coding algorithm based on the De Bruijn graph, which leverages its capability to characterize local sequences. Explorer enables coding under various biochemical constraints, such as homopolymers, GC content, and undesired motifs. This paper also introduces Codeformer, a fast decoding algorithm based on the transformer architecture, to further enhance decoding efficiency. Numerical experiments indicate that, compared with other advanced algorithms, Explorer not only achieves stable encoding and decoding under various biochemical constraints but also increases the encoding efficiency and bit rate by ≥10%. Additionally, Codeformer demonstrates the ability to efficiently decode large quantities of DNA sequences. Under different parameter settings, its decoding efficiency exceeds that of traditional algorithms by more than two-fold. When Codeformer is combined with Reed-Solomon code, its decoding accuracy exceeds 99%, making it a good choice for high-speed decoding applications. These advancements are expected to contribute to the development of DNA-based data storage systems and the broader exploration of DNA as a novel information storage medium.
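
To make the three biochemical constraints named above concrete, here is a minimal sketch of a validity check for a candidate DNA strand. It is not the Explorer algorithm (which encodes via a De Bruijn graph); the function names, the default thresholds (`max_run=3`, GC content between 40% and 60%), and the example motif are assumptions chosen for illustration.

```python
import re


def max_homopolymer_run(seq):
    """Length of the longest run of a single repeated base."""
    return max((len(m.group(0)) for m in re.finditer(r"(.)\1*", seq)), default=0)


def gc_content(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def satisfies_constraints(seq, max_run=3, gc_range=(0.4, 0.6), forbidden=()):
    """Check homopolymer length, GC content, and undesired motifs.

    The default thresholds are illustrative assumptions, not values
    taken from the paper.
    """
    if max_homopolymer_run(seq) > max_run:
        return False
    low, high = gc_range
    if not low <= gc_content(seq) <= high:
        return False
    return not any(motif in seq for motif in forbidden)


print(satisfies_constraints("ACGTACGT"))                       # balanced, no long runs
print(satisfies_constraints("ACGTTTTA"))                       # contains a run of four T
print(satisfies_constraints("ACGTACGT", forbidden=("GTAC",)))  # contains a forbidden motif
```

A constrained encoder would typically only emit strands for which such a check passes; how Explorer guarantees that during encoding is described in the paper itself, not here.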


Subject(s)
Algorithms , DNA , DNA/genetics , DNA/chemistry , Software , Sequence Analysis, DNA/methods , Computational Biology/methods
9.
Brief Bioinform ; 25(2)2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38305453

ABSTRACT

Target enrichment sequencing techniques are gaining widespread use in the field of genomics, prized for their economic efficiency and swift processing times. However, their success depends on the performance of probes and the evenness of sequencing depth among each probe. To accurately predict probe coverage depth, a model called Deqformer is proposed in this study. Deqformer utilizes the oligonucleotide sequence of each probe, drawing inspiration from Watson-Crick base pairing and incorporating two BERT encoders to capture the underlying information from the forward and reverse probe strands, respectively. The encoded data are combined with a feed-forward network to make precise predictions of sequencing depth. The performance of Deqformer is evaluated on four different datasets: SNP panel with 38 200 probes, lncRNA panel with 2000 probes, synthetic panel with 5899 probes and HD-Marker panel for Yesso scallop with 11 000 probes. The SNP and synthetic panels achieve impressive factor 3 of accuracy (F3acc) of 96.24% and 99.66% in 5-fold cross-validation. F3acc rates of over 87.33% and 72.56% are obtained when training on the SNP panel and evaluating performance on the lncRNA and HD-Marker datasets, respectively. Our analysis reveals that Deqformer effectively captures hybridization patterns, making it robust for accurate predictions in various scenarios. Deqformer offers a novel perspective for the probe design pipeline, aiming to enhance efficiency and effectiveness in probe design tasks.
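
The forward and reverse strand inputs mentioned above can be pictured with a small sketch of Watson-Crick pairing. The assumption here, not confirmed by the abstract, is that the reverse strand is the reverse complement of the probe; `reverse_complement` and `strand_pair` are hypothetical helper names, and the tokenization and BERT encoders are out of scope.

```python
# Watson-Crick pairing: A<->T and C<->G.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def strand_pair(probe):
    """Return (forward, reverse) strands that two encoders could consume.

    Treating the reverse strand as the reverse complement is an
    assumption made for this illustration.
    """
    forward = probe.upper()
    return forward, reverse_complement(forward)


forward, reverse = strand_pair("aacg")
print(forward, reverse)
```

Applying `reverse_complement` twice returns the original sequence, which is a quick sanity check when preparing paired inputs.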


Subject(s)
Deep Learning , RNA, Long Noncoding , DNA Probes/genetics , Nucleic Acid Hybridization , Genomics
10.
Brief Bioinform ; 25(4)2024 May 23.
Article in English | MEDLINE | ID: mdl-39038935

ABSTRACT

Functional peptides play crucial roles in various biological processes and hold significant potential in many fields such as drug discovery and biotechnology. Accurately predicting the functions of peptides is essential for understanding their diverse effects and designing peptide-based therapeutics. Here, we propose CELA-MFP, a deep learning framework that incorporates feature Contrastive Enhancement and Label Adaptation for predicting Multi-Functional therapeutic Peptides. CELA-MFP utilizes a protein language model (pLM) to extract features from peptide sequences, which are then fed into a Transformer decoder for function prediction, effectively modeling correlations between different functions. To enhance the representation of each peptide sequence, contrastive learning is employed during training. Experimental results demonstrate that CELA-MFP outperforms state-of-the-art methods on most evaluation metrics for two widely used datasets, MFBP and MFTP. The interpretability of CELA-MFP is demonstrated by visualizing attention patterns in pLM and Transformer decoder. Finally, a user-friendly online server for predicting multi-functional peptides is established as the implementation of the proposed CELA-MFP and can be freely accessed at http://dreamai.cmii.online/CELA-MFP.


Subject(s)
Deep Learning , Peptides , Peptides/chemistry , Computational Biology/methods , Software , Humans , Algorithms , Databases, Protein
11.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38711371

ABSTRACT

T-cell receptor (TCR) recognition of antigens is fundamental to the adaptive immune response. With the expansion of experimental techniques, a substantial database of matched TCR-antigen pairs has emerged, presenting opportunities for computational prediction models. However, accurately forecasting the binding affinities of unseen antigen-TCR pairs remains a major challenge. Here, we present convolutional-self-attention TCR (CATCR), a novel framework tailored to enhance the prediction of epitope and TCR interactions. Our approach utilizes convolutional neural networks to extract peptide features from residue contact matrices, as generated by OpenFold, and a transformer to encode segment-based coded sequences. We introduce CATCR-D, a discriminator that can assess binding by analyzing the structural and sequence features of epitopes and CDR3-β regions. Additionally, the framework comprises CATCR-G, a generative module designed for CDR3-β sequences, which applies the pretrained encoder to deduce epitope characteristics and a transformer decoder for predicting matching CDR3-β sequences. CATCR-D achieved an AUROC of 0.89 on previously unseen epitope-TCR pairs and outperformed four benchmark models by a margin of 17.4%. CATCR-G has demonstrated high precision, recall and F1 scores, surpassing 95% in bidirectional encoder representations from transformers score assessments. Our results indicate that CATCR is an effective tool for predicting unseen epitope-TCR interactions. Incorporating structural insights enhances our understanding of the general rules governing TCR-epitope recognition significantly. The ability to predict TCRs for novel epitopes using structural and sequence information is promising, and broadening the repository of experimental TCR-epitope data could further improve the precision of epitope-TCR binding predictions.


Subject(s)
Receptors, Antigen, T-Cell , Receptors, Antigen, T-Cell/chemistry , Receptors, Antigen, T-Cell/immunology , Receptors, Antigen, T-Cell/metabolism , Receptors, Antigen, T-Cell/genetics , Humans , Epitopes/chemistry , Epitopes/immunology , Computational Biology/methods , Neural Networks, Computer , Epitopes, T-Lymphocyte/immunology , Epitopes, T-Lymphocyte/chemistry , Antigens/chemistry , Antigens/immunology , Amino Acid Sequence
12.
Brief Bioinform ; 25(6)2024 Sep 23.
Article in English | MEDLINE | ID: mdl-39316943

ABSTRACT

Histone modifications (HMs) are pivotal in various biological processes, including transcription, replication, and DNA repair, significantly impacting chromatin structure. These modifications underpin the molecular mechanisms of cell-type-specific gene expression and complex diseases. However, annotating HMs across different cell types solely using experimental approaches is impractical due to cost and time constraints. Herein, we present dHICA (deep histone imputation using chromatin accessibility), a novel deep learning framework that integrates DNA sequences and chromatin accessibility data to predict multiple HM tracks. Employing the transformer architecture alongside dilated convolutions, dHICA boasts an extensive receptive field and captures more cell-type-specific information. dHICA outperforms state-of-the-art baselines and achieves superior performance in cell-type-specific loci and gene elements, aligning with biological expectations. Furthermore, dHICA's imputations hold significant potential for downstream applications, including chromatin state segmentation and elucidating the functional implications of SNPs (Single Nucleotide Polymorphisms). In conclusion, dHICA serves as a valuable tool for advancing the understanding of chromatin dynamics, offering enhanced predictive capabilities and interpretability.


Subject(s)
Chromatin , Histones , Chromatin/metabolism , Chromatin/genetics , Histones/metabolism , Histones/genetics , Humans , Polymorphism, Single Nucleotide , Deep Learning , Computational Biology/methods , Histone Code
13.
Brief Bioinform ; 25(5)2024 Jul 25.
Article in English | MEDLINE | ID: mdl-39120644

ABSTRACT

Recent advancements in spatial imaging technologies have revolutionized the acquisition of high-resolution multichannel images, gene expressions, and spatial locations at the single-cell level. Our study introduces xSiGra, an interpretable graph-based AI model, designed to elucidate interpretable features of identified spatial cell types, by harnessing multimodal features from spatial imaging technologies. By constructing a spatial cellular graph with immunohistology images and gene expression as node attributes, xSiGra employs hybrid graph transformer models to delineate spatial cell types. Additionally, xSiGra integrates a novel variant of gradient-weighted class activation mapping component to uncover interpretable features, including pivotal genes and cells for various cell types, thereby facilitating deeper biological insights from spatial data. Through rigorous benchmarking against existing methods, xSiGra demonstrates superior performance across diverse spatial imaging datasets. Application of xSiGra on a lung tumor slice unveils the importance score of cells, illustrating that cellular activity is not solely determined by itself but also impacted by neighboring cells. Moreover, leveraging the identified interpretable genes, xSiGra reveals endothelial cell subset interacting with tumor cells, indicating its heterogeneous underlying mechanisms within complex cellular interactions.


Subject(s)
Single-Cell Analysis , Single-Cell Analysis/methods , Humans , Algorithms , Lung Neoplasms/genetics , Lung Neoplasms/pathology , Lung Neoplasms/metabolism , Computational Biology/methods
14.
Brief Bioinform ; 25(2)2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38340091

ABSTRACT

Discovering effective anti-tumor drug combinations is crucial for advancing cancer therapy. Taking full account of intricate biological interactions is highly important in accurately predicting drug synergy. However, the extremely limited prior knowledge poses great challenges in developing current computational methods. To address this, we introduce SynergyX, a multi-modality mutual attention network to improve anti-tumor drug synergy prediction. It dynamically captures cross-modal interactions, allowing for the modeling of complex biological networks and drug interactions. A convolution-augmented attention structure is adopted to integrate multi-omic data in this framework effectively. Compared with other state-of-the-art models, SynergyX demonstrates superior predictive accuracy in both the General Test and Blind Test and cross-dataset validation. By exhaustively screening combinations of approved drugs, SynergyX reveals its ability to identify promising drug combination candidates for potential lung cancer treatment. Another notable advantage lies in its multidimensional interpretability. Taking Sorafenib and Vorinostat as an example, SynergyX serves as a powerful tool for uncovering drug-gene interactions and deciphering cell selectivity mechanisms. In summary, SynergyX provides an illuminating and interpretable framework, poised to catalyze the expedition of drug synergy discovery and deepen our comprehension of rational combination therapy.


Subject(s)
Drug Discovery , Lung Neoplasms , Humans , Catalysis , Combined Modality Therapy , Research Design
15.
Brief Bioinform ; 25(6)2024 Sep 23.
Article in English | MEDLINE | ID: mdl-39391931

ABSTRACT

Despite advanced diagnostics, 3%-5% of cases remain classified as cancer of unknown primary (CUP). DNA methylation, an important epigenetic feature, is essential for determining the origin of metastatic tumors. We presented PathMethy, a novel Transformer model integrated with functional categories and crosstalk of pathways, to accurately trace the origin of tumors in CUP samples based on DNA methylation. PathMethy outperformed seven competing methods in F1-score across nine cancer datasets and predicted accurately the molecular subtypes within nine primary tumor types. It not only excelled at tracing the origins of both primary and metastatic tumors but also demonstrated a high degree of agreement with previously diagnosed sites in cases of CUP. PathMethy provided biological insights by highlighting key pathways, functional categories, and their interactions. Using functional categories of pathways, we gained a global understanding of biological processes. For broader access, a user-friendly web server for researchers and clinicians is available at https://cup.pathmethy.com.


Subject(s)
DNA Methylation , Neoplasms , Humans , Neoplasms/genetics , Software , Artificial Intelligence , Computational Biology/methods , Algorithms , Epigenesis, Genetic
16.
Brief Bioinform ; 25(2)2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38340092

ABSTRACT

De novo peptide sequencing is a promising approach for novel peptide discovery, highlighting the performance improvements for the state-of-the-art models. The quality of mass spectra often varies due to the unexpected absence of certain ions, presenting a significant challenge in de novo peptide sequencing. Here, we use a novel concept of complementary spectra to enhance ion information of the experimental spectrum and demonstrate it through conceptual and practical analyses. Afterward, we design suitable encoders to encode the experimental spectrum and the corresponding complementary spectrum and propose a de novo sequencing model, π-HelixNovo, based on the Transformer architecture. We first demonstrated that π-HelixNovo outperforms other state-of-the-art models using a series of comparative experiments. Then, we utilized π-HelixNovo to de novo sequence gut metaproteome peptides for the first time. The results show π-HelixNovo increases the identification coverage and accuracy of gut metaproteome and enhances the taxonomic resolution of gut metaproteome. We finally trained a powerful π-HelixNovo utilizing a larger training dataset, and as expected, π-HelixNovo achieves unprecedented performance, even for peptide-spectrum matches with never-before-seen peptide sequences. We also use the powerful π-HelixNovo to identify antibody peptides and multi-enzyme cleavage peptides, and π-HelixNovo is highly robust in these applications. Our results demonstrate the effectiveness of the complementary spectrum and take a significant step forward in de novo peptide sequencing.


Subject(s)
Sequence Analysis, Protein , Tandem Mass Spectrometry , Tandem Mass Spectrometry/methods , Sequence Analysis, Protein/methods , Peptides , Amino Acid Sequence , Antibodies , Algorithms
17.
Brief Bioinform ; 25(2)2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38314912

ABSTRACT

Increasing volumes of biomedical data are amassing in databases. Large-scale analyses of these data have wide-ranging applications in biology and medicine. Such analyses require tools to characterize and process entries at scale. However, existing tools, mainly centered on extracting predefined fields, often fail to comprehensively process database entries or correct evident errors, a task humans can easily perform. These tools also lack the ability to reason like domain experts, hindering their robustness and analytical depth. Recent advances with large language models (LLMs) provide a fundamentally new way to query databases. But while a tool such as ChatGPT is adept at answering questions about manually input records, challenges arise when scaling up this process. First, interactions with the LLM need to be automated. Second, limitations on input length may require a record pruning or summarization pre-processing step. Third, to behave reliably as desired, the LLM needs either well-designed, short, 'few-shot' examples, or fine-tuning based on a larger set of well-curated examples. Here, we report ChIP-GPT, based on fine-tuning of the generative pre-trained transformer (GPT) model Llama and on a program prompting the model iteratively and handling its generation of answer text. This model is designed to extract metadata from the Sequence Read Archive, emphasizing the identification of chromatin immunoprecipitation (ChIP) targets and cell lines. When trained with 100 examples, ChIP-GPT demonstrates 90-94% accuracy. Notably, it can seamlessly extract data from records with typos or absent field labels. Our proposed method is easily adaptable to customized questions and different databases.


Subject(s)
Medicine , Humans , Cell Line , Chromatin Immunoprecipitation , Databases, Factual , Language
18.
Proc Natl Acad Sci U S A ; 120(31): e2305273120, 2023 08.
Article in English | MEDLINE | ID: mdl-37487072

ABSTRACT

Spider webs are incredible biological structures, comprising thin but strong silk filament and arranged into complex hierarchical architectures with striking mechanical properties (e.g., lightweight but high strength, achieving diverse mechanical responses). While simple 2D orb webs can easily be mimicked, the modeling and synthesis of 3D-based web structures remain challenging, partly due to the rich set of design features. Here, we provide a detailed analysis of the heterogeneous graph structures of spider webs and use deep learning as a way to model and then synthesize artificial, bioinspired 3D web structures. The generative models are conditioned based on key geometric parameters (including average edge length, number of nodes, average node degree, and others). To identify graph construction principles, we use inductive representation sampling of large experimentally determined spider web graphs, to yield a dataset that is used to train three conditional generative models: 1) an analog diffusion model inspired by nonequilibrium thermodynamics, with sparse neighbor representation; 2) a discrete diffusion model with full neighbor representation; and 3) an autoregressive transformer architecture with full neighbor representation. All three models are scalable, produce complex, de novo bioinspired spider web mimics, and successfully construct graphs that meet the design objectives. We further propose an algorithm that assembles web samples produced by the generative models into larger-scale structures based on a series of geometric design targets, including helical and parametric shapes, mimicking, and extending natural design principles toward integration with diverging engineering objectives. Several webs are manufactured using 3D printing and tested to assess mechanical properties.


Subject(s)
Deep Learning , Spiders , Animals , Algorithms , Commerce , Cytoskeleton
19.
Brief Bioinform ; 24(6)2023 09 22.
Article in English | MEDLINE | ID: mdl-37965809

ABSTRACT

MOTIVATION: Bacteriophages (phages for short), which prey on and replicate within bacterial cells, have a significant role in modulating microbial communities and hold potential applications in treating antibiotic resistance. The advancement of high-throughput sequencing technology contributes to the discovery of phages tremendously. However, the taxonomic classification of assembled phage contigs still faces several challenges, including high genetic diversity, lack of a stable taxonomy system and limited knowledge of phage annotations. Despite extensive efforts, existing tools have not yet achieved an optimal balance between prediction rate and accuracy. RESULTS: In this work, we develop a learning-based model named PhaGenus, which conducts genus-level taxonomic classification for phage contigs. PhaGenus utilizes a powerful Transformer model to learn the association between protein clusters and support the classification of up to 508 genera. We tested PhaGenus on four datasets in different scenarios. The experimental results show that PhaGenus outperforms state-of-the-art methods in predicting low-similarity datasets, achieving an improvement of at least 13.7%. Additionally, PhaGenus is highly effective at identifying previously uncharacterized genera that are not represented in reference databases, with an improvement of 8.52%. The analysis of the infants' gut and GOV2.0 dataset demonstrates that PhaGenus can be used to classify more contigs with higher accuracy.


Subject(s)
Bacteriophages , Microbiota , Humans , Bacteriophages/genetics , High-Throughput Nucleotide Sequencing
20.
Brief Bioinform ; 24(6)2023 09 22.
Article in English | MEDLINE | ID: mdl-37870286

ABSTRACT

The advanced language models have enabled us to recognize protein-protein interactions (PPIs) and interaction sites using protein sequences or structures. Here, we trained the MindSpore ProteinBERT (MP-BERT) model, a Bidirectional Encoder Representation from Transformers, using protein pairs as inputs, making it suitable for identifying PPIs and their respective interaction sites. The pretrained model (MP-BERT) was fine-tuned as MPB-PPI (MP-BERT on PPI) and demonstrated its superiority over the state-of-the-art models on diverse benchmark datasets for predicting PPIs. Moreover, the model's capability to recognize PPIs among various organisms was evaluated on multiple organisms. An amalgamated organism model was designed, exhibiting a high level of generalization across the majority of organisms and attaining an accuracy of 92.65%. The model was also customized to predict interaction site propensity by fine-tuning it with PPI site data as MPB-PPISP. Our method facilitates the prediction of both PPIs and their interaction sites, thereby illustrating the potency of transfer learning in dealing with the protein pair task.


Subject(s)
Machine Learning , Proteins , Proteins/chemistry , Amino Acid Sequence