1.
Immunity ; 2024 Aug 15.
Article in English | MEDLINE | ID: mdl-39163866

ABSTRACT

Despite decades of antibody research, it remains challenging to predict the specificity of an antibody solely based on its sequence. Two major obstacles are the lack of appropriate models and the inaccessibility of datasets for model training. In this study, we curated >5,000 influenza hemagglutinin (HA) antibodies by mining research publications and patents, which revealed many distinct sequence features between antibodies to HA head and stem domains. We then leveraged this dataset to develop a lightweight memory B cell language model (mBLM) for sequence-based antibody specificity prediction. Model explainability analysis showed that mBLM could identify key sequence features of HA stem antibodies. Additionally, by applying mBLM to HA antibodies with unknown epitopes, we discovered and experimentally validated many HA stem antibodies. Overall, this study not only advances our molecular understanding of the antibody response to the influenza virus but also provides a valuable resource for applying deep learning to antibody research.

2.
Cell ; 173(7): 1692-1704.e11, 2018 06 14.
Article in English | MEDLINE | ID: mdl-29779949

ABSTRACT

Heritability is essential for understanding the biological causes of disease but requires laborious patient recruitment and phenotype ascertainment. Electronic health records (EHRs) passively capture a wide range of clinically relevant data and provide a resource for studying the heritability of traits that are not typically accessible. EHRs contain next-of-kin information collected via patient emergency contact forms, but until now, these data have gone unused in research. We mined emergency contact data at three academic medical centers and identified 7.4 million familial relationships while maintaining patient privacy. Identified relationships were consistent with genetically derived relatedness. We used EHR data to compute heritability estimates for 500 disease phenotypes. Overall, estimates were consistent with the literature and between sites. Inconsistencies were indicative of limitations and opportunities unique to EHR research. These analyses provide a validation of the use of EHRs for genetics and disease research.


Subject(s)
Electronic Health Records , Genetic Diseases, Inborn/genetics , Algorithms , Databases, Factual , Family Relations , Genetic Diseases, Inborn/pathology , Genotype , Humans , Pedigree , Phenotype , Quantitative Trait, Heritable
3.
Immunity ; 55(6): 1105-1117.e4, 2022 06 14.
Article in English | MEDLINE | ID: mdl-35397794

ABSTRACT

Global research to combat the COVID-19 pandemic has led to the isolation and characterization of thousands of human antibodies to the SARS-CoV-2 spike protein, providing an unprecedented opportunity to study the antibody response to a single antigen. Using the information derived from 88 research publications and 13 patents, we assembled a dataset of ∼8,000 human antibodies to the SARS-CoV-2 spike protein from >200 donors. By analyzing immunoglobulin V and D gene usages, complementarity-determining region H3 sequences, and somatic hypermutations, we demonstrated that the common (public) responses to different domains of the spike protein were quite different. We further used these sequences to train a deep-learning model to accurately distinguish between the human antibodies to SARS-CoV-2 spike protein and those to influenza hemagglutinin protein. Overall, this study provides an informative resource for antibody research and enhances our molecular understanding of public antibody responses.


Subject(s)
COVID-19 , SARS-CoV-2 , Antibodies, Neutralizing , Antibodies, Viral , Antibody Formation , Humans , Pandemics , Spike Glycoprotein, Coronavirus
4.
Brief Bioinform ; 25(2)2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38493292

ABSTRACT

Computational predictors of immunogenic peptides, or epitopes, are traditionally built based on data from a broad range of pathogens without consideration for taxonomic information. While this approach may be reasonable if one aims to develop one-size-fits-all models, it may be counterproductive if the proteins for which the model is expected to generalize are known to come from a specific subset of phylogenetically related pathogens. There is mounting evidence that, for these cases, taxon-specific models can outperform generalist ones, even when trained with substantially smaller amounts of data. In this comment, we provide some perspective on the current state of taxon-specific modelling for the prediction of linear B-cell epitopes, and the challenges faced when building and deploying these predictors.


Subject(s)
Peptides , Proteins , Amino Acid Sequence , Epitopes, B-Lymphocyte
5.
Brief Bioinform ; 25(2)2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38314912

ABSTRACT

Increasing volumes of biomedical data are amassing in databases. Large-scale analyses of these data have wide-ranging applications in biology and medicine. Such analyses require tools to characterize and process entries at scale. However, existing tools, mainly centered on extracting predefined fields, often fail to comprehensively process database entries or correct evident errors, a task humans can easily perform. These tools also lack the ability to reason like domain experts, hindering their robustness and analytical depth. Recent advances with large language models (LLMs) provide a fundamentally new way to query databases. But while a tool such as ChatGPT is adept at answering questions about manually input records, challenges arise when scaling up this process. First, interactions with the LLM need to be automated. Second, limitations on input length may require a record pruning or summarization pre-processing step. Third, to behave reliably as desired, the LLM needs either well-designed, short, 'few-shot' examples, or fine-tuning based on a larger set of well-curated examples. Here, we report ChIP-GPT, based on fine-tuning of the generative pre-trained transformer (GPT) model Llama and on a program prompting the model iteratively and handling its generation of answer text. This model is designed to extract metadata from the Sequence Read Archive, emphasizing the identification of chromatin immunoprecipitation (ChIP) targets and cell lines. When trained with 100 examples, ChIP-GPT demonstrates 90-94% accuracy. Notably, it can seamlessly extract data from records with typos or absent field labels. Our proposed method is easily adaptable to customized questions and different databases.


Subject(s)
Medicine , Humans , Cell Line , Chromatin Immunoprecipitation , Databases, Factual , Language
6.
RNA ; 29(12): 1896-1909, 2023 12.
Article in English | MEDLINE | ID: mdl-37793790

ABSTRACT

The characterization of the conformational landscape of the RNA backbone is rather complex due to the ability of RNA to assume a large variety of conformations. These backbone conformations can be depicted by pseudotorsional angles linking RNA backbone atoms, from which Ramachandran-like plots can be built. We explore here different definitions of these pseudotorsional angles, finding that the most accurate ones are the traditional η (eta) and θ (theta) angles, which represent the relative position of RNA backbone atoms P and C4'. We explore the distribution of η - θ in known experimental structures, comparing the pseudotorsional space generated with structures determined exclusively by one experimental technique. We found that the complete picture only appears when combining data from different sources. The maps provide a quite comprehensive representation of the RNA accessible space, which can be used in RNA-structural predictions. Finally, our results highlight that protein interactions lead to significant changes in the population of the η - θ space, pointing toward the role of induced-fit mechanisms in protein-RNA recognition.
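
The pseudotorsions mentioned in this abstract can be illustrated with a minimal sketch. It assumes the commonly cited definitions, which the abstract itself does not spell out: η is the torsion C4'(i-1), P(i), C4'(i), P(i+1), and θ is the torsion P(i), C4'(i), P(i+1), C4'(i+1). The function names `dihedral` and `eta_theta` and the input layout (one P and one C4' coordinate per residue of a single chain) are hypothetical choices for illustration, not the authors' code.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed torsion angle in degrees, in [-180, 180], for four 3D points."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    # atan2 form of the torsion: numerically stable near 0 and 180 degrees.
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

def eta_theta(p_atoms, c4_atoms):
    """Return (residue_index, eta, theta) for each interior residue.

    Assumed layout (hypothetical): p_atoms[i] and c4_atoms[i] are the
    (x, y, z) coordinates of P and C4' of residue i in one chain.
    """
    out = []
    for i in range(1, len(p_atoms) - 1):
        eta = dihedral(c4_atoms[i - 1], p_atoms[i], c4_atoms[i], p_atoms[i + 1])
        theta = dihedral(p_atoms[i], c4_atoms[i], p_atoms[i + 1], c4_atoms[i + 1])
        out.append((i, eta, theta))
    return out
```

Plotting the resulting (η, θ) pairs as a 2D histogram would give a Ramachandran-like map of the kind the abstract describes. Terminal residues are skipped because they lack one of the four atoms each angle needs.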


Subject(s)
Proteins , RNA , RNA/genetics , RNA/chemistry , Proteins/chemistry , Nucleic Acid Conformation
7.
Brief Bioinform ; 24(4)2023 07 20.
Article in English | MEDLINE | ID: mdl-37401369

ABSTRACT

As the volume of protein sequence and structure data grows rapidly, the functions of the overwhelming majority of proteins cannot be experimentally determined. Automated annotation of protein function at a large scale is becoming increasingly important. Existing computational prediction methods are typically based on expanding the relatively small number of experimentally determined functions to large collections of proteins with various clues, including sequence homology, protein-protein interaction, gene co-expression, etc. Although there has been some progress in protein function prediction in recent years, the development of accurate and reliable solutions still has a long way to go. Here we exploit AlphaFold predicted three-dimensional structural information, together with other non-structural clues, to develop a large-scale approach termed PredGO to annotate Gene Ontology (GO) functions for proteins. We use a pre-trained language model, geometric vector perceptrons and attention mechanisms to extract heterogeneous features of proteins and fuse these features for function prediction. The computational results demonstrate that the proposed method outperforms other state-of-the-art approaches for predicting GO functions of proteins in terms of both coverage and accuracy. The improvement in coverage arises because the number of structures predicted by AlphaFold has greatly increased and because PredGO can make extensive use of non-structural information for functional prediction. Moreover, we show that over 205 000 (~100%) entries in UniProt for human are annotated by PredGO, over 186 000 (~90%) of which are based on predicted structure. The webserver and database are available at http://predgo.denglab.org/.


Subject(s)
Computational Biology , Proteins , Humans , Computational Biology/methods , Proteins/chemistry , Amino Acid Sequence , Neural Networks, Computer , Databases, Factual , Databases, Protein
8.
Proteomics ; 24(12-13): e2300105, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38458994

ABSTRACT

Peptides have a plethora of activities in biological systems that can potentially be exploited biotechnologically. Several peptides are used clinically, as well as in industry and agriculture. The increase in available 'omics data has recently provided a large opportunity for mining novel enzymes, biosynthetic gene clusters, and molecules. While these data primarily consist of DNA sequences, other types of data provide important complementary information. Due to their size, the approaches proven successful at discovering novel proteins of canonical size cannot be naïvely applied to the discovery of peptides. Peptides can be encoded directly in the genome as short open reading frames (smORFs), or they can be derived from larger proteins by proteolysis. Both of these peptide classes pose challenges as simple methods for their prediction result in large numbers of false positives. Similarly, functional annotation of larger proteins, traditionally based on sequence similarity to infer orthology and then transferring functions between characterized proteins and uncharacterized ones, cannot be applied for short sequences. The use of these techniques is much more limited and alternative approaches based on machine learning are used instead. Here, we review the limitations of traditional methods as well as the alternative methods that have recently been developed for discovering novel bioactive peptides with a focus on prokaryotic genomes and metagenomes.


Subject(s)
Computational Biology , Peptides , Proteomics , Metagenome , Prokaryotic Cells/chemistry , Computational Biology/methods
9.
Proteomics ; 24(14): e2300280, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38742951

ABSTRACT

Mass spectrometry proteomics data are typically evaluated against publicly available annotated sequences, but the proteogenomics approach is a useful alternative. A single genome is commonly utilized in custom proteomic and proteogenomic data analysis. We pose the question of whether utilizing numerous different genome assemblies in a search database would be beneficial. We reanalyzed raw data from the exoprotein fraction of four reference Enterobacterial Repetitive Intergenic Consensus (ERIC) I-IV genotypes of the honey bee bacterial pathogen Paenibacillus larvae and evaluated them against three reference databases (from NCBI-protein, RefSeq, and UniProt) together with an array of protein sequences generated by six-frame direct translation of 15 genome assemblies from GenBank. The wide search yielded 453 protein hits/groups, which UpSet analysis categorized into 50 groups based on the success of protein identification by the 18 database components. Nine hits that were not identified by a unique peptide were not considered for marker selection, which discarded the only protein that was not identified by the reference databases. We propose that the variability in successful identifications between genome assemblies is useful for marker mining. The results suggest that various strains of P. larvae can exhibit specific traits that set them apart from the established genotypes ERIC I-V.
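
The six-frame direct translation mentioned in this abstract can be sketched as follows. This is an illustrative example only, not the authors' pipeline: the function names are hypothetical, the codon assignments are those of the standard genetic code (the bacterial code differs mainly in its start codons, which this sketch does not model), and unknown codons are written as `X` by assumption.

```python
from itertools import product

# Standard genetic code in NCBI order (bases T, C, A, G; third base varies fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def translate(seq):
    """Translate from position 0; stops are '*', unknown codons are 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(0, len(seq) - 2, 3)
    )

def six_frame_translate(seq):
    """Return the six translations keyed '+1'..'+3' and '-1'..'-3'."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames
```

For example, `six_frame_translate("ATGGCCTAA")` gives `"MA*"` for frame `+1` and `"LGH"` for frame `-1`. A search database would typically split each frame at stop codons and keep the resulting open stretches as candidate protein sequences.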


Subject(s)
Bacterial Proteins , Genome, Bacterial , Paenibacillus larvae , Proteogenomics , Virulence Factors , Proteogenomics/methods , Animals , Bees/microbiology , Paenibacillus larvae/genetics , Paenibacillus larvae/pathogenicity , Paenibacillus larvae/metabolism , Virulence Factors/genetics , Virulence Factors/metabolism , Bacterial Proteins/genetics , Bacterial Proteins/metabolism , Genome, Bacterial/genetics , Databases, Protein , Proteomics/methods
10.
BMC Bioinformatics ; 25(1): 23, 2024 Jan 12.
Article in English | MEDLINE | ID: mdl-38216898

ABSTRACT

BACKGROUND: With the exponential growth of high-throughput technologies, multiple pathway analysis methods have been proposed to estimate pathway activities from gene expression profiles. These pathway activity inference methods can be divided into two main categories: non-Topology-Based (non-TB) and Pathway Topology-Based (PTB) methods. Although some review and survey articles discussed the topic from different aspects, there is a lack of systematic assessment and comparisons on the robustness of these approaches. RESULTS: Thus, this study presents comprehensive robustness evaluations of seven widely used pathway activity inference methods using six cancer datasets based on two assessments. The first assessment seeks to investigate the robustness of pathway activity in pathway activity inference methods, while the second assessment aims to assess the robustness of risk-active pathways and genes predicted by these methods. The mean reproducibility power and total number of identified informative pathways and genes were evaluated. Based on the first assessment, the mean reproducibility power of pathway activity inference methods generally decreased as the number of pathway selections increased. Entropy-based Directed Random Walk (e-DRW) distinctly outperformed other methods in exhibiting the greatest reproducibility power across all cancer datasets. On the other hand, the second assessment shows that no methods provide satisfactory results across datasets. CONCLUSION: However, PTB methods generally appear to perform better in producing greater reproducibility power and identifying potential cancer markers compared to non-TB methods.


Subject(s)
Neoplasms , Humans , Reproducibility of Results , Neoplasms/genetics , Entropy , Gene Expression