ABSTRACT
Throughout the body, T cells monitor MHC-bound ligands expressed on the surface of essentially all cell types. MHC ligands that trigger a T cell immune response are referred to as T cell epitopes. Identifying such epitopes enables tracking, phenotyping, and stimulating T cells involved in immune responses in infectious disease, allergy, autoimmunity, transplantation, and cancer. The specific T cell epitopes recognized in an individual are determined by genetic factors, such as the MHC molecules the individual expresses, together with the individual's environmental exposure history. The complexity and importance of T cell epitope mapping have motivated the development of computational approaches that predict which T cell epitopes are likely to be recognized in a given individual or in a broader population. Such predictions guide experimental epitope mapping studies and enable computational analysis of the immunogenic potential of a given protein sequence region.
Subjects
Epitopes, T-Lymphocyte/immunology , T-Lymphocytes/immunology , T-Lymphocytes/metabolism , Animals , Biomarkers , Computational Biology/methods , Disease Susceptibility , Histocompatibility Antigens/immunology , Humans , Ligands , Machine Learning , Protein Binding
ABSTRACT
Heritability is essential for understanding the biological causes of disease but requires laborious patient recruitment and phenotype ascertainment. Electronic health records (EHRs) passively capture a wide range of clinically relevant data and provide a resource for studying the heritability of traits that are not typically accessible. EHRs contain next-of-kin information collected via patient emergency contact forms, but until now, these data have gone unused in research. We mined emergency contact data at three academic medical centers and identified 7.4 million familial relationships while maintaining patient privacy. Identified relationships were consistent with genetically derived relatedness. We used EHR data to compute heritability estimates for 500 disease phenotypes. Overall, estimates were consistent with the literature and between sites. Inconsistencies were indicative of limitations and opportunities unique to EHR research. These analyses provide a validation of the use of EHRs for genetics and disease research.
Subjects
Electronic Health Records , Genetic Diseases, Inborn/genetics , Algorithms , Databases, Factual , Family Relations , Genetic Diseases, Inborn/pathology , Genotype , Humans , Pedigree , Phenotype , Quantitative Trait, Heritable
ABSTRACT
Bioinformatics has become an interdisciplinary subject due to its universal role in molecular biology research. The current status of bioinformatics research in Russia is not known. Here, we review the history of bioinformatics in Russia, present the current landscape, and highlight future directions and challenges. Bioinformatics research in Russia is driven by four major industries: information technology, pharmaceuticals, biotechnology, and agriculture. Over the past three decades, despite a delayed start, the field has gained momentum, especially in protein and nucleic acid research. Dedicated and shared centers for genomics, proteomics, and bioinformatics are active in different regions of Russia. Present-day bioinformatics in Russia is characterized by research issues related to genetics, metagenomics, OMICs, medical informatics, computational biology, environmental informatics, and structural bioinformatics. Notable developments are in the fields of software (tools, algorithms, and pipelines), use of high computational power (e.g. by the Siberian Supercomputer Center), and large-scale sequencing projects (the sequencing of 100 000 human genomes). Government funding is increasing, policies are being changed, and a National Genomic Information Database is being established. An increased focus on eukaryotic genome sequencing, the development of a common place for developers and researchers to share tools and data, and the use of biological modeling, machine learning, and biostatistics are key areas for future focus. Universities and research institutes have started to implement bioinformatics modules. A critical mass of bioinformaticians is essential to catch up with the global pace in the discipline.
Subjects
Computational Biology , Computational Biology/methods , Russia , Humans , History, 21st Century , History, 20th Century , Genomics
ABSTRACT
This opinion article addresses a major issue in molecular biology and drug discovery by highlighting the complications that arise from combining polyproteins and their functional products within the same database entry. This problem, exemplified by the discovery of novel inhibitors for the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) main protease, affects our ability to retrieve precise data and hinders the development of targeted therapies. The article also emphasizes the need for improved database practices and underscores their significance in advancing scientific research. Furthermore, it emphasizes the need to learn from the SARS-CoV-2 pandemic in order to improve global preparedness for future health crises.
Subjects
COVID-19 , Humans , Polyproteins/metabolism , Cysteine Endopeptidases/metabolism , SARS-CoV-2/metabolism , Drug Discovery , Molecular Docking Simulation
ABSTRACT
N6-methyladenosine (m$^{6}$A) is a widely studied methylation of messenger RNAs that has been linked to diverse cellular processes and human diseases. Numerous databases that collate m$^{6}$A profiles of distinct cell types have been created to facilitate quick and easy mining of m$^{6}$A signatures associated with cell-specific phenotypes. However, these databases contain inherent complexities that have not been explicitly reported, which may lead to inaccurate identification and interpretation of m$^{6}$A-associated biology by end-users who are unaware of them. Here, we review various m$^{6}$A-related databases and highlight several critical matters. In particular, differences in peak-calling pipelines across databases drive substantial variability in both peak number and coordinates with only moderate reproducibility, and the inclusion of peak calls from early m$^{6}$A sequencing protocols may lead to the reporting of false positives or negatives. Awareness of these matters will help end-users avoid the inclusion of potentially unreliable data in their studies and better utilize m$^{6}$A databases to derive biologically meaningful results.
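As an illustration of how variability in peak coordinates between two peak-calling pipelines might be quantified, the following minimal sketch computes a base-pair Jaccard index between two peak sets on a single chromosome. It is not taken from the reviewed databases; the function names, the half-open interval convention, and the example coordinates are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical representation: half-open [start, end) peaks on one chromosome.
Interval = Tuple[int, int]


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into a sorted, disjoint list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_bases(intervals: List[Interval]) -> int:
    """Count the bases covered by at least one interval."""
    return sum(end - start for start, end in merge_intervals(intervals))


def jaccard_bp(peaks_a: List[Interval], peaks_b: List[Interval]) -> float:
    """Base-pair Jaccard index: shared bases divided by bases in either set."""
    union = covered_bases(peaks_a + peaks_b)
    if union == 0:
        return 0.0
    # Inclusion-exclusion gives the intersection without a second sweep.
    intersection = covered_bases(peaks_a) + covered_bases(peaks_b) - union
    return intersection / union


# Made-up peak calls from two hypothetical pipelines on the same sample.
pipeline_a = [(100, 200), (500, 600)]
pipeline_b = [(150, 250), (900, 1000)]
similarity = jaccard_bp(pipeline_a, pipeline_b)
```

A value near 1 would indicate that the two pipelines report nearly the same coordinates, whereas a low value flags the kind of discrepancy described above. Real comparisons would need to be done per chromosome and, for strand-specific data, per strand.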
Subjects
Adenosine , Humans , Adenosine/analogs & derivatives , Adenosine/genetics , Adenosine/metabolism , Databases, Genetic , RNA, Messenger/genetics , RNA, Messenger/metabolism
ABSTRACT
Increasing volumes of biomedical data are amassing in databases. Large-scale analyses of these data have wide-ranging applications in biology and medicine. Such analyses require tools to characterize and process entries at scale. However, existing tools, mainly centered on extracting predefined fields, often fail to comprehensively process database entries or correct evident errors, a task humans can easily perform. These tools also lack the ability to reason like domain experts, hindering their robustness and analytical depth. Recent advances with large language models (LLMs) provide a fundamentally new way to query databases. But while a tool such as ChatGPT is adept at answering questions about manually input records, challenges arise when scaling up this process. First, interactions with the LLM need to be automated. Second, limitations on input length may require a record pruning or summarization pre-processing step. Third, to behave reliably as desired, the LLM needs either well-designed, short, 'few-shot' examples, or fine-tuning based on a larger set of well-curated examples. Here, we report ChIP-GPT, based on fine-tuning of the generative pre-trained transformer (GPT) model Llama and on a program prompting the model iteratively and handling its generation of answer text. This model is designed to extract metadata from the Sequence Read Archive, emphasizing the identification of chromatin immunoprecipitation (ChIP) targets and cell lines. When trained with 100 examples, ChIP-GPT demonstrates 90-94% accuracy. Notably, it can seamlessly extract data from records with typos or absent field labels. Our proposed method is easily adaptable to customized questions and different databases.
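The record pruning step and the few-shot prompting alternative mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the ChIP-GPT implementation: the function names, the character budget, the prompt layout, and the example records are all hypothetical, and no model is actually called.

```python
from typing import Dict, List, Tuple


def prune_record(record: Dict[str, str], max_chars: int = 400) -> str:
    """Flatten a metadata record to 'key: value' lines, drop empty fields,
    and truncate to a character budget (a crude stand-in for input limits)."""
    lines = [f"{key}: {value.strip()}"
             for key, value in record.items()
             if value and value.strip()]
    return "\n".join(lines)[:max_chars]


def build_few_shot_prompt(examples: List[Tuple[str, str]],
                          record_text: str,
                          question: str) -> str:
    """Assemble a prompt from solved examples followed by the open query."""
    parts = [
        f"Record:\n{ex_record}\nQuestion: {question}\nAnswer: {ex_answer}\n"
        for ex_record, ex_answer in examples
    ]
    # The final block is left unanswered for the model to complete.
    parts.append(f"Record:\n{record_text}\nQuestion: {question}\nAnswer:")
    return "\n".join(parts)


# Made-up record and example, for illustration only.
raw_record = {
    "title": "H3K27ac ChIP-seq in HeLa",
    "antibody": "   ",
    "cell_line": "HeLa",
}
pruned = prune_record(raw_record)
examples = [("title: CTCF ChIP-seq in K562\ncell_line: K562", "CTCF")]
prompt = build_few_shot_prompt(examples, pruned, "What is the ChIP target?")
```

In an automated setting, the resulting prompt string would be sent to whichever model is in use and the generated answer text parsed by the calling program; fine-tuning, as described above, replaces the in-prompt examples with a larger curated training set.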
Subjects
Medicine , Humans , Cell Line , Chromatin Immunoprecipitation , Databases, Factual , Language
ABSTRACT
With the rapid development of human intestinal microbiology and diverse microbiome-related studies and investigations, a large amount of data has been generated and accumulated. Meanwhile, different computational and bioinformatics models have been developed for pattern recognition and knowledge discovery using these data. Given the heterogeneity of these resources and models, we aimed to provide a landscape of the data resources, a comparison of the computational models, and a summary of the translational informatics applied to microbiota data. We first review the existing databases, knowledge bases, knowledge graphs, and standardizations of microbiome data. Then, the high-throughput sequencing techniques for the microbiome and the informatics tools for their analyses are compared. Finally, translational informatics for the microbiome, including biomarker discovery, personalized treatment, and smart healthcare for complex diseases, is discussed.
Subjects
Biomedical Research , Medical Informatics , Humans , Genomics/methods , Computational Biology/methods , Translational Research, Biomedical
ABSTRACT
Complex biological processes in cells are embedded in the interactome, representing the complete set of protein-protein interactions. Mapping and analyzing the protein structures are essential to fully comprehending the molecular details of these processes. Therefore, knowing the structural coverage of the interactome is important to show the current limitations. Structural modeling of protein-protein interactions requires accurate protein structures. In this study, we mapped all experimental structures to the reference human proteome. We then found that structural coverage increased when complementary methods such as homology modeling and deep learning (AlphaFold) were included. Next, we collected the interactions from the literature and databases to form the reference human interactome, resulting in 117 897 non-redundant interactions. When we analyzed the structural coverage of the interactome, we found that experimentally determined protein complex structures are scarce, corresponding to 3.95% of all binary interactions. We also analyzed known and modeled structures to potentially construct the structural interactome with a docking method. Our analysis showed that 12.97% of the interactions from HuRI and 73.62% and 32.94% from the filtered versions of STRING and HIPPIE could potentially be modeled with high structural coverage or accuracy, respectively. Overall, this paper provides an overview of the current state of structural coverage of the human proteome and interactome.
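The idea of expressing structural coverage as a percentage of non-redundant binary interactions can be sketched as below. This is only an illustrative sketch, not the procedure used in the study: the function names and protein identifiers are hypothetical, and treating (A, B) and (B, A) as the same interaction is an assumption about what "non-redundant" means here.

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[str, str]


def nonredundant_pairs(interactions: Iterable[Pair]) -> Set[Pair]:
    """Collapse (A, B) and (B, A) into one unordered, deduplicated pair."""
    return {tuple(sorted(pair)) for pair in interactions}


def coverage_percent(interactions: Iterable[Pair],
                     structured: Iterable[Pair]) -> float:
    """Percentage of non-redundant interactions that have a complex structure."""
    all_pairs = nonredundant_pairs(interactions)
    if not all_pairs:
        return 0.0
    with_structure = all_pairs & nonredundant_pairs(structured)
    return 100.0 * len(with_structure) / len(all_pairs)


# Made-up identifiers, for illustration only.
interactome = [("P1", "P2"), ("P2", "P1"), ("P3", "P4"),
               ("P5", "P6"), ("P7", "P7")]
complex_structures = [("P4", "P3")]
coverage = coverage_percent(interactome, complex_structures)
```

The same calculation could be repeated with a set of pairs for which both partners have a modeled structure, giving a separate figure for interactions that are candidates for docking rather than experimentally resolved complexes.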