Results 1 - 20 of 22,439
1.
Annu Rev Immunol ; 35: 337-370, 2017 04 26.
Article in English | MEDLINE | ID: mdl-28142321

ABSTRACT

Transcriptomics, the high-throughput characterization of RNAs, has been instrumental in defining pathogenic signatures in human autoimmunity and autoinflammation. It enabled the identification of new therapeutic targets in IFN-, IL-1- and IL-17-mediated diseases. Applied to immunomonitoring, transcriptomics is starting to unravel diagnostic and prognostic signatures that stratify patients, track molecular changes associated with disease activity, define personalized treatment strategies, and generally inform clinical practice. Herein, we review the use of transcriptomics to define mechanistic, diagnostic, and predictive signatures in human autoimmunity and autoinflammation. We discuss some of the analytical approaches applied to extract biological knowledge from high-dimensional data sets. Finally, we touch upon emerging applications of transcriptomics to study eQTLs, B and T cell repertoire diversity, and isoform usage.


Subject(s)
Autoimmune Diseases/diagnosis; Inflammation/diagnosis; Transcriptome; Autoimmune Diseases/immunology; Datasets as Topic; High-Throughput Nucleotide Sequencing; Humans; Inflammation/immunology; Information Storage and Retrieval; Molecular Targeted Therapy; Monitoring, Immunologic; Prognosis
2.
Cell ; 163(1): 202-17, 2015 Sep 24.
Article in English | MEDLINE | ID: mdl-26388441

ABSTRACT

Cancer cells acquire pathological phenotypes through accumulation of mutations that perturb signaling networks. However, global analysis of these events is currently limited. Here, we identify six types of network-attacking mutations (NAMs), including changes in kinase and SH2 modulation, network rewiring, and the genesis and extinction of phosphorylation sites. We developed a computational platform (ReKINect) to identify NAMs and systematically interpreted the exomes and quantitative (phospho-)proteomes of five ovarian cancer cell lines and the global cancer genome repository. We identified and experimentally validated several NAMs, including PKCγ M501I and PKD1 D665N, which encode specificity switches analogous to the appearance of kinases de novo within the kinome. We discover mutant molecular logic gates, a drift toward phospho-threonine signaling, weakening of phosphorylation motifs, and kinase-inactivating hotspots in cancer. Our method pinpoints functional NAMs, scales with the complexity of cancer genomes and cell signaling, and may enhance our capability to therapeutically target tumor-specific networks.


Subject(s)
Ovarian Neoplasms/metabolism; Protein Kinases/genetics; Protein Kinases/metabolism; Signal Transduction; Female; Humans; Information Storage and Retrieval; Models, Molecular; Point Mutation; Protein Kinases/chemistry; Software
3.
Nature ; 634(8035): 824-832, 2024 Oct.
Article in English | MEDLINE | ID: mdl-39443776

ABSTRACT

DNA storage has shown potential to transcend current silicon-based data storage technologies in storage density, longevity and energy consumption [1-3]. However, writing large-scale data directly into DNA sequences by de novo synthesis remains uneconomical in time and cost [4]. We present an alternative, parallel strategy that enables the writing of arbitrary data on DNA using premade nucleic acids. Through self-assembly guided enzymatic methylation, epigenetic modifications, as information bits, can be introduced precisely onto universal DNA templates to enact molecular movable-type printing. By programming with a finite set of 700 DNA movable types and five templates, we achieved the synthesis-free writing of approximately 275,000 bits on an automated platform with 350 bits written per reaction. The data encoded in complex epigenetic patterns were retrieved high-throughput by nanopore sequencing, and algorithms were developed to finely resolve 240 modification patterns per sequencing reaction. With the epigenetic information bits framework, distributed and bespoke DNA storage was implemented by 60 volunteers lacking professional biolab experience. Our framework presents a new modality of DNA data storage that is parallel, programmable, stable and scalable. Such an unconventional modality opens up avenues towards practical data storage and dual-mode data functions in biomolecular systems.


Subject(s)
DNA Methylation; DNA; Epigenesis, Genetic; Information Storage and Retrieval; Algorithms; DNA/chemistry; DNA/genetics; High-Throughput Nucleotide Sequencing/methods; Information Storage and Retrieval/methods; Nanopores; Templates, Genetic
4.
CA Cancer J Clin ; 72(3): 287-300, 2022 05.
Article in English | MEDLINE | ID: mdl-34964981

ABSTRACT

Generating evidence on the use, effectiveness, and safety of new cancer therapies is a priority for researchers, health care providers, payers, and regulators given the rapid pace of change in cancer diagnosis and treatments. The use of real-world data (RWD) is integral to understanding the utilization patterns and outcomes of these new treatments among patients with cancer who are treated in clinical practice and community settings. An initial step in the use of RWD is careful study design to assess the suitability of an RWD source. This pivotal process can be guided by using a conceptual model that encourages predesign conceptualization. The primary types of RWD included are electronic health records, administrative claims data, cancer registries, and specialty data providers and networks. Careful consideration of each data type is necessary because they are collected for a specific purpose, capturing a set of data elements within a certain population for that purpose, and they vary by population coverage and longitudinality. In this review, the authors provide a high-level assessment of the strengths and limitations of each data category to inform data source selection appropriate to the study question. Overall, the development and accessibility of RWD sources for cancer research are rapidly increasing, and the use of these data requires careful consideration of composition and utility to assess important questions in understanding the use and effectiveness of new therapies.


Subject(s)
Information Storage and Retrieval; Medical Oncology; Electronic Health Records; Humans; Registries; Research Design
5.
Cell ; 157(2): 283-284, 2014 Apr 10.
Article in English | MEDLINE | ID: mdl-24725396

ABSTRACT

To harness the strength of data sets growing by leaps and bounds every day, cultural norms in biomedical research are under pressure to change with the times.


Subject(s)
Biomedical Research; Genomics; Information Storage and Retrieval; Animals; Biomedical Research/instrumentation; Biomedical Research/methods; Biomedical Research/trends; Humans
6.
Nature ; 608(7921): 217-225, 2022 08.
Article in English | MEDLINE | ID: mdl-35896746

ABSTRACT

Biological processes depend on the differential expression of genes over time, but methods to make physical recordings of these processes are limited. Here we report a molecular system for making time-ordered recordings of transcriptional events into living genomes. We do this through engineered RNA barcodes, based on prokaryotic retrons [1], that are reverse transcribed into DNA and integrated into the genome using the CRISPR-Cas system [2]. The unidirectional integration of barcodes by CRISPR integrases enables reconstruction of transcriptional event timing based on a physical record through simple, logical rules rather than relying on pretrained classifiers or post hoc inferential methods. For disambiguation in the field, we will refer to this system as a Retro-Cascorder.


Subject(s)
CRISPR-Cas Systems; DNA; Gene Editing; Gene Expression; Information Storage and Retrieval; RNA; Reverse Transcription; CRISPR-Cas Systems/genetics; DNA/biosynthesis; DNA/genetics; Gene Editing/methods; Genome/genetics; Information Storage and Retrieval/methods; Integrases/metabolism; Prokaryotic Cells/metabolism; RNA/genetics; Time Factors
7.
Plant Cell ; 36(4): 812-828, 2024 Mar 29.
Article in English | MEDLINE | ID: mdl-38231860

ABSTRACT

Single-cell and single-nucleus RNA-sequencing technologies capture the expression of plant genes at an unprecedented resolution. Therefore, these technologies are gaining traction in plant molecular and developmental biology for elucidating the transcriptional changes across cell types in a specific tissue or organ, upon treatments, in response to biotic and abiotic stresses, or between genotypes. Despite the rapidly accelerating use of these technologies, collective and standardized experimental and analytical procedures to support the acquisition of high-quality data sets are still missing. In this commentary, we discuss common challenges associated with the use of single-cell transcriptomics in plants and propose general guidelines to improve reproducibility, quality, comparability, and interpretation and to make the data readily available to the community in this fast-developing field of research.


Subject(s)
Gene Expression Profiling; Plants; Reproducibility of Results; Plants/genetics; Stress, Physiological/genetics; Information Storage and Retrieval
8.
Proc Natl Acad Sci U S A ; 121(34): e2410164121, 2024 Aug 20.
Article in English | MEDLINE | ID: mdl-39145927

ABSTRACT

In the age of information explosion, the exponential growth of digital data far exceeds the capacity of current mainstream storage media. DNA is emerging as a promising alternative due to its higher storage density, longer retention time, and lower power consumption. To date, commercially mature DNA synthesis and sequencing technologies allow for writing and reading of information on DNA with customization and convenience at the research level. However, under the disconnected and nonspecialized mode, DNA data storage encounters practical challenges, including susceptibility to errors, long storage latency, resource-intensive requirements, and elevated information security risks. Herein, we introduce a platform named DNA-DISK that seamlessly streamlined DNA synthesis, storage, and sequencing on digital microfluidics coupled with a tabletop device for automated end-to-end information storage. The single-nucleotide enzymatic DNA synthesis with biocapping strategy is utilized, offering an ecofriendly and cost-effective approach for data writing. A DNA encapsulation using thermo-responsive agarose is developed for on-chip solidification, not only eliminating data clutter but also preventing DNA degradation. Pyrosequencing is employed for in situ and accurate data reading. As a proof of concept, DNA-DISK successfully stored and retrieved a musical sheet file (228 bits) with lower write-to-read latency (4.4 min of latency per bit) as well as superior automation compared to other platforms, demonstrating its potential to evolve into a DNA Hard Disk Drive in the future.


Subject(s)
DNA; Microfluidics; DNA/biosynthesis; Microfluidics/methods; Microfluidics/instrumentation; Sequence Analysis, DNA/methods; Information Storage and Retrieval/methods; High-Throughput Nucleotide Sequencing/methods
9.
Annu Rev Genomics Hum Genet ; 24: 393-414, 2023 08 25.
Article in English | MEDLINE | ID: mdl-36913714

ABSTRACT

Genome sequencing is increasingly used in research and integrated into clinical care. In the research domain, large-scale analyses, including whole genome sequencing with variant interpretation and curation, virtually guarantee identification of variants that are pathogenic or likely pathogenic and actionable. Multiple guidelines recommend that findings associated with actionable conditions be offered to research participants in order to demonstrate respect for autonomy, reciprocity, and participant interests in health and privacy. Some recommendations go further and support offering a wider range of findings, including those that are not immediately actionable. In addition, entities covered by the US Health Insurance Portability and Accountability Act (HIPAA) may be required to provide a participant's raw genomic data on request. Despite these widely endorsed guidelines and requirements, the implementation of return of genomic results and data by researchers remains uneven. This article analyzes the ethical and legal foundations for researcher duties to offer adult participants their interpreted results and raw data as the new normal in genomic research.


Subject(s)
Genomics; Whole Genome Sequencing; Genomics/methods; Whole Genome Sequencing/methods; Humans; United States Food and Drug Administration; United States; Information Storage and Retrieval; Health Insurance Portability and Accountability Act
10.
Brief Bioinform ; 25(5), 2024 Jul 25.
Article in English | MEDLINE | ID: mdl-39288232

ABSTRACT

DNA molecules as storage media are characterized by high encoding density and low energy consumption, making DNA storage a highly promising storage method. However, DNA storage has shortcomings, especially when storing multimedia data, wherein image reconstruction fails when address errors occur, resulting in complete data loss. Therefore, we propose a parity encoding and local mean iteration (PELMI) scheme to achieve robust DNA storage of images. The proposed parity encoding scheme satisfies the common biochemical constraints of DNA sequences and the undesired motif content. It addresses varying pixel weights at different positions for binary data, thus optimizing the utilization of Reed-Solomon error correction. Then, through lost and erroneous sequences, data supplementation and local mean iteration are employed to enhance the robustness. The encoding results show that the undesired motif content is reduced by 23%-50% compared with the representative schemes, which improves the sequence stability. PELMI achieves image reconstruction under general errors (insertion, deletion, substitution) and enhances the DNA sequences quality. Especially under 1% error, compared with other advanced encoding schemes, the peak signal-to-noise ratio and the multiscale structure similarity address metric were increased by 10%-13% and 46.8%-122%, respectively, and the mean squared error decreased by 113%-127%. This demonstrates that the reconstructed images had better clarity, fidelity, and similarity in structure, texture, and detail. In summary, PELMI ensures robustness and stability of image storage in DNA and achieves relatively high-quality image reconstruction under general errors.


Subject(s)
Algorithms; DNA; DNA/genetics; Image Processing, Computer-Assisted/methods; Information Storage and Retrieval/methods
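
The abstract above reports reconstruction quality as mean squared error (MSE) and peak signal-to-noise ratio (PSNR). As a rough illustration of how those two metrics are conventionally computed, here is a minimal sketch. It is not the PELMI code: the function names and the toy images are hypothetical, and 8-bit grayscale pixels (peak value 255) are assumed.

```python
import math


def mean_squared_error(original, reconstructed):
    """Average squared pixel difference between two equal-sized grayscale images.

    Images are given as lists of rows; each row is a list of integer pixels.
    """
    if len(original) != len(reconstructed):
        raise ValueError("images must have the same number of rows")
    total = 0
    count = 0
    for row_a, row_b in zip(original, reconstructed):
        if len(row_a) != len(row_b):
            raise ValueError("rows must have the same length")
        for a, b in zip(row_a, row_b):
            total += (a - b) ** 2
            count += 1
    if count == 0:
        raise ValueError("images must not be empty")
    return total / count


def psnr(original, reconstructed, peak=255):
    """Peak signal-to-noise ratio in decibels (assumed peak value: 255)."""
    mse = mean_squared_error(original, reconstructed)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(peak ** 2 / mse)


# Hypothetical 2x2 image and a slightly corrupted reconstruction.
image = [[0, 255], [128, 64]]
decoded = [[0, 250], [130, 64]]
print(mean_squared_error(image, decoded), psnr(image, decoded))
```

A lower MSE and a higher PSNR both indicate a reconstruction closer to the original image.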
11.
Brief Bioinform ; 25(3), 2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38555478

ABSTRACT

DNA storage is one of the most promising ways for future information storage due to its high data storage density, durable storage time and low maintenance cost. However, errors are inevitable during synthesizing, storing and sequencing. Currently, many error correction algorithms have been developed to ensure accurate information retrieval, but they will decrease storage density or increase computing complexity. Here, we apply the Bloom Filter, a space-efficient probabilistic data structure, to DNA storage to achieve the anti-error, or anti-contamination function. This method only needs the original correct DNA sequences (referred to as target sequences) to produce a corresponding data structure, which will filter out almost all the incorrect sequences (referred to as non-target sequences) during sequencing data analysis. Experimental results demonstrate the universal and efficient filtering capabilities of our method. Furthermore, we employ the Counting Bloom Filter to achieve the file version control function, which significantly reduces synthesis costs when modifying DNA-form files. To achieve cost-efficient file version control function, a modified system based on yin-yang codec is developed.


Subject(s)
Algorithms; DNA; Sequence Analysis, DNA/methods; DNA/genetics; DNA/chemistry; High-Throughput Nucleotide Sequencing/methods; Information Storage and Retrieval
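
The abstract above describes building a Bloom filter from the known target sequences and using it to discard non-target reads. The sketch below illustrates that general idea only. It is not the authors' implementation: the class, the `filter_reads` helper, the filter size, the number of hash functions, and the example sequences are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, a small false-positive rate."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode("ascii")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def filter_reads(reads, target_filter):
    """Keep only the reads that the filter reports as (probably) target sequences."""
    return [read for read in reads if read in target_filter]


# Hypothetical target sequences and sequencing reads.
targets = ["ACGTACGTAC", "TTGACCGATA"]
bloom = BloomFilter()
for seq in targets:
    bloom.add(seq)

reads = ["ACGTACGTAC", "ACGTACGTAA", "TTGACCGATA"]
kept = filter_reads(reads, bloom)
print(kept)
```

Every target sequence is always retained; a non-target read slips through only on a false positive, whose probability shrinks as the bit array grows relative to the number of stored sequences.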
12.
Brief Bioinform ; 25(3), 2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38555474

ABSTRACT

As key oncogenic drivers in non-small-cell lung cancer (NSCLC), various mutations in the epidermal growth factor receptor (EGFR) with variable drug sensitivities have been a major obstacle for precision medicine. To achieve clinical-level drug recommendations, a platform for clinical patient case retrieval and reliable drug sensitivity prediction is highly expected. Therefore, we built a database, D3EGFRdb, with the clinicopathologic characteristics and drug responses of 1339 patients with EGFR mutations via literature mining. On the basis of D3EGFRdb, we developed a deep learning-based prediction model, D3EGFRAI, for drug sensitivity prediction of new EGFR mutation-driven NSCLC. Model validations of D3EGFRAI showed a prediction accuracy of 0.81 and 0.85 for patients from D3EGFRdb and our hospitals, respectively. Furthermore, mutation scanning of the crucial residues inside drug-binding pockets, which may occur in the future, was performed to explore their drug sensitivity changes. D3EGFR is the first platform to achieve clinical-level drug response prediction of all approved small molecule drugs for EGFR mutation-driven lung cancer and is freely accessible at https://www.d3pharma.com/D3EGFR/index.php.


Subject(s)
Carcinoma, Non-Small-Cell Lung; Deep Learning; Lung Neoplasms; Humans; Lung Neoplasms/drug therapy; Lung Neoplasms/genetics; Carcinoma, Non-Small-Cell Lung/drug therapy; Carcinoma, Non-Small-Cell Lung/genetics; Carcinoma, Non-Small-Cell Lung/pathology; ErbB Receptors/genetics; Mutation; Information Storage and Retrieval
13.
Cell ; 147(5): 973-8, 2011 Nov 23.
Article in English | MEDLINE | ID: mdl-22118455

ABSTRACT

Computer vision refers to the theory and implementation of artificial systems that extract information from images to understand their content. Although computers are widely used by cell biologists for visualization and measurement, interpretation of image content, i.e., the selection of events worth observing and the definition of what they mean in terms of cellular mechanisms, is mostly left to human intuition. This Essay attempts to outline roles computer vision may play and should play in image-based studies of cellular life.


Subject(s)
Cytological Techniques/methods; Image Processing, Computer-Assisted/methods; Automation; Computers; Humans; Information Storage and Retrieval; Neutrophils/cytology
14.
Nucleic Acids Res ; 52(D1): D98-D106, 2024 Jan 05.
Article in English | MEDLINE | ID: mdl-37953349

ABSTRACT

Long noncoding RNAs (lncRNAs) have emerged as crucial regulators across diverse biological processes and diseases. While high-throughput sequencing has enabled lncRNA discovery, functional characterization remains limited. The EVLncRNAs database is the first and exclusive repository for all experimentally validated functional lncRNAs from various species. After previous releases in 2018 and 2021, this update marks a major expansion through exhaustive manual curation of nearly 25 000 publications from 15 May 2020 to 15 May 2023. It incorporates substantial growth across all categories: a 154% increase in functional lncRNAs, 160% in associated diseases, 186% in lncRNA-disease associations, 235% in interactions, 138% in structures, 234% in circular RNAs, 235% in resistant lncRNAs and 4724% in exosomal lncRNAs. More importantly, it incorporates additional information, including functional classifications, detailed interaction pathways, homologous lncRNAs, lncRNA locations, and COVID-19, phase-separation and organoid-related lncRNAs. The web interface was substantially improved for browsing, visualization, and searching. ChatGPT was tested for information extraction and functional overview, with its limitations noted. EVLncRNAs 3.0 represents the most extensive curated resource of experimentally validated functional lncRNAs and will serve as an indispensable platform for unravelling emerging lncRNA functions. The updated database is freely available at https://www.sdklab-biophysics-dzu.net/EVLncRNAs3/.


Subject(s)
Databases, Nucleic Acid; RNA, Long Noncoding; Data Management; Information Storage and Retrieval; RNA, Long Noncoding/genetics
15.
Pharmacol Rev ; 75(4): 714-738, 2023 Jul.
Article in English | MEDLINE | ID: mdl-36931724

ABSTRACT

Natural language processing (NLP) is an area of artificial intelligence that applies information technologies to process the human language, understand it to a certain degree, and use it in various applications. This area has rapidly developed in the past few years and now employs modern variants of deep neural networks to extract relevant patterns from large text corpora. The main objective of this work is to survey the recent use of NLP in the field of pharmacology. As our work shows, NLP is a highly relevant information extraction and processing approach for pharmacology. It has been used extensively, from intelligent searches through thousands of medical documents to finding traces of adversarial drug interactions in social media. We split our coverage into five categories to survey modern NLP: methodology, commonly addressed tasks, relevant textual data, knowledge bases, and useful programming libraries. We split each of the five categories into appropriate subcategories, describe their main properties and ideas, and summarize them in a tabular form. The resulting survey presents a comprehensive overview of the area, useful to practitioners and interested observers. SIGNIFICANCE STATEMENT: The main objective of this work is to survey the recent use of NLP in the field of pharmacology in order to provide a comprehensive overview of the current state in the area after the rapid developments that occurred in the past few years. The resulting survey will be useful to practitioners and interested observers in the domain.


Subject(s)
Artificial Intelligence; Natural Language Processing; Humans; Information Storage and Retrieval; Electronic Health Records; Records
16.
EMBO J ; 40(6): e107409, 2021 03 15.
Article in English | MEDLINE | ID: mdl-33565128

ABSTRACT

A new inter-governmental research infrastructure, ELIXIR, aims to unify bioinformatics resources and life science data across Europe, thereby facilitating their mining and (re-)use.


Subject(s)
Biomedical Research; Computational Biology; Information Storage and Retrieval; Biological Science Disciplines; Europe; Humans
17.
Brief Bioinform ; 24(4), 2023 07 20.
Article in English | MEDLINE | ID: mdl-37317617

ABSTRACT

Human prescription drug labeling contains a summary of the essential scientific information needed for the safe and effective use of the drug and includes the Prescribing Information, FDA-approved patient labeling (Medication Guides, Patient Package Inserts and/or Instructions for Use), and/or carton and container labeling. Drug labeling contains critical information about drug products, such as pharmacokinetics and adverse events. Automatic information extraction from drug labels may facilitate finding the adverse reaction of the drugs or finding the interaction of one drug with another drug. Natural language processing (NLP) techniques, especially recently developed Bidirectional Encoder Representations from Transformers (BERT), have exhibited exceptional merits in text-based information extraction. A common paradigm in training BERT is to pretrain the model on large unlabeled generic language corpora, so that the model learns the distribution of the words in the language, and then fine-tune on a downstream task. In this paper, first, we show the uniqueness of language used in drug labels, which therefore cannot be optimally handled by other BERT models. Then, we present the developed PharmBERT, which is a BERT model specifically pretrained on the drug labels (publicly available at Hugging Face). We demonstrate that our model outperforms the vanilla BERT, ClinicalBERT and BioBERT in multiple NLP tasks in the drug label domain. Moreover, how the domain-specific pretraining has contributed to the superior performance of PharmBERT is demonstrated by analyzing different layers of PharmBERT, and more insight into how it understands different linguistic aspects of the data is gained.


Subject(s)
Drug Labeling; Information Storage and Retrieval; Humans; Learning; Natural Language Processing
18.
Brief Bioinform ; 25(1), 2023 11 22.
Article in English | MEDLINE | ID: mdl-38168838

ABSTRACT

ChatGPT has drawn considerable attention from both the general public and domain experts with its remarkable text generation capabilities. This has subsequently led to the emergence of diverse applications in the field of biomedicine and health. In this work, we examine the diverse applications of large language models (LLMs), such as ChatGPT, in biomedicine and health. Specifically, we explore the areas of biomedical information retrieval, question answering, medical text summarization, information extraction and medical education and investigate whether LLMs possess the transformative power to revolutionize these tasks or whether the distinct complexities of the biomedical domain present unique challenges. Following an extensive literature survey, we find that significant advances have been made in the field of text generation tasks, surpassing the previous state-of-the-art methods. For other applications, the advances have been modest. Overall, LLMs have not yet revolutionized biomedicine, but recent rapid progress indicates that such methods hold great potential to provide valuable means for accelerating discovery and improving health. We also find that the use of LLMs, like ChatGPT, in the fields of biomedicine and health entails various risks and challenges, including fabricated information in their generated responses, as well as legal and privacy concerns associated with sensitive patient data. We believe this survey can provide a comprehensive and timely overview to biomedical researchers and healthcare practitioners on the opportunities and challenges associated with using ChatGPT and other LLMs for transforming biomedicine and health.


Subject(s)
Information Storage and Retrieval; Language; Humans; Privacy; Research Personnel
19.
Brief Bioinform ; 24(1), 2023 01 19.
Article in English | MEDLINE | ID: mdl-36410731

ABSTRACT

Deoxyribonucleic acid (DNA) is an attractive medium for long-term digital data storage due to its extremely high storage density, low maintenance cost and longevity. However, during the process of synthesis, amplification and sequencing of DNA sequences with homopolymers of large run-length, three different types of errors, namely, insertion, deletion and substitution errors frequently occur. Meanwhile, DNA sequences with large imbalances between GC and AT content exhibit high dropout rates and are prone to errors. These limitations severely hinder the widespread use of DNA-based data storage. In order to reduce and correct these errors in DNA storage, this paper proposes a novel coding schema called DNA-LC, which converts binary sequences into DNA base sequences that satisfy both the GC balance and run-length constraints. Furthermore, our coding mode is able to detect and correct multiple errors with a higher error correction capability than the other methods targeting single error correction within a single strand. The decoding algorithm has been implemented in practice. Simulation results indicate that our proposed coding scheme can offer outstanding error protection to DNA sequences. The source code is freely accessible at https://github.com/XiayangLi2301/DNA.


Subject(s)
DNA; Software; DNA/genetics; Base Sequence; Sequence Analysis, DNA/methods; Algorithms; Information Storage and Retrieval
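
The abstract above refers to two sequence constraints, GC balance and homopolymer run-length. The following minimal sketch shows how a candidate DNA strand could be checked against such constraints. It does not reproduce the DNA-LC coding scheme: the function names are hypothetical, and the thresholds (40%-60% GC, runs of at most 3) are illustrative assumptions rather than values taken from the paper.

```python
def gc_content(seq):
    """Fraction of bases in the sequence that are G or C."""
    if not seq:
        raise ValueError("sequence must not be empty")
    return sum(base in "GC" for base in seq) / len(seq)


def max_homopolymer_run(seq):
    """Length of the longest run of identical consecutive bases."""
    longest = 0
    current = 0
    previous = None
    for base in seq:
        current = current + 1 if base == previous else 1
        longest = max(longest, current)
        previous = base
    return longest


def satisfies_constraints(seq, gc_min=0.4, gc_max=0.6, max_run=3):
    """Check GC balance and run-length limits (thresholds are assumed values)."""
    return (
        gc_min <= gc_content(seq) <= gc_max
        and max_homopolymer_run(seq) <= max_run
    )


for candidate in ["ACGTACGT", "AAAAACGT", "GCGCGCGC"]:
    print(candidate, satisfies_constraints(candidate))
```

A constrained encoder would only emit strands for which such a check passes; the error-correction part of the scheme is a separate step not shown here.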
20.
Bioinformatics ; 40(Suppl 1): i119-i129, 2024 06 28.
Article in English | MEDLINE | ID: mdl-38940167

ABSTRACT

SUMMARY: Recent proprietary large language models (LLMs), such as GPT-4, have achieved a milestone in tackling diverse challenges in the biomedical domain, ranging from multiple-choice questions to long-form generations. To address challenges that still cannot be handled with the encoded knowledge of LLMs, various retrieval-augmented generation (RAG) methods have been developed by searching documents from the knowledge corpus and appending them unconditionally or selectively to the input of LLMs for generation. However, when applying existing methods to different domain-specific problems, poor generalization becomes apparent, leading to fetching incorrect documents or making inaccurate judgments. In this paper, we introduce Self-BioRAG, a framework reliable for biomedical text that specializes in generating explanations, retrieving domain-specific documents, and self-reflecting generated responses. We utilize 84k filtered biomedical instruction sets to train Self-BioRAG that can assess its generated explanations with customized reflective tokens. Our work proves that domain-specific components, such as a retriever, domain-related document corpus, and instruction sets are necessary for adhering to domain-related instructions. Using three major medical question-answering benchmark datasets, experimental results of Self-BioRAG demonstrate significant performance gains by achieving a 7.2% absolute improvement on average over the state-of-the-art open-foundation model with a parameter size of 7B or less. Similarly, Self-BioRAG outperforms RAG by 8% Rouge-1 score in generating more proficient answers on two long-form question-answering benchmarks on average. Overall, we analyze that Self-BioRAG finds the clues in the question, retrieves relevant documents if needed, and understands how to answer with information from retrieved documents and encoded knowledge as a medical expert does. 
We release our data and code for training our framework components and model weights (7B and 13B) to enhance capabilities in biomedical and clinical domains.
AVAILABILITY AND IMPLEMENTATION: Self-BioRAG is available at https://github.com/dmis-lab/self-biorag.


Subject(s)
Information Storage and Retrieval; Humans; Information Storage and Retrieval/methods; Natural Language Processing