ABSTRACT
OBJECTIVES: Precise literature recommendation and summarization are crucial for biomedical professionals. While the latest iteration of generative pretrained transformer (GPT) incorporates 2 distinct modes (real-time search and pretrained model utilization), it encounters challenges in both tasks. Specifically, real-time search can pinpoint some relevant articles but occasionally provides fabricated papers, whereas the pretrained model excels at generating well-structured summaries but struggles to cite specific sources. In response, this study introduces RefAI, an innovative retrieval-augmented generative tool designed to synergize the strengths of large language models (LLMs) while overcoming their limitations. MATERIALS AND METHODS: RefAI utilized PubMed for systematic literature retrieval, employed a novel multivariable algorithm for article recommendation, and leveraged GPT-4 Turbo for summarization. Ten queries under 2 prevalent topics ("cancer immunotherapy and target therapy" and "LLMs in medicine") were chosen as use cases, and 3 established counterparts (ChatGPT-4, ScholarAI, and Gemini) served as baselines. The evaluation was conducted by 10 domain experts through standard statistical analyses for performance comparison. RESULTS: The overall performance of RefAI surpassed that of the baselines across the 5 evaluated dimensions (relevance and quality for literature recommendation; accuracy, comprehensiveness, and reference integration for summarization), with the majority of improvements being statistically significant (P-values < .05). DISCUSSION: RefAI demonstrated substantial improvements in literature recommendation and summarization over existing tools, addressing issues such as fabricated papers, metadata inaccuracies, restricted recommendations, and poor reference integration. CONCLUSION: By augmenting an LLM with external resources and a novel ranking algorithm, RefAI is uniquely capable of recommending high-quality literature and generating well-structured summaries, holding the potential to meet the critical needs of biomedical professionals in navigating and synthesizing vast amounts of scientific literature.
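The workflow described above couples PubMed retrieval with LLM summarization. The following is a minimal sketch of such a retrieval-augmented pipeline, assuming only the public NCBI E-utilities endpoints and a generic grounding prompt; the paper's multivariable ranking algorithm and its GPT-4 Turbo integration are not reproduced here.

```python
# A minimal sketch of a retrieval-augmented summarization pipeline in the
# spirit of RefAI (not the authors' implementation): PubMed E-utilities for
# retrieval, then a prompt that forces the summary to cite retrieved PMIDs.
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def search_pubmed(query: str, retmax: int = 10) -> list[str]:
    """Return PMIDs for a query via the esearch endpoint."""
    r = requests.get(f"{EUTILS}/esearch.fcgi",
                     params={"db": "pubmed", "term": query,
                             "retmode": "json", "retmax": retmax})
    r.raise_for_status()
    return r.json()["esearchresult"]["idlist"]

def fetch_abstracts(pmids: list[str]) -> str:
    """Fetch plain-text abstracts for a list of PMIDs via efetch."""
    r = requests.get(f"{EUTILS}/efetch.fcgi",
                     params={"db": "pubmed", "id": ",".join(pmids),
                             "rettype": "abstract", "retmode": "text"})
    r.raise_for_status()
    return r.text

def build_prompt(query: str, abstracts: str) -> str:
    """Ground the summary in retrieved records so every claim can cite a PMID."""
    return (f"Summarize the literature on '{query}' using ONLY the records "
            f"below. Cite the PMID after each claim.\n\n{abstracts}")

if __name__ == "__main__":
    pmids = search_pubmed("cancer immunotherapy AND targeted therapy")
    prompt = build_prompt("cancer immunotherapy and targeted therapy",
                          fetch_abstracts(pmids))
    # The prompt would then be sent to an LLM (e.g., GPT-4 Turbo) for summarization.
    print(prompt[:500])
```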
Subject(s)
Algorithms, Information Storage and Retrieval, PubMed, Information Storage and Retrieval/methods, Natural Language Processing
ABSTRACT
Electroencephalography (EEG)-based brain-computer interface (BCI) systems play a significant role in enabling individuals with neurological impairments to interact effectively with their environment. In real-world applications of BCI systems for clinical assistance and rehabilitation training, the EEG classifier often needs to learn from sequentially arriving subjects in an online manner. Because EEG signal patterns can differ significantly across subjects, the classifier can easily erase knowledge of previously learnt subjects after learning on later ones while decoding in an online streaming scenario, a problem known as catastrophic forgetting. In this work, we tackle this problem with a memory-based approach that considers the following conditions: (1) subjects arrive sequentially in an online manner, with no large-scale dataset available for joint training beforehand; (2) data volumes from different subjects may be imbalanced; (3) the decoding difficulty of the sequential streaming signal varies; and (4) continual classification over a long period is required. This online sequential EEG decoding problem is more challenging than classic cross-subject EEG decoding because no large-scale training data from the different subjects is available beforehand. The proposed model keeps a small balanced memory buffer during sequential learning, with memory data dynamically selected based on joint consideration of data volume and informativeness. Furthermore, for the more general scenario where subject identity is unknown to the EEG decoder (the subject-agnostic scenario), we propose a kernel-based subject shift detection method that identifies underlying subject changes on the fly in a computationally efficient manner. We develop challenging benchmarks of streaming EEG data from sequentially arriving subjects with both balanced and imbalanced data volumes, and perform extensive experiments with a detailed ablation study on the proposed model. The results show the effectiveness of our approach, enabling the decoder to maintain performance on all previously seen subjects over a long period of sequential decoding. The model demonstrates potential for real-world applications.
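A subject-balanced replay buffer is central to the memory-based approach described above. The sketch below illustrates one way such a buffer might look, with equal per-subject quotas and per-sample loss used as a simple stand-in for the informativeness criterion; this proxy and the quota rule are our assumptions, not the authors' exact selection mechanism.

```python
# A minimal sketch (not the authors' implementation) of a subject-balanced
# replay buffer for sequential EEG decoding: each seen subject gets an equal
# quota, and within a quota the highest-loss samples are retained as the most
# "informative" ones (an assumed proxy for the paper's selection criterion).
import numpy as np

class BalancedReplayBuffer:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.store = {}  # subject_id -> list of (loss, x, y)

    def add(self, subject_id, x, y, loss):
        """Insert one labelled trial and re-balance the per-subject quotas."""
        self.store.setdefault(subject_id, []).append((float(loss), x, y))
        self._rebalance()

    def _rebalance(self):
        quota = max(1, self.capacity // max(1, len(self.store)))
        for items in self.store.values():
            # keep the `quota` samples with the highest loss for this subject
            items.sort(key=lambda t: t[0], reverse=True)
            del items[quota:]

    def sample(self, batch_size, rng=None):
        """Draw a replay batch mixing all previously seen subjects."""
        rng = rng or np.random.default_rng()
        pool = [(x, y) for items in self.store.values() for (_, x, y) in items]
        idx = rng.choice(len(pool), size=min(batch_size, len(pool)), replace=False)
        xs, ys = zip(*[pool[i] for i in idx])
        return np.stack(xs), np.array(ys)
```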
Subject(s)
Brain-Computer Interfaces, Electroencephalography, Memory, Electroencephalography/methods, Humans, Memory/physiology, Signal Processing, Computer-Assisted, Brain/physiology, Algorithms
ABSTRACT
Continual learning (CL) aims to learn a non-stationary data distribution without forgetting previous knowledge. The effectiveness of existing approaches that rely on memory replay can decrease over time, as the model tends to overfit the stored examples; as a result, its ability to generalize well is significantly constrained. Additionally, these methods often overlook the inherent uncertainty in the memory data distribution, which differs significantly from the distribution of all previous data examples. To overcome these issues, we propose a principled memory evolution framework that dynamically adjusts the memory data distribution. This evolution is achieved by employing distributionally robust optimization (DRO) to make the memory buffer increasingly difficult to memorize. We consider two types of constraints in DRO: f-divergence and Wasserstein ball constraints. For the f-divergence constraint, we derive a family of methods that evolve the memory buffer data in the continuous probability measure space with Wasserstein gradient flow (WGF). For the Wasserstein ball constraint, we solve it directly in Euclidean space. Extensive experiments on existing benchmarks demonstrate the effectiveness of the proposed methods in alleviating forgetting. As a by-product of the proposed framework, our method is more robust to adversarial examples than the compared CL methods.
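To make the Wasserstein-ball idea concrete: the inner DRO maximization can be approximated in Euclidean space by projected gradient ascent on the memory inputs, so the buffer drifts toward harder-to-memorize points while staying within a fixed radius of the originals. The sketch below illustrates that idea under our own simplifications (fixed step size, hard projection); it is not the paper's exact update and omits the Wasserstein-gradient-flow variant.

```python
# A minimal sketch of evolving replay-buffer inputs by projected gradient
# ascent on the model loss, approximating a Wasserstein-ball-constrained DRO
# inner maximization in Euclidean space. Hyperparameters are illustrative.
import torch
import torch.nn.functional as F

def evolve_memory(model, mem_x, mem_y, eps=0.5, step=0.1, iters=5):
    x0 = mem_x.detach()
    x = x0.clone()
    for _ in range(iters):
        x.requires_grad_(True)
        loss = F.cross_entropy(model(x), mem_y)
        grad, = torch.autograd.grad(loss, x)
        with torch.no_grad():
            x = x + step * grad                      # ascend the loss
            delta = x - x0
            norm = delta.flatten(1).norm(dim=1, keepdim=True).clamp(min=1e-12)
            scale = torch.clamp(eps / norm, max=1.0)
            scale = scale.view(-1, *[1] * (x.dim() - 1))
            x = x0 + delta * scale                   # project back into the ball
    return x.detach()
```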
ABSTRACT
MOTIVATION: With the development of droplet-based systems, massive single-cell transcriptome data have become available, which enables analysis of cellular and molecular processes at single-cell resolution and is instrumental to understanding many biological processes. While state-of-the-art clustering methods have been applied to these data, they face challenges in the following aspects: (i) clustering quality still needs to be improved; (ii) most models require prior knowledge of the number of clusters, which is not always available; and (iii) there is a demand for faster computational speed. RESULTS: We propose to tackle these challenges with Parallelized Split Merge Sampling on the Dirichlet Process Mixture Model (the Para-DPMM model). Unlike classic DPMM methods that perform sampling on each single data point, the split-merge mechanism samples at the cluster level, which significantly improves convergence and the optimality of the result. The model is highly parallelized and can utilize the computing power of high-performance computing (HPC) clusters, enabling massive inference on huge datasets. Experimental results show the model outperforms current widely used models in both clustering quality and computational speed. AVAILABILITY AND IMPLEMENTATION: Source code is publicly available at https://github.com/tiehangd/Para_DPMM/tree/master/Para_DPMM_package. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
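For readers who want to see the key property exploited here, namely that a Dirichlet process mixture infers the effective number of clusters rather than requiring it up front, the sketch below uses scikit-learn's variational DP mixture on synthetic data. It is only a stand-in for the parallel split-merge sampler implemented in Para-DPMM, and the synthetic blobs are an assumed proxy for dimensionality-reduced single-cell expression profiles.

```python
# A minimal illustration (not the Para-DPMM sampler itself) of Dirichlet
# process mixture clustering: the number of occupied clusters is inferred,
# only a truncation level is supplied.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import BayesianGaussianMixture

# synthetic stand-in for dimensionality-reduced single-cell profiles
X, _ = make_blobs(n_samples=2000, centers=5, n_features=20, random_state=0)

dpmm = BayesianGaussianMixture(
    n_components=30,                                   # truncation level, not the true K
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="diag",
    max_iter=500,
    random_state=0,
)
labels = dpmm.fit_predict(X)
print("effective clusters:", np.unique(labels).size)
```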