ABSTRACT
Batch effects introduce significant variability into high-dimensional data, complicating accurate analysis and leading to potentially misleading conclusions if not adequately addressed. Despite technological and algorithmic advancements in biomedical research, effectively managing batch effects remains a complex challenge that requires comprehensive consideration. This paper underscores the necessity of a flexible and holistic approach for selecting batch effect correction algorithms (BECAs), advocating for proper BECA evaluation and consideration of artificial intelligence-based strategies. We also discuss key challenges in batch effect correction, including the importance of uncovering hidden batch factors and understanding the impact of design imbalance, missing values, and aggressive correction. Our aim is to provide researchers with a robust framework for effective batch effect management, enhancing the reliability of high-dimensional data analyses.
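As a rough, generic illustration of what the simplest kind of BECA does (not the paper's recommended method), the sketch below applies a per-batch location-scale adjustment, a simplified non-Bayesian analogue of ComBat; the input names are hypothetical.

    # Minimal location-scale batch adjustment sketch; `expr` (samples x features)
    # and `batch` (batch label per sample) are hypothetical placeholder inputs.
    import pandas as pd

    def simple_batch_adjust(expr: pd.DataFrame, batch: pd.Series) -> pd.DataFrame:
        adjusted = expr.copy().astype(float)
        grand_mean = expr.mean(axis=0)
        pooled_std = expr.std(axis=0).replace(0, 1.0)
        for b, idx in expr.groupby(batch).groups.items():
            block = expr.loc[idx]
            b_std = block.std(axis=0).replace(0, 1.0)
            # remove batch-specific location and scale, restore global ones
            adjusted.loc[idx] = (block - block.mean(axis=0)) / b_std * pooled_std + grand_mean
        return adjusted

A proper BECA evaluation, as argued above, would compare such a naive correction against more sophisticated algorithms and check explicitly for over-correction of biological signal.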
Subject(s)
Algorithms, Humans, Artificial Intelligence, Biomedical Research, Computational Biology/methods, Reproducibility of Results
ABSTRACT
Despite the increasing prevalence of hypertension in youth and high adult cardiovascular mortality rates, the long-term consequences of youth-onset hypertension remain unknown. This is due to limitations of prior research, such as small sample sizes, reliance on manual record review, and limited analytic methods that did not address major biases. The Study of the Epidemiology of Pediatric Hypertension (SUPERHERO) is a multisite retrospective registry of youth evaluated by subspecialists for hypertension disorders. Sites obtain harmonized electronic health record data using standardized biomedical informatics scripts validated with randomized manual record review. Inclusion criteria are an index visit for an International Classification of Diseases, 10th Revision (ICD-10) code-defined hypertension disorder on or after January 1, 2015 and age <19 years. We exclude patients with ICD-10 code-defined pregnancy, kidney failure on dialysis, or kidney transplantation. Data include demographics, anthropometrics, U.S. Census Bureau tract, histories, blood pressure, ICD-10 codes, medications, laboratory and imaging results, and ambulatory blood pressure. SUPERHERO leverages expertise in epidemiology, statistics, clinical care, and biomedical informatics to create the largest and most diverse registry of youth with newly diagnosed hypertension disorders. SUPERHERO's goals are to (i) reduce cardiovascular disease (CVD) burden across the life course and (ii) establish gold-standard biomedical informatics methods for youth with hypertension disorders.
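For illustration only, a minimal Python sketch of the stated inclusion/exclusion logic applied to an EHR extract; the column names and ICD-10 code lists are hypothetical placeholders, not the registry's validated scripts.

    # Hypothetical cohort filter mirroring the criteria described above.
    import pandas as pd

    HTN_CODES = ("I10", "I15")                  # placeholder hypertension-disorder ICD-10 prefixes
    EXCLUDE_CODES = ("Z33", "Z99.2", "Z94.0")   # placeholder pregnancy/dialysis/transplant codes

    def build_cohort(visits: pd.DataFrame) -> pd.DataFrame:
        """visits: one row per index visit with columns
        patient_id, visit_date, age_years, icd10_codes (list of code strings)."""
        v = visits.copy()
        v["visit_date"] = pd.to_datetime(v["visit_date"])
        has_htn = v["icd10_codes"].apply(lambda codes: any(c.startswith(HTN_CODES) for c in codes))
        excluded = v["icd10_codes"].apply(lambda codes: any(c.startswith(EXCLUDE_CODES) for c in codes))
        keep = has_htn & ~excluded & (v["visit_date"] >= "2015-01-01") & (v["age_years"] < 19)
        return v[keep]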
ABSTRACT
Many approaches in biomedical informatics (BMI) rely on the ability to define, gather, and manipulate biomedical data to support health through a cyclical research-practice lifecycle. Researchers within this field are often fortunate to work closely with healthcare and public health systems to influence data generation and capture and have access to a vast amount of biomedical data. Many informaticists also have the expertise to engage with stakeholders, develop new methods and applications, and influence policy. However, research and policy that explicitly seek to address the systemic drivers of health would support health more effectively. Intersectionality is a theoretical framework that can facilitate such research. It holds that individual human experiences reflect larger socio-structural systems of privilege and oppression, and cannot be truly understood if these systems are examined in isolation. Intersectionality explicitly accounts for the interrelated nature of systems of privilege and oppression, providing a lens through which to examine and challenge inequities. In this paper, we propose intersectionality as an intervention into how we conduct BMI research. We begin by discussing intersectionality's history and core principles as they apply to BMI. We then elaborate on the potential for intersectionality to stimulate BMI research. Specifically, we posit that our efforts in BMI to improve health should address intersectionality's five key considerations: (1) systems of privilege and oppression that shape health; (2) the interrelated nature of upstream health drivers; (3) the nuances of health outcomes within groups; (4) the problematic and power-laden nature of categories that we assign to people in research and in society; and (5) research to inform and support social change.
Subject(s)
Medical Informatics, Humans, Medical Informatics/methods, Biomedical Research
ABSTRACT
BACKGROUND: The growing recognition of the microbiome's impact on human health and well-being has prompted extensive research into discovering the links between microbiome dysbiosis and disease (versus healthy) states. However, this valuable information is scattered in unstructured form within the biomedical literature. The structured extraction and qualification of microbe-disease interactions are therefore important. In parallel, recent advancements in deep-learning-based natural language processing algorithms have revolutionized language-related tasks such as ours. This study aims to leverage state-of-the-art deep-learning language models to extract microbe-disease relationships from the biomedical literature. RESULTS: In this study, we first evaluate multiple pre-trained large language models in a zero-shot or few-shot learning context. In this setting, the models performed poorly out of the box, emphasizing the need for domain-specific fine-tuning of these language models. Subsequently, we fine-tune multiple language models (specifically, GPT-3, BioGPT, BioMedLM, BERT, BioMegatron, PubMedBERT, BioClinicalBERT, and BioLinkBERT) using labeled training data and evaluate their performance. Our experimental results demonstrate the state-of-the-art performance of these fine-tuned models (specifically GPT-3, BioMedLM, and BioLinkBERT), achieving an average F1 score, precision, and recall of over [Formula: see text] compared to the previous best of 0.74. CONCLUSION: Overall, this study establishes that pre-trained language models excel as transfer learners when fine-tuned with domain- and problem-specific data, enabling them to achieve state-of-the-art results even with limited training data for extracting microbiome-disease interactions from scientific publications.
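A minimal sketch of this kind of fine-tuning setup, using Hugging Face Transformers to adapt a PubMedBERT-style encoder for sentence-level relation classification; the checkpoint name, label set, and toy training data are assumptions for illustration, not the study's exact configuration.

    from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)
    from datasets import Dataset

    checkpoint = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"  # assumed checkpoint
    labels = ["no_relation", "positive_association", "negative_association"]

    # Toy placeholder training data; real training would use the labeled corpus.
    train_sentences = ["Akkermansia muciniphila abundance is reduced in type 2 diabetes."]
    train_labels = [1]

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=len(labels))

    def tokenize(batch):
        return tokenizer(batch["sentence"], truncation=True, padding="max_length", max_length=256)

    train_ds = Dataset.from_dict({"sentence": train_sentences, "label": train_labels}).map(tokenize, batched=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="md_re_model", num_train_epochs=3,
                               per_device_train_batch_size=16, learning_rate=2e-5),
        train_dataset=train_ds,
    )
    trainer.train()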
Subject(s)
Algorithms, Language, Humans, Natural Language Processing, Health Status, Learning
ABSTRACT
Mechanical ventilation is an essential tool in the management of Acute Respiratory Distress Syndrome (ARDS), but it exposes patients to the risk of ventilator-induced lung injury (VILI). The human lung-ventilator system (LVS) involves the interaction of complex anatomy with a mechanical apparatus, which limits the ability of process-based models to provide individualized clinical support. This work proposes a hypothesis-driven strategy for LVS modeling in which robust personalization is achieved using a pre-defined parameter basis in a non-physiological model. Model inversion, here via windowed data assimilation, forges observed waveforms into interpretable parameter values that characterize the data rather than quantifying physiological processes. Accurate, model-based inference on human-ventilator data indicates model flexibility and utility over a variety of breath types, including those from dyssynchronous LVSs. Estimated parameters generate static characterizations of the data that are 50%-70% more accurate than breath-wise single-compartment model estimates. They also retain sufficient information to distinguish between the types of breath they represent. However, the fidelity and interpretability of model characterizations are tied to parameter definitions and model resolution. These additional factors must be considered in conjunction with the objectives of specific applications, such as identifying and tracking the development of human VILI.
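To make the breath-wise single-compartment baseline mentioned above concrete, the sketch below fits the standard single-compartment equation of motion, P(t) = E*V(t) + R*Q(t) + P0, to one window of waveform data by least squares; the waveforms are synthetic placeholders, and this is not the authors' non-physiological model.

    import numpy as np

    rng = np.random.default_rng(0)
    t = np.linspace(0, 3, 300)                     # one breath, seconds
    flow = np.where(t < 1.0, 0.5, -0.25)           # L/s, square-wave inspiration/expiration
    volume = np.cumsum(flow) * (t[1] - t[0])       # L, numerical integral of flow
    pressure = 25 * volume + 10 * flow + 5 + rng.normal(0, 0.2, t.size)  # cmH2O, synthetic

    # Ordinary least squares for elastance E, resistance R, and baseline pressure P0.
    X = np.column_stack([volume, flow, np.ones_like(t)])
    elastance, resistance, p0 = np.linalg.lstsq(X, pressure, rcond=None)[0]
    print(f"E={elastance:.1f} cmH2O/L, R={resistance:.1f} cmH2O*s/L, P0={p0:.1f} cmH2O")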
Subject(s)
Respiratory Distress Syndrome, Ventilator-Induced Lung Injury, Humans, Artificial Respiration/adverse effects, Respiratory Distress Syndrome/etiology, Mechanical Ventilators, Ventilator-Induced Lung Injury/etiology, Lung
ABSTRACT
Building on previous work to define the scientific discipline of biomedical informatics, we present a framework that categorizes fundamental challenges into groups based on data, information, and knowledge, along with the transitions between these levels. We define each level and argue that the framework provides a basis for separating informatics problems from non-informatics problems and for identifying fundamental challenges in biomedical informatics, and that it offers guidance in the search for general, reusable solutions to informatics problems. We distinguish between processing data (symbols) and processing meaning. Computational systems, which are the basis of modern information technology (IT), process data. In contrast, many important challenges in biomedicine, such as providing clinical decision support, require processing meaning, not data. Biomedical informatics is hard because of the fundamental mismatch between many biomedical problems and the capabilities of current technology.
Subject(s)
Clinical Decision Support Systems, Medical Informatics, Knowledge
ABSTRACT
BACKGROUND: Few-shot learning (FSL) is a class of machine learning methods that require only small numbers of labeled instances for training. With many medical topics having limited annotated text-based data in practical settings, FSL-based natural language processing (NLP) holds substantial promise. We aimed to conduct a review exploring the current state of FSL methods for medical NLP. METHODS: We searched for articles published between January 2016 and October 2022 using PubMed/Medline, Embase, ACL Anthology, and IEEE Xplore Digital Library. We also searched preprint servers (e.g., arXiv, medRxiv, and bioRxiv) via Google Scholar to identify the latest relevant methods. We included all articles that involved FSL and any form of medical text. We abstracted articles based on the data source, target task, training set size, primary method(s)/approach(es), and evaluation metric(s). RESULTS: Fifty-one articles met our inclusion criteria, all published after 2018 and most since 2020 (42/51; 82%). Concept extraction/named entity recognition was the most frequently addressed task (21/51; 41%), followed by text classification (16/51; 31%). Thirty-two (61%) articles reconstructed existing datasets to fit few-shot scenarios, and MIMIC-III was the most frequently used dataset (10/51; 20%). Of the included articles, 77% attempted to incorporate prior knowledge to augment the small datasets available for training. Common methods included FSL with attention mechanisms (20/51; 39%), prototypical networks (11/51; 22%), meta-learning (7/51; 14%), and prompt-based learning methods, the latter being particularly popular since 2021. Benchmarking experiments demonstrated relative underperformance of FSL methods on biomedical NLP tasks. CONCLUSION: Despite the potential for FSL in biomedical NLP, progress has been limited. This may be attributed to the rarity of specialized data, the lack of standardized evaluation criteria, and the underperformance of FSL methods on biomedical topics. The creation of publicly available specialized datasets for biomedical FSL may aid method development by facilitating comparative analyses.
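For readers unfamiliar with one of the FSL families counted above, a toy sketch of the prototypical-network classification rule follows: each class prototype is the mean embedding of its support examples, and a query is assigned to the nearest prototype. The random embeddings stand in for encoder outputs and are purely illustrative.

    import numpy as np

    rng = np.random.default_rng(1)
    n_classes, n_support, dim = 3, 5, 64
    support = rng.normal(size=(n_classes, n_support, dim))   # [class, shot, embedding]
    query = rng.normal(size=(dim,))

    prototypes = support.mean(axis=1)                        # one prototype per class
    dists = np.linalg.norm(prototypes - query, axis=1)       # Euclidean distance to each prototype
    probs = np.exp(-dists) / np.exp(-dists).sum()            # softmax over negative distances
    predicted_class = int(np.argmin(dists))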
Subject(s)
Machine Learning, Natural Language Processing, PubMed, MEDLINE, Publications
ABSTRACT
BACKGROUND: Testicular sperm extraction (TESE) is an essential therapeutic tool for the management of male infertility. However, it is an invasive procedure with a success rate of up to 50%. To date, no model based on clinical and laboratory parameters is sufficiently powerful to accurately predict the success of sperm retrieval in TESE. OBJECTIVE: The aim of this study is to compare a wide range of predictive models under similar conditions for TESE outcomes in patients with nonobstructive azoospermia (NOA), in order to identify the most suitable mathematical approach, the most appropriate study size, and the relevance of the input biomarkers. METHODS: We analyzed 201 patients who underwent TESE at Tenon Hospital (Assistance Publique-Hôpitaux de Paris, Sorbonne University, Paris), distributed into a retrospective training cohort of 175 patients (January 2012 to April 2021) and a prospective testing cohort of 26 patients (May 2021 to December 2021). Preoperative data (according to the French standard exploration of male infertility, 16 variables), including urogenital history, hormonal data, and genetic data, as well as TESE outcomes (the target variable), were collected. A TESE was considered positive if we obtained sufficient spermatozoa for intracytoplasmic sperm injection. After preprocessing the raw data, 8 machine learning (ML) models were trained and optimized on the retrospective training cohort data set; hyperparameter tuning was performed by random search. Finally, the prospective testing cohort data set was used for model evaluation. The metrics used to evaluate and compare the models were sensitivity, specificity, area under the receiver operating characteristic curve (AUC-ROC), and accuracy. The importance of each variable in the model was assessed using the permutation feature importance technique, and the optimal number of patients to include in the study was assessed using the learning curve. RESULTS: The ensemble models, based on decision trees, showed the best performance, especially the random forest model, which yielded the following results: AUC=0.90, sensitivity=100%, and specificity=69.2%. A study size of 120 patients seemed sufficient to properly exploit the preoperative data in the modeling process, since increasing the number of patients beyond 120 during model training did not bring any performance improvement. In addition, inhibin B and a history of varicoceles exhibited the highest predictive capacity. CONCLUSIONS: An ML algorithm based on an appropriate approach can predict successful sperm retrieval in men with NOA undergoing TESE, with promising performance. However, although this study represents the first step of this process, a subsequent formal prospective multicentric validation study should be undertaken before any clinical application. As future work, we consider the use of recent and clinically relevant data sets (including seminal plasma biomarkers, especially noncoding RNAs, as markers of residual spermatogenesis in NOA patients) to improve our results even further.
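A minimal scikit-learn sketch of the main ingredients reported for the best model (random forest, AUC-ROC evaluation, permutation feature importance); the data here are synthetic placeholders, not the Tenon Hospital cohort, and the hyperparameters are illustrative.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # 16 features stand in for the 16 preoperative variables.
    X, y = make_classification(n_samples=200, n_features=16, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_train, y_train)
    auc = roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])
    importances = permutation_importance(rf, X_test, y_test, n_repeats=20, random_state=0)
    print(f"AUC = {auc:.2f}")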
Subject(s)
Azoospermia, Male Infertility, Humans, Male, Azoospermia/diagnosis, Azoospermia/therapy, Semen, Retrospective Studies, Prospective Studies, Spermatozoa, Algorithms
ABSTRACT
BACKGROUND: Data transformations are commonly used in bioinformatics data processing in the context of data projection and clustering. The most commonly used Euclidean metric is not scale invariant and is therefore occasionally inappropriate for complex, e.g., multimodally distributed, variables and may negatively affect the results of cluster analysis. Specifically, the squaring function in the definition of the Euclidean distance as the square root of the sum of squared differences between data points has the consequence that the value 1 implicitly defines a limit for distances within clusters versus distances between (inter-) clusters. METHODS: The Euclidean distances within a standard normal distribution (N(0,1)) follow a N(0,[Formula: see text]) distribution. The EDO transformation of a variable X is proposed as [Formula: see text], following modeling of the standard deviation s by a mixture of Gaussians and selection of the dominant modes via item categorization. The method was compared in artificial and biomedical datasets with clustering of untransformed data, z-transformed data, and the recently proposed pooled variable scaling. RESULTS: A simulation study and applications to known real data examples showed that the proposed EDO scaling method is generally useful. The clustering results in terms of cluster accuracy, adjusted Rand index, and Dunn's index outperformed the classical alternatives. Finally, the EDO transformation was applied to cluster a high-dimensional genomic dataset consisting of gene expression data for multiple samples of breast cancer tissue, and the proposed approach gave better results than the classical methods and was also compared with pooled variable scaling. CONCLUSIONS: For multivariate procedures of data analysis, the EDO transformation is proposed as a better alternative to the established z-standardization, especially for nontrivially distributed data. The "EDOtrans" R package is available at https://cran.r-project.org/package=EDOtrans.
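The published EDO formula is not reproduced above (it appears as "[Formula: see text]"), so the sketch below only mimics the idea described in words, estimating a dominant-mode scale with a Gaussian mixture and using it instead of the overall standard deviation; it is an approximation for intuition, not the EDOtrans implementation.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def dominant_mode_scale(x: np.ndarray, max_modes: int = 3) -> float:
        """Return the standard deviation of the heaviest Gaussian mixture component."""
        gmm = GaussianMixture(n_components=max_modes, random_state=0).fit(x.reshape(-1, 1))
        k = int(np.argmax(gmm.weights_))                  # dominant mode
        return float(np.sqrt(gmm.covariances_[k, 0, 0]))

    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(0, 1, 800), rng.normal(8, 1, 200)])  # bimodal variable
    x_z = (x - x.mean()) / x.std()                        # classical z-standardization
    x_mode_scaled = x / dominant_mode_scale(x)            # mode-based rescaling (illustrative)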
Subject(s)
Algorithms, Computational Biology, Cluster Analysis, Genomics, Normal Distribution
ABSTRACT
Owing to the rapid development of computer technologies, increasing amounts of relational data have been emerging in modern biomedical research. Many network-based learning methods have been proposed to analyze such data, providing a deep understanding of the topology and knowledge behind biomedical networks and benefiting many applications in human healthcare. However, most network-based methods suffer from high computational and space costs, and challenges remain in handling the high dimensionality and sparsity of biomedical networks. The latest advances in network embedding technologies provide new, effective paradigms for network analysis: a network is converted into a low-dimensional space while its structural properties are maximally preserved. In this way, downstream tasks such as link prediction and node classification can be performed by traditional machine learning methods. In this survey, we conduct a comprehensive review of the literature on applying network embedding to advance the biomedical domain. We first briefly introduce the widely used network embedding models. After that, we carefully discuss how network embedding approaches have been applied to biomedical networks and how they have accelerated downstream tasks in biomedical science. Finally, we discuss the challenges faced by existing network embedding applications in biomedical domains and suggest several promising future directions for improving human healthcare.
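As a compact illustration of one widely used embedding family (DeepWalk-style random walks plus skip-gram), the sketch below learns node vectors on a toy graph standing in for a biomedical network; the graph, walk settings, and dimensions are placeholders.

    import random
    import networkx as nx
    from gensim.models import Word2Vec

    g = nx.karate_club_graph()        # toy graph standing in for a biomedical network
    random.seed(0)

    def random_walk(graph, start, length=20):
        walk = [start]
        for _ in range(length - 1):
            neighbors = list(graph.neighbors(walk[-1]))
            if not neighbors:
                break
            walk.append(random.choice(neighbors))
        return [str(n) for n in walk]

    # 10 truncated walks per node feed a skip-gram model that learns node embeddings.
    walks = [random_walk(g, node) for node in g.nodes() for _ in range(10)]
    model = Word2Vec(walks, vector_size=32, window=5, min_count=0, sg=1, epochs=5)
    embedding_of_node_0 = model.wv["0"]   # 32-dimensional vector usable for link prediction, etc.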
ABSTRACT
We describe the design, implementation, and impact of a data harmonization, data quality checking, and dynamic report generation application in an international observational HIV research network. The IeDEA Harmonist Data Toolkit is a web-based application written in the open-source programming language R; it employs the R/Shiny and RMarkdown packages and leverages the REDCap data collection platform for data model definition and user authentication. The Toolkit performs data quality checks on uploaded datasets, checks for conformance with the network's common data model, displays the results both interactively and in downloadable reports, and stores approved datasets in secure cloud storage for retrieval by the requesting investigator. Including stakeholders and users in the design process was key to the successful adoption of the application. A survey of regional data managers, together with initial usage metrics, indicates that the Toolkit saves time and improves data quality, with a 61% mean reduction in the number of error records in a dataset. The generalized application design allows the Toolkit to be easily adapted to other research networks.
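For illustration of the kind of conformance checking described above, here is a generic sketch in Python; the actual Toolkit is an R/Shiny application, and the rules and column names below are hypothetical placeholders, not the IeDEA common data model.

    import pandas as pd

    # Hypothetical rule set: required columns, allowed categorical values, numeric ranges.
    RULES = {
        "patient_id": {"required": True},
        "sex": {"required": True, "allowed": {"Male", "Female", "Other", "Unknown"}},
        "cd4_count": {"required": False, "min": 0, "max": 5000},
    }

    def check_dataset(df: pd.DataFrame) -> list:
        """Return a list of human-readable data quality errors."""
        errors = []
        for col, rule in RULES.items():
            if col not in df.columns:
                if rule.get("required"):
                    errors.append(f"missing required column: {col}")
                continue
            if "allowed" in rule:
                bad = df.loc[~df[col].isin(rule["allowed"]), col].unique()
                errors.extend(f"{col}: invalid value '{v}'" for v in bad)
            if "min" in rule:
                vals = df[col].dropna()
                out_of_range = vals[(vals < rule["min"]) | (vals > rule["max"])]
                errors.extend(f"{col}: out-of-range value {v}" for v in out_of_range)
        return errors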
Subject(s)
Data Accuracy, HIV Infections, Data Collection, Humans, Information Dissemination, Software
ABSTRACT
Significant technological advances made in recent years have shepherded a dramatic increase in the utilization of digital technologies for biomedicine: everything from the widespread use of electronic health records to improved medical imaging capabilities and the rising ubiquity of genomic sequencing contributes to a "digitization" of biomedical research and clinical care. With this shift toward computerized tools comes a dramatic increase in the amount of available data, and current tools for data analysis capable of extracting meaningful knowledge from this wealth of information have yet to catch up. This article provides an overview of emerging mathematical methods that have the potential to improve the ability of clinicians and researchers to analyze biomedical data but that may be held back by a lack of conceptual accessibility and awareness in the life sciences research community. In particular, we focus on topological data analysis (TDA), a set of methods grounded in the mathematical field of algebraic topology that seeks to describe and harness features related to the "shape" of data. We aim to make such techniques more approachable to non-mathematicians by providing a conceptual discussion of their theoretical foundations, followed by a survey of their published applications to scientific research. Finally, we discuss the limitations of these methods and suggest potential avenues for future work integrating mathematical tools into clinical care and biomedical informatics.
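A toy sketch of one TDA idea, 0-dimensional persistent homology (connected components): the heights at which single-linkage clusters merge equal the "death" scales of components in a Vietoris-Rips filtration, so a dendrogram's merge heights give a simple persistence summary. The data and library choice here are illustrative, not taken from the article.

    import numpy as np
    from scipy.cluster.hierarchy import linkage

    rng = np.random.default_rng(0)
    cluster_a = rng.normal(loc=(0, 0), scale=0.3, size=(30, 2))
    cluster_b = rng.normal(loc=(5, 5), scale=0.3, size=(30, 2))
    points = np.vstack([cluster_a, cluster_b])

    # Column 2 of the linkage matrix holds merge distances = component death scales.
    merge_heights = linkage(points, method="single")[:, 2]
    # Two well-separated clusters show up as one long-lived component "bar":
    print("largest gap between consecutive deaths:", np.max(np.diff(np.sort(merge_heights))))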
Subject(s)
Data Analysis, Diagnostic Imaging
ABSTRACT
Scholars from the health and medical sciences have recently proposed the term social informatics (SI) as a new scientific subfield of health informatics (HI). However, SI is not a new academic concept; in fact, it has been used continuously in the social sciences and informatics since the 1970s. Although the dominant understanding of SI was established in the 1990s in the United States, a rich international perspective on SI has existed since the 1970s in other regions of the world. When that perspective is considered, these understandings can be structured into 7 SI schools of thought. Against that conceptual background, this paper contributes to the discussion on the relationship between SI and HI, outlining possible perspectives of SI that are associated with health, medical, and clinical aspects. The paper argues against the proliferation and inconsistent use of the term SI as it is newly applied in the health and medical sciences. A more explicit name for the area that uses health and social data to advance individual and population health might help overcome this issue; giving this new field an identity of its own would help it be understood more precisely and distinguish it more clearly from existing fields. Such labeling could be fruitful for further segmentation of HI, which is rapidly expanding.
Subject(s)
Medical Informatics, Humans, Internationality, United States
ABSTRACT
BACKGROUND: Opioid use disorder (OUD) has become an urgent health problem. People with OUD often experience comorbid medical conditions. Systematic approaches to identifying co-occurring conditions of OUD can facilitate a deeper understanding of OUD mechanisms and drug discovery. This study presents an integrated approach combining data mining, network construction and ranking, and hypothesis-driven case-control studies using patient electronic health records (EHRs). METHODS: First, we mined comorbidities from 12 million unique case reports in the US Food and Drug Administration Adverse Event Reporting System (FAERS) using the frequent pattern growth (FP-growth) algorithm. The performance of OUD comorbidity mining was measured by precision and recall against manually curated known OUD comorbidities. We then constructed a disease comorbidity network (DCN) using the mined association rules and further prioritized OUD comorbidities. Last, novel OUD comorbidities were independently tested using EHRs of 75 million unique patients. RESULTS: The OUD comorbidities from association rule mining achieved a precision of 38.7% and a recall of 78.2%. Based on the mined rules, the global DCN was constructed with 1916 nodes and 32,175 edges. The network-based OUD ranking showed that 43 of 55 known OUD comorbidities were in the first decile, for a precision of 78.2%. Hypothyroidism and type 2 diabetes were the two top-ranked novel OUD comorbidities identified by the data mining and network ranking algorithms. In EHR-based case-control studies, we showed that patients with OUD had significantly increased risk for hyperthyroidism (adjusted odds ratio [AOR] = 1.46, 95% CI 1.43-1.49, p value < 0.001), hypothyroidism (AOR = 1.45, 95% CI 1.42-1.48, p value < 0.001), and type 2 diabetes (AOR = 1.28, 95% CI 1.26-1.29, p value < 0.001) compared with individuals without OUD. CONCLUSION: Our study developed an integrated approach for identifying and validating novel OUD comorbidities from health records of 87 million unique patients (12 million for discovery and 75 million for validation), which can offer new opportunities for OUD mechanism understanding, drug discovery, and multi-component service delivery for co-occurring medical conditions among patients with OUD.
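A small sketch of comorbidity rule mining with FP-growth, using the mlxtend library for illustration (not necessarily the study's implementation); rows represent case reports, columns are one-hot diagnosis indicators, and the diagnoses are toy placeholders.

    import pandas as pd
    from mlxtend.frequent_patterns import fpgrowth, association_rules

    reports = pd.DataFrame(
        [[1, 1, 0, 1], [1, 1, 1, 0], [1, 0, 1, 1], [0, 1, 0, 1], [1, 1, 0, 0]],
        columns=["OUD", "hypothyroidism", "type_2_diabetes", "chronic_pain"],
    ).astype(bool)

    # Frequent itemsets, then association rules whose antecedents include OUD.
    itemsets = fpgrowth(reports, min_support=0.4, use_colnames=True)
    rules = association_rules(itemsets, metric="confidence", min_threshold=0.6)
    oud_rules = rules[rules["antecedents"].apply(lambda s: "OUD" in s)]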
Subject(s)
Type 2 Diabetes Mellitus, Hypothyroidism, Opioid-Related Disorders, Comorbidity, Type 2 Diabetes Mellitus/epidemiology, Electronic Health Records, Humans, Hypothyroidism/complications, Opioid-Related Disorders/epidemiology, United States/epidemiology, United States Food and Drug Administration
ABSTRACT
The Future of Nursing 2020-2030 report explicitly addresses the need for integration of nursing expertise in designing, generating, analyzing, and applying data to support initiatives focused on social determinants of health (SDOH) and health equity. The metrics necessary to enable and evaluate progress on all recommendations require harnessing existing data sources and developing new ones, as well as transforming and integrating data into information systems to facilitate communication, information sharing, and decision making among the key stakeholders. We examine the recommendations of the 2021 report through an interdisciplinary lens that integrates nursing, biomedical informatics, and data science by addressing three critical questions: (a) What data are needed? (b) What infrastructure and processes are needed to transform data into information? (c) What information systems are needed to "level up" nurse-led interventions from the micro-level to the meso- and macro-levels to address social determinants of health and advance health equity?
Subject(s)
Data Science, Health Equity, Humans, Informatics, Information Dissemination, Social Determinants of Health
ABSTRACT
Through his visionary leadership as Director of the U.S. National Library of Medicine (NLM), Donald A. B. Lindberg M.D. influenced future generations of informatics professionals and the field of biomedical informatics itself. This chapter describes Dr. Lindberg's role in sponsoring and shaping the NLM's Institutional T15 training programs.
ABSTRACT
The US National Library of Medicine's Biomedical Informatics Short Course ran from 1992 to 2017, most of that time at the Marine Biological Laboratory in Woods Hole, Massachusetts. Its intention was to provide physicians, medical librarians and others engaged in health care with a basic understanding of the major topics in informatics so that they could return to their home institutions as "change agents". Over the years, the course provided week-long, intense, morning-to-night experiences for some 1,350 students, consisting of lectures and hands-on project development, taught by many luminaries in the field, not the least of which was Donald A.B. Lindberg M.D., who spoke on topics ranging from bioinformatics to national policy.
ABSTRACT
The U.S. National Library of Medicine's (NLM) funding for biomedical informatics research in the 1980s and 1990s focused on clinical decision support systems, which were also the focus of research for Donald A.B. Lindberg M.D. prior to his becoming NLM's director. The portfolio of projects expanded over the years. At NLM, Dr. Lindberg supported various large infrastructure programs that enabled biomedical informatics research, as well as investigator-initiated research projects that increasingly included biotechnology/bioinformatics and health services research. The authors review NLM's sponsorship of research during Dr. Lindberg's tenure as its Director. NLM's funding increased significantly in the 2000s and beyond. The authors report an analysis of R01 topics from 1985 to 2016 using data from NIH RePORTER. Dr. Lindberg's legacy for biomedical informatics research is reflected in the research NLM supported under his leadership. The number of R01s remained steady over the years, but the funds provided within awards increased over time. A significant amount of the NLM funds listed in RePORTER went into various types of infrastructure projects that laid a solid foundation for biomedical informatics research over multiple decades.
ABSTRACT
This overview summary of the Informatics Section of the book Transforming Biomedical Informatics and Health Information Access: Don Lindberg and the U.S. National Library of Medicine illustrates how the NLM revolutionized the field of biomedical and health informatics during Lindberg's term as NLM Director. The authors present a before-and-after perspective of what changed, how it changed, and the impact of those changes.
ABSTRACT
Gliomas are the most common neuroepithelial brain tumors. The modern classification of tumors of the central nervous system and treatment approaches are based on the tissue and molecular features of a particular neoplasm. Today, histological and molecular genetic typing of tumors can only be carried out through invasive procedures. In this regard, non-invasive preoperative diagnosis is highly valued in neuro-oncology. One promising area is artificial intelligence applied to neuroimaging to identify significant patterns that are associated with the histological and molecular profiles of tumors but are not obvious to a specialist. OBJECTIVE: To evaluate the diagnostic accuracy of deep learning methods for glioma typing according to the 2007 WHO classification based on preoperative magnetic resonance imaging (MRI) data. MATERIAL AND METHODS: The study included MR scans of patients with glial tumors undergoing neurosurgical treatment at the Burdenko National Medical Research Center for Neurosurgery. All patients underwent preoperative contrast-enhanced MRI. 2D and 3D MR scans were used to train artificial neural networks with two architectures (ResNeSt200e and DenseNet, respectively) to classify tumors into 4 categories (WHO grades I-IV). Training was performed on a random 80% of examinations. Classification quality metrics were evaluated on the remaining 20% of examinations (validation and test samples). RESULTS: The analysis included 707 contrast-enhanced T1-weighted images. 3D classification based on the DenseNet model showed the best result in predicting WHO tumor grade (accuracy 83%, AUC 0.95). Other authors have reported similar results for other methods. CONCLUSION: The first results of our study confirmed the fundamental possibility of grading axial contrast-enhanced T1-weighted images according to the 2007 WHO classification using deep learning models.
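A simplified 2D illustration of this classification setup using PyTorch/torchvision (the study's best model was a 3D DenseNet; the 2D densenet121 here only sketches the 4-class WHO-grade head, and the input tensors are random placeholders rather than MRI data).

    import torch
    from torch import nn
    from torchvision.models import densenet121

    model = densenet121(weights=None, num_classes=4)      # WHO grades I-IV
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    images = torch.randn(8, 3, 224, 224)                  # stand-in for contrast-enhanced T1-weighted slices
    grades = torch.randint(0, 4, (8,))                    # stand-in WHO-grade labels

    # One illustrative training step.
    logits = model(images)
    loss = criterion(logits, grades)
    loss.backward()
    optimizer.step()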