Results 1 - 20 of 66
1.
medRxiv ; 2024 Aug 13.
Article in English | MEDLINE | ID: mdl-39185523

ABSTRACT

Objectives: We invited inexperienced clinical researchers to analyze coded health datasets and develop hypotheses. We recorded and analyzed their hypothesis generation process. All the hypotheses generated in the process were rated by the same group of seven experts by using the same metrics. This case study examines the higher quality (i.e., higher ratings) and lower quality of hypotheses and participants who generated them. We characterized the contextual factors associated with the quality of hypotheses. Methods: All participants (i.e., clinical researchers) completed a 2-hour study session to analyze data and generate scientific hypotheses using the think-aloud method. Participants' screen activity and audio were recorded and transcribed. These transcriptions were used to measure the time used to generate each hypothesis and to code cognitive events (i.e., cognitive activities used when generating hypotheses, for example, "Seeking for Connection" describes an attempt to draw connections between data points). The hypothesis ratings by the expert panel were used as the quality of the hypotheses during the analysis. We analyzed the factors associated with (1) the five highest and (2) five lowest rated hypotheses and (3) the participants who generated them, including the number of hypotheses per participant, the validity of those hypotheses, the number of cognitive events used for each hypothesis, as well as the participant's research experience and basic demographics. Results: Participants who generated the five highest-rated hypotheses used similar lengths of time (difference 3:03), whereas those who generated the five lowest-rated hypotheses used more varying lengths of time (difference 7:13). Participants who generated the five highest-rated hypotheses also utilized slightly fewer cognitive events on average compared to the five lowest-rated hypotheses (4 per hypothesis vs. 4.8 per hypothesis). 
When we examined the participants who generated the five highest-rated and five lowest-rated hypotheses, together with all the hypotheses they generated during the 2-hour study sessions, the participants with the five highest-rated hypotheses again had a shorter range of time per hypothesis on average (0:03:34 vs. 0:07:17). These participants also used fewer cognitive events per hypothesis (3.498 vs. 4.626), had a higher valid rate (75.51% vs. 63.63%), and generally had more experience with clinical research. Conclusion: The quality of the hypotheses was shown to be associated with the time taken to generate them: taking too long or too short a time to generate a hypothesis appears to be negatively associated with its quality rating. Also, having more experience seems to correlate positively with higher hypothesis ratings and higher valid rates. Validity is a quality dimension used by the expert panel during rating. However, we acknowledge that our results are anecdotal. The effect may not be simply linear, and future research is necessary. These results underscore the multi-factor nature of hypothesis generation.

2.
Med Res Arch ; 12(2)2024 Feb.
Article in English | MEDLINE | ID: mdl-39211055

ABSTRACT

Hypothesis generation is an early and critical step in any hypothesis-driven clinical research project. Because it is not yet a well-understood cognitive process, the need to improve the process goes unrecognized. Without an impactful hypothesis, the significance of any research project can be questionable, regardless of the rigor or diligence applied in other steps of the study, e.g., study design, data collection, and result analysis. In this perspective article, the authors first provide a literature review on the following topics: scientific thinking, reasoning, medical reasoning, literature-based discovery, and a field study to explore scientific thinking and discovery. Over the years, scientific thinking has shown excellent progress in cognitive science and its applied areas: education, medicine, and biomedical research. However, a review of the literature reveals the lack of original studies on hypothesis generation in clinical research. The authors then summarize their first human participant study exploring data-driven hypothesis generation by clinical researchers in a simulated setting. The results indicate that a secondary data analytical tool, VIADS (a visual interactive analytic tool for filtering, summarizing, and visualizing large health data sets coded with hierarchical terminologies), can shorten the time participants need, on average, to generate a hypothesis and also requires fewer cognitive events to generate each hypothesis. As a counterpoint, this exploration also indicates that the hypotheses thus generated receive significantly lower ratings for feasibility when applying VIADS. Despite its small scale, the study confirmed the feasibility of conducting a human participant study directly to explore the hypothesis generation process in clinical research.
This study provides supporting evidence to conduct a larger-scale study with a specifically designed tool to facilitate the hypothesis-generation process among inexperienced clinical researchers. A larger study could provide generalizable evidence, which in turn can potentially improve clinical research productivity and overall clinical research enterprise.

3.
Sci Rep ; 14(1): 17528, 2024 07 30.
Article in English | MEDLINE | ID: mdl-39080444

ABSTRACT

HistoLens is an open-source graphical user interface developed using MATLAB AppDesigner for visual and quantitative analysis of histological datasets. HistoLens enables users to interrogate sets of digitally annotated whole slide images to efficiently characterize histological differences between disease and experimental groups. Users can dynamically visualize the distribution of 448 hand-engineered features quantifying color, texture, morphology, and distribution across microanatomic sub-compartments. Additionally, users can map differentially detected image features within the images by highlighting affected regions. We demonstrate the utility of HistoLens to identify hand-engineered features that correlate with pathognomonic renal glomerular characteristics distinguishing diabetic nephropathy and amyloid nephropathy from the histologically unremarkable glomeruli in minimal change disease. Additionally, we examine the use of HistoLens for glomerular feature discovery in the Tg26 mouse model of HIV-associated nephropathy. We identify numerous quantitative glomerular features distinguishing Tg26 transgenic mice from wild-type mice, corresponding to a progressive renal disease phenotype. Thus, we demonstrate an off-the-shelf and ready-to-use toolkit for quantitative renal pathology applications.


Subject(s)
Mice, Transgenic , Animals , Mice , Kidney Glomerulus/pathology , Kidney/pathology , Kidney Diseases/pathology , Disease Models, Animal , Diabetic Nephropathies/pathology , Humans , Image Processing, Computer-Assisted/methods
4.
BMC Bioinformatics ; 25(1): 213, 2024 Jun 13.
Article in English | MEDLINE | ID: mdl-38872097

ABSTRACT

BACKGROUND: Automated hypothesis generation (HG) focuses on uncovering hidden connections within the extensive information that is publicly available. This domain has become increasingly popular, thanks to modern machine learning algorithms. However, the automated evaluation of HG systems is still an open problem, especially on a larger scale. RESULTS: This paper presents Dyport, a novel benchmarking framework for evaluating biomedical hypothesis generation systems. Utilizing curated datasets, our approach tests these systems under realistic conditions, enhancing the relevance of our evaluations. We integrate knowledge from the curated databases into a dynamic graph, accompanied by a method to quantify discovery importance. This assesses not only the accuracy of hypotheses but also their potential impact in biomedical research, which significantly extends traditional link prediction benchmarks. The applicability of our benchmarking process is demonstrated on several link prediction systems applied to biomedical semantic knowledge graphs. Being flexible, our benchmarking system is designed for broad application in hypothesis generation quality verification, aiming to expand the scope of scientific discovery within the biomedical research community. CONCLUSIONS: Dyport is an open-source benchmarking framework designed for biomedical hypothesis generation systems evaluation, which takes into account knowledge dynamics, semantics and impact. All code and datasets are available at: https://github.com/IlyaTyagin/Dyport .


Subject(s)
Benchmarking , Benchmarking/methods , Algorithms , Biomedical Research/methods , Software , Machine Learning , Databases, Factual , Computational Biology/methods , Semantics
5.
Pharmacoepidemiol Drug Saf ; 33(3): e5765, 2024 Mar.
Article in English | MEDLINE | ID: mdl-38453354

ABSTRACT

PURPOSE: We develop an open-source R package to implement tree-based scan statistics (TBSS) analyses. METHODS: TBSS are data mining methods used by the United States Food and Drug Administration and the Centers for Disease Control. They simultaneously screen thousands of hierarchically aggregated outcomes to identify unsuspected adverse effects of drugs or vaccines, accounting for multiple comparisons. The general structure of TBSS is highly adaptable, with four essential components: (1) a hierarchical outcome structure, (2) a test statistic to be computed for each element of the hierarchy, (3) an algorithm to generate data replicates under a null distribution, and (4) observed outcomes at the lower level of the hierarchy. We encode the general TBSS framework in a convenient R package that offers user-friendly functions for the most used TBSS methods. To illustrate the performance of our software, we evaluated two examples of archetypical TBSS analyses previously analyzed using proprietary, closed-source TreeScan™ software. The first considers the risk of congenital malformations associated with first-trimester exposure to valproate, and the second compares exposure to newly prescribed canagliflozin with a dipeptidyl peptidase 4 inhibitor in adults affected by type 2 diabetes. RESULTS: The results of the original studies are replicated. CONCLUSIONS: The diffusion of an open-source implementation of TBSS can enhance innovation of TBSS methods and foster collaborations. We offer an intuitive R package implementing standard TBSS methods with accompanying tutorials. Our unified object-oriented implementation allows expert users to extend the framework, introduce new features, or enhance existing ones.
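The four TBSS components listed in the abstract above can be illustrated with a toy conditional-Poisson scan. This is only a minimal sketch written in Python, not the R package's interface: the hierarchy, the node names, the counts, and every function name below are hypothetical, and the expected counts are assumed to be scaled so that they sum to the observed total.

```python
import math
import random

# Component 1: a hypothetical toy hierarchy, child -> parent (None marks the root).
PARENT = {"root": None, "cardiac": "root", "neural": "root",
          "vsd": "cardiac", "asd": "cardiac", "spina_bifida": "neural"}

def aggregate(leaf_counts):
    """Sum leaf-level counts (component 4) up to every ancestor node."""
    totals = {node: 0.0 for node in PARENT}
    for leaf, count in leaf_counts.items():
        node = leaf
        while node is not None:
            totals[node] += count
            node = PARENT[node]
    return totals

def poisson_llr(observed, expected, total):
    """Component 2: log-likelihood ratio for an excess of cases in one node."""
    if observed <= expected or expected <= 0:
        return 0.0
    llr = observed * math.log(observed / expected)
    rest_obs, rest_exp = total - observed, total - expected
    if rest_obs > 0:
        llr += rest_obs * math.log(rest_obs / rest_exp)
    return llr

def max_llr(leaf_counts, leaf_expected):
    """Return the node with the largest test statistic and that statistic."""
    obs, exp = aggregate(leaf_counts), aggregate(leaf_expected)
    total = obs["root"]
    scores = {n: poisson_llr(obs[n], exp[n], total) for n in PARENT}
    best = max(scores, key=scores.get)
    return best, scores[best]

def scan(leaf_counts, leaf_expected, replicates=999, seed=0):
    """Component 3: Monte Carlo replicates under the null give a p-value
    that accounts for testing every node of the tree at once."""
    rng = random.Random(seed)
    node, observed_stat = max_llr(leaf_counts, leaf_expected)
    leaves = list(leaf_expected)
    weights = [leaf_expected[leaf] for leaf in leaves]
    total = int(sum(leaf_counts.values()))
    exceed = 0
    for _ in range(replicates):
        simulated = {leaf: 0 for leaf in leaves}
        for leaf in rng.choices(leaves, weights=weights, k=total):
            simulated[leaf] += 1
        if max_llr(simulated, leaf_expected)[1] >= observed_stat:
            exceed += 1
    return node, observed_stat, (exceed + 1) / (replicates + 1)

# Invented counts for illustration only; they are not data from the study.
observed = {"vsd": 30, "asd": 10, "spina_bifida": 10}
expected = {"vsd": 15, "asd": 15, "spina_bifida": 20}
signal_node, statistic, p_value = scan(observed, expected)
```

Taking the maximum statistic over all nodes in each replicate is what adjusts the p-value for the multiple comparisons mentioned in the abstract.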


Subject(s)
Diabetes Mellitus, Type 2 , Vaccines , Adult , Humans , Diabetes Mellitus, Type 2/drug therapy , Diabetes Mellitus, Type 2/epidemiology , Software , Algorithms , Hypoglycemic Agents
6.
J Clin Transl Sci ; 8(1): e13, 2024.
Article in English | MEDLINE | ID: mdl-38384898

ABSTRACT

Objectives: To compare how clinical researchers generate data-driven hypotheses with a visual interactive analytic tool (VIADS, a visual interactive analysis tool for filtering and summarizing large datasets coded with hierarchical terminologies) or other tools. Methods: We recruited clinical researchers and separated them into "experienced" and "inexperienced" groups. Participants were randomly assigned to a VIADS or control group within the groups. Each participant conducted a remote 2-hour study session for hypothesis generation with the same study facilitator on the same datasets by following a think-aloud protocol. Screen activities and audio were recorded, transcribed, coded, and analyzed. Hypotheses were evaluated by seven experts on their validity, significance, and feasibility. We conducted multilevel random effect modeling for statistical tests. Results: Eighteen participants generated 227 hypotheses, of which 147 (65%) were valid. The VIADS and control groups generated a similar number of hypotheses. The VIADS group took a significantly shorter time to generate one hypothesis (e.g., among inexperienced clinical researchers, 258 s versus 379 s, p = 0.046, power = 0.437, ICC = 0.15). The VIADS group received significantly lower ratings than the control group on feasibility and the combination rating of validity, significance, and feasibility. Conclusion: The role of VIADS in hypothesis generation seems inconclusive. The VIADS group took a significantly shorter time to generate each hypothesis. However, the combined validity, significance, and feasibility ratings of their hypotheses were significantly lower. Further characterization of hypotheses, including specifics on how they might be improved, could guide future tool development.

7.
Cogn Neuropsychiatry ; 29(1): 10-28, 2024 01.
Article in English | MEDLINE | ID: mdl-38348821

ABSTRACT

INTRODUCTION: Koro is a delusion whereby a man believes his penis is shrinking into his abdomen and this may result in his death. This socially-transmitted non-neuropsychological delusional belief occurs (in epidemic form) in South-East and South Asia. We investigated whether the two-factor theory of delusion could be applied to epidemic Koro. METHODS: We scrutinised the literature on epidemic Koro to isolate features relevant to the two questions that must be answered to provide a two-factor account: What could initially prompt the Koro delusional hypothesis? Why is this hypothesis adopted as a belief? RESULTS: We concluded that the Koro hypothesis is usually prompted by the surprising observation of actual penis shrinkage, but only if the man has access to background beliefs about Koro. Whether the hypothesis is then adopted as a belief will depend on individual factors such as prior belief in the Koro concept or limited formal education and sociocultural factors such as deference to culture, to media, or to rumours spread by word of mouth. Social transmission can influence how the first factor works and how the second factor works. CONCLUSION: The two-factor theory of delusion can be applied to a socially-transmitted delusion that occurs in epidemic form.


Subject(s)
Koro , Male , Humans , Koro/epidemiology , Koro/psychology , Delusions/psychology
8.
J Biomed Inform ; 151: 104607, 2024 03.
Article in English | MEDLINE | ID: mdl-38360080

ABSTRACT

OBJECTIVES: Hypothesis Generation (HG) is a task that aims to uncover hidden associations between disjoint scientific terms, which influences innovations in prevention, treatment, and overall public health. Several recent studies strive to use Recurrent Neural Networks (RNNs) to learn evolutional embeddings for HG. However, the complex spatiotemporal dependencies of term-pair relations are difficult to depict due to the inherent recurrent structure. This paper aims to accurately model the temporal evolution of term-pair relations using only attention mechanisms, capturing crucial information for inferring future connectivity. METHODS: This paper proposes Temporal Attention Networks (TAN) to produce powerful spatiotemporal embeddings for Biomedical Hypothesis Generation. Specifically, we formulate the HG problem as a future connectivity prediction task in a temporal attributed graph. Our TAN develops a Temporal Spatial Attention Module (TSAM) to establish temporal dependencies of node-pair (term-pair) embeddings between any two time-steps for smoothing spatiotemporal node-pair embeddings. Meanwhile, a Temporal Difference Attention Module (TDAM) is proposed to sharpen temporal differences of spatiotemporal embeddings for highlighting the historical changes of node-pair relations. As such, TAN can adaptively calibrate spatiotemporal embeddings by considering both continuity and difference of node-pair embeddings. RESULTS: Three real-world biomedical term relationship datasets are constructed from PubMed papers. TAN significantly outperforms the best baseline with 12.03%, 4.59% and 2.34% Micro-F1 Score improvement in Immunotherapy, Virology and Neurology, respectively. Extensive experiments demonstrate that TAN can model complex spatiotemporal dependencies of term-pairs for explicitly capturing the temporal evolution of relation, significantly outperforming existing state-of-the-art methods.
CONCLUSION: We proposed a novel TAN to learn spatiotemporal embeddings based on pure attention mechanisms for HG. TAN learns the evolution of relationships by modeling both the continuity and difference of temporal term-pair embeddings. The important spatiotemporal dependencies of term-pair relations are extracted based solely on attention mechanism for generating hypotheses.


Subject(s)
Immunotherapy , Neurology , Learning , Neural Networks, Computer , PubMed
9.
J Comput Biol ; 31(1): 21-40, 2024 01.
Article in English | MEDLINE | ID: mdl-38170180

ABSTRACT

Single-cell data afford unprecedented insights into molecular processes. But the complexity and size of these data sets have proved challenging and given rise to a large armory of statistical and machine learning approaches. The majority of approaches focuses on either describing features of these data, or making predictions and classifying unlabeled samples. In this study, we introduce repeated decision stumping (ReDX) as a method to distill simple models from single-cell data. We develop decision trees of depth one (hence "stumps") to identify, in an inductive manner, gene products involved in driving cell fate transitions, and in applications to published data we are able to discover the key players involved in these processes in an unbiased manner without prior knowledge. Our algorithm is deliberately targeting the simplest possible candidate hypotheses that can be extracted from complex high-dimensional data. There are three reasons for this: (1) the predictions become straightforwardly testable hypotheses; (2) the identified candidates form the basis for further mechanistic model development, for example, for engineering and synthetic biology interventions; and (3) this approach complements existing descriptive modeling approaches and frameworks. The approach is computationally efficient, has remarkable predictive power, including in simulation studies where the ground truth is known, and yields robust and statistically stable predictors; the same set of candidates is generated by applying the algorithm to different subsamples of experimental data.
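A depth-one decision tree of the kind described above can be sketched in a few lines. The code below is a minimal illustration, not the published ReDX implementation: the function names and the toy expression matrix are hypothetical, and the repetition scheme (refit after excluding the gene chosen in the previous round) is an assumption made for the example.

```python
def best_stump(X, y, excluded=frozenset()):
    """Return (gene_index, threshold, accuracy) for the best single split.

    X is a list of cells, each a list of expression values; y holds 0/1 labels.
    """
    best = None
    n = len(y)
    for j in range(len(X[0])):
        if j in excluded:
            continue
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds are midpoints between consecutive observed values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            agree = sum(1 for row, label in zip(X, y) if (row[j] > t) == (label == 1))
            # A stump may predict either class on the high side of the threshold.
            acc = max(agree, n - agree) / n
            if best is None or acc > best[2]:
                best = (j, t, acc)
    return best

def repeated_stumps(X, y, rounds):
    """Fit stumps repeatedly, excluding each chosen gene from later rounds."""
    chosen, excluded = [], set()
    for _ in range(rounds):
        stump = best_stump(X, y, excluded)
        if stump is None:
            break
        chosen.append(stump)
        excluded.add(stump[0])
    return chosen

# Invented toy data: 4 cells x 3 genes, with two cell states labelled 0 and 1.
cells = [[0.1, 5.0, 1.0],
         [0.2, 4.0, 3.0],
         [0.9, 5.5, 2.9],
         [0.8, 4.5, 1.1]]
states = [0, 0, 1, 1]
candidates = repeated_stumps(cells, states, rounds=2)
```

Each returned tuple reads directly as a testable statement of the form "expression of gene j above threshold t separates the two states", which is the property the abstract emphasizes.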


Subject(s)
Algorithms , Machine Learning , Computer Simulation
10.
Cogn Sci ; 48(1): e13400, 2024 01.
Article in English | MEDLINE | ID: mdl-38196160

ABSTRACT

How are new Bayesian hypotheses generated within the framework of predictive processing? This explanatory framework purports to provide a unified, systematic explanation of cognition by appealing to Bayes rule and hierarchical Bayesian machinery alone. Given that the generation of new hypotheses is fundamental to Bayesian inference, the predictive processing framework faces an important challenge in this regard. By examining several cognitive-level and neurobiological architecture-inspired models of hypothesis generation, we argue that there is an essential difference between the two types of models. Cognitive-level models do not specify how they can be implemented in brains and include structures and assumptions that are external to the predictive processing framework. By contrast, neurobiological architecture-inspired models, which aim to better resemble brain processes, fail to explain important capacities of cognition, such as categorization and few-shot learning. The "scaling-up" challenge for proponents of predictive processing is to explain the relationship between these two types of models using only the theoretical and conceptual machinery of Bayesian inference.


Subject(s)
Brain , Cognition , Humans , Bayes Theorem , Learning
11.
Foodborne Pathog Dis ; 21(2): 83-91, 2024 02.
Article in English | MEDLINE | ID: mdl-37943621

ABSTRACT

Information on the causative agent in an enteric disease outbreak can be used to generate hypotheses about the route of transmission and possible vehicles, to guide environmental assessments, and to target outbreak control measures. However, only about 40% of outbreaks reported in the United States include a confirmed etiology. The goal of this project was to identify clinical and demographic characteristics that can be used to predict the causative agent in an enteric disease outbreak and to use these data to develop an online tool for investigators to use during an outbreak when hypothesizing about the causative agent. Using data on enteric disease outbreaks from all transmission routes (animal contact, environmental contamination, foodborne, person-to-person, waterborne, unknown) reported to the U.S. Centers for Disease Control and Prevention, we developed random forest models to predict the etiology of an outbreak based on aggregated clinical and demographic characteristics at both the etiology category (i.e., bacteria, parasites, toxins, viruses) and individual etiology (Clostridium perfringens, Campylobacter, Cryptosporidium, norovirus, Salmonella, Shiga toxin-producing Escherichia coli, and Shigella) levels. The etiology category model had a kappa of 0.85 and an accuracy of 0.92, whereas the etiology-specific model had a kappa of 0.75 and an accuracy of 0.86. The highest sensitivities in the etiology category model were for bacteria and viruses; all categories had high specificities (>0.90). For the etiology-specific model, norovirus and Salmonella had the highest sensitivity and all etiologies had high specificities. When laboratory confirmation is unavailable, information on the clinical signs and symptoms reported by people associated with the outbreak, with other characteristics including case demographics and illness severity, can be used to predict the etiology or etiology category. 
An online publicly available tool was developed to assist investigators in their enteric disease outbreak investigations.
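The two summary metrics reported above, accuracy and Cohen's kappa, can both be derived from a confusion matrix. The sketch below shows the standard calculation; the function name and the counts are invented for illustration and are not taken from the study.

```python
def accuracy_and_kappa(confusion):
    """Compute accuracy and Cohen's kappa from a square confusion matrix.

    confusion[i][j] is the number of outbreaks with true class i predicted as class j.
    """
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(len(confusion))) / total
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(col) for col in zip(*confusion)]
    # Agreement expected by chance if predictions were independent of the truth.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

# Invented two-class example (e.g. bacterial vs. viral etiology).
toy_accuracy, toy_kappa = accuracy_and_kappa([[40, 10],
                                              [5, 45]])
```

Kappa is lower than accuracy because it discounts the agreement a classifier would reach by chance alone, which matters when one etiology dominates the reported outbreaks.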


Subject(s)
Cryptosporidiosis , Cryptosporidium , Foodborne Diseases , Norovirus , Viruses , Animals , Humans , United States , Disease Outbreaks , Bacteria , Population Surveillance , Foodborne Diseases/microbiology
12.
medRxiv ; 2023 Oct 31.
Article in English | MEDLINE | ID: mdl-37961555

ABSTRACT

Objectives: This study aims to identify the cognitive events related to information use (e.g., "Analyze data", "Seek connection") during hypothesis generation among clinical researchers. Specifically, we describe hypothesis generation using cognitive event counts and compare them between groups. Methods: The participants used the same datasets, followed the same scripts, used VIADS (a visual interactive analysis tool for filtering and summarizing large data sets coded with hierarchical terminologies) or other analytical tools (as control) to analyze the datasets, and came up with hypotheses while following the think-aloud protocol. Their screen activities and audio were recorded and then transcribed and coded for cognitive events. Results: The VIADS group exhibited the lowest mean number of cognitive events per hypothesis and the smallest standard deviation. The experienced clinical researchers had approximately 10% more valid hypotheses than the inexperienced group. The VIADS users among the inexperienced clinical researchers exhibited a trend similar to that of the experienced clinical researchers in terms of the number of cognitive events and their respective percentages out of all the cognitive events. The highest percentages of cognitive events in hypothesis generation were "Using analysis results" (30%) and "Seeking connections" (23%). Conclusion: VIADS helped inexperienced clinical researchers use fewer cognitive events to generate hypotheses than the control group. This suggests that VIADS may guide participants to be more structured during hypothesis generation compared with the control group. The results provide evidence to explain the shorter average time needed by the VIADS group in generating each hypothesis.

13.
J Biomol Struct Dyn ; : 1-20, 2023 Oct 23.
Article in English | MEDLINE | ID: mdl-37870113

ABSTRACT

Thymidylate synthase (TS) is a crucial target of cancer drug discovery and is mainly involved in the de novo synthesis of the DNA precursor thymine. In the present study, to generate reliable models and identify a few promising molecules, we combined QSAR modelling with the pharmacophore hypothesis-generating technique. Input molecules were clustered on their similarity, and a cluster of 74 molecules with a pyrimidine moiety was chosen as the set for 3D-QSAR and pharmacophore modelling. Atom-based and field-based 3D-QSAR models were generated and statistically validated with R² > 0.90 and Q² > 0.75. The common pharmacophore hypothesis (CPH) generation identified the best six-point model, ADHRRR. Using these best models, a library of FDA-approved drugs was screened for activity and filtered via molecular docking, ADME profiling, and molecular dynamics simulations. The top ten promising TS-inhibiting candidates were identified, and their chemical features favorable for TS inhibition were explored. Communicated by Ramaswamy H. Sarma.

14.
Methods Mol Biol ; 2698: 351-360, 2023.
Article in English | MEDLINE | ID: mdl-37682484

ABSTRACT

Gene regulatory networks (GRNs) are important for determining how an organism develops and how it responds to external stimuli. In the case of Arabidopsis thaliana, several GRNs have been identified covering many important biological processes. We present AGENT, the Arabidopsis GEne Network Tool, for exploring and analyzing published GRNs. Using tools in AGENT, regulatory motifs such as feed-forward loops can be easily identified. Nodes with high centrality (and hence importance) can likewise be identified. Gene expression data can also be overlaid onto GRNs to help discover subnetworks acting in specific tissues or under certain conditions.
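A feed-forward loop is a three-node motif in which regulator A targets B, B targets C, and A also targets C directly. The sketch below shows one way such motifs and a simple degree-based centrality could be computed from an edge list. It does not reflect how AGENT is implemented; the function names and the placeholder gene labels are hypothetical.

```python
def feed_forward_loops(edges):
    """Return sorted (A, B, C) triples where A->B, B->C and A->C are all edges."""
    targets = {}
    for source, target in edges:
        targets.setdefault(source, set()).add(target)
    loops = []
    for a, a_targets in targets.items():
        for b in a_targets:
            if b == a:
                continue  # ignore self-regulation
            for c in targets.get(b, ()):
                if c not in (a, b) and c in a_targets:
                    loops.append((a, b, c))
    return sorted(loops)

def degree_centrality(edges):
    """Count incoming plus outgoing edges per node as a simple importance score."""
    degree = {}
    for source, target in edges:
        degree[source] = degree.get(source, 0) + 1
        degree[target] = degree.get(target, 0) + 1
    return degree

# Placeholder regulator -> target pairs, not a published network.
toy_edges = [("TF_A", "TF_B"), ("TF_B", "GENE_C"), ("TF_A", "GENE_C"), ("GENE_C", "GENE_D")]
toy_loops = feed_forward_loops(toy_edges)
toy_degree = degree_centrality(toy_edges)
```

Degree is only one of several centrality measures; the abstract does not say which ones the tool offers, so this choice is an assumption.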


Subject(s)
Arabidopsis , Arabidopsis/genetics , Gene Regulatory Networks
15.
Int J Sports Physiol Perform ; 18(10): 1213-1218, 2023 Oct 01.
Article in English | MEDLINE | ID: mdl-37463668

ABSTRACT

PURPOSE: There has been a proliferation in technologies in the sport performance environment that collect increasingly larger quantities of athlete data. These data have the potential to be personal, sensitive, and revealing and raise privacy and confidentiality concerns. A solution may be the use of synthetic data, which mimic the properties of the original data. The aim of this study was to provide examples of synthetic data generation to demonstrate its practical use and to deploy a freely available web-based R Shiny application to generate synthetic data. METHODS: Openly available data from 2 previously published studies were obtained, representing typical data sets of (1) field- and gym-based team-sport external and internal load during a preseason period (n = 28) and (2) performance and subjective changes from before to after the posttraining intervention (n = 22). Synthetic data were generated using the synthpop package in R Studio software, and comparisons between the original and synthetic data sets were made through Welch t tests and the distributional similarity standardized propensity mean squared error statistic. RESULTS: There were no significant differences between the original and synthetic data sets across all variables examined in both data sets (P > .05). Further, there was distributional similarity (ie, low standardized propensity mean squared error) between the original observed and synthetic data sets. CONCLUSIONS: These findings highlight the potential use of synthetic data as a practical solution to privacy and confidentiality issues. Synthetic data can unlock previously inaccessible data sets for exploratory analysis and facilitate multiteam or multicenter collaborations. Interested sport scientists, practitioners, and researchers should consider utilizing the shiny web application (SYNTHETIC DATA, available at https://assetlab.shinyapps.io/SyntheticData/).
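The Welch t test used above to compare original and synthetic variables does not assume equal variances in the two samples. The study itself worked in R; the Python sketch below only illustrates the statistic and its Welch-Satterthwaite degrees of freedom, with invented numbers and a hypothetical function name.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return the Welch t statistic and Welch-Satterthwaite degrees of freedom."""
    # Squared standard errors of each sample mean (sample variance over n).
    se_a = variance(sample_a) / len(sample_a)
    se_b = variance(sample_b) / len(sample_b)
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(sample_a) - 1) + se_b ** 2 / (len(sample_b) - 1)
    )
    return t, df

# Invented values standing in for one variable in an original and a synthetic data set.
original = [1, 2, 3, 4, 5]
synthetic = [2, 4, 6, 8, 10]
t_stat, dof = welch_t(original, synthetic)
```

The p-value would then come from a t distribution with `dof` degrees of freedom, which needs a statistics library and is left out of this sketch.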


Subject(s)
Privacy , Sports , Humans , Confidentiality , Software , Technology
16.
medRxiv ; 2023 Oct 31.
Article in English | MEDLINE | ID: mdl-37333271

ABSTRACT

Objectives: To compare how clinical researchers generate data-driven hypotheses with a visual interactive analytic tool (VIADS, a visual interactive analysis tool for filtering and summarizing large data sets coded with hierarchical terminologies) or other tools. Methods: We recruited clinical researchers and separated them into "experienced" and "inexperienced" groups. Participants were randomly assigned to a VIADS or control group within the groups. Each participant conducted a remote 2-hour study session for hypothesis generation with the same study facilitator on the same datasets by following a think-aloud protocol. Screen activities and audio were recorded, transcribed, coded, and analyzed. Hypotheses were evaluated by seven experts on their validity, significance, and feasibility. We conducted multilevel random effect modeling for statistical tests. Results: Eighteen participants generated 227 hypotheses, of which 147 (65%) were valid. The VIADS and control groups generated a similar number of hypotheses. The VIADS group took a significantly shorter time to generate one hypothesis (e.g., among inexperienced clinical researchers, 258 seconds versus 379 seconds, p = 0.046, power = 0.437, ICC = 0.15). The VIADS group received significantly lower ratings than the control group on feasibility and the combination rating of validity, significance, and feasibility. Conclusion: The role of VIADS in hypothesis generation seems inconclusive. The VIADS group took a significantly shorter time to generate each hypothesis. However, the combined validity, significance, and feasibility ratings of their hypotheses were significantly lower. Further characterization of hypotheses, including specifics on how they might be improved, could guide future tool development.

17.
J Biomed Inform ; 142: 104383, 2023 06.
Article in English | MEDLINE | ID: mdl-37196989

ABSTRACT

OBJECTIVE: To demonstrate and develop an approach enabling individual researchers or small teams to create their own ad-hoc, lightweight knowledge bases tailored for specialized scientific interests, using text-mining over scientific literature, and demonstrate the effectiveness of these knowledge bases in hypothesis generation and literature-based discovery (LBD). METHODS: We propose a lightweight process using an extractive search framework to create ad-hoc knowledge bases, which require minimal training and no background in bio-curation or computer science. These knowledge bases are particularly effective for LBD and hypothesis generation using Swanson's ABC method. The personalized nature of the knowledge bases allows for a somewhat higher level of noise than "public facing" ones, as researchers are expected to have prior domain experience to separate signal from noise. Fact verification is shifted from exhaustive verification of the knowledge base to post-hoc verification of specific entries of interest, allowing researchers to assess the correctness of relevant knowledge base entries by considering the paragraphs in which the facts were introduced. RESULTS: We demonstrate the methodology by constructing several knowledge bases of different kinds: three knowledge bases that support lab-internal hypothesis generation: Drug Delivery to Ovarian Tumors (DDOT); Tissue Engineering and Regeneration; Challenges in Cancer Research; and an additional comprehensive, accurate knowledge base designated as a public resource for the wider community on the topic of Cell Specific Drug Delivery (CSDD). In each case, we show the design and construction process, along with relevant visualizations for data exploration, and hypothesis generation. For CSDD and DDOT we also show meta-analysis, human evaluation, and in vitro experimental evaluation. 
CONCLUSION: Our approach enables researchers to create personalized, lightweight knowledge bases for specialized scientific interests, effectively facilitating hypothesis generation and literature-based discovery (LBD). By shifting fact verification efforts to post-hoc verification of specific entries, researchers can focus on exploring and generating hypotheses based on their expertise. The constructed knowledge bases demonstrate the versatility and adaptability of our approach to diverse research interests. The web-based platform, available at https://spike-kbc.apps.allenai.org, provides researchers with a valuable tool for rapid construction of knowledge bases tailored to their needs.
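
The abstract above relies on Swanson's ABC method for literature-based discovery: if term A is linked to B in one part of the literature and B is linked to C in another, while A and C are never linked directly, then A-C is a candidate hypothesis. The sketch below is a minimal illustration of that idea over a list of extracted term pairs. The function name and the toy relations are assumptions made for illustration and are not taken from the paper or its platform; the pairs loosely echo Swanson's well-known fish oil and Raynaud's syndrome example.

```python
from collections import defaultdict


def abc_candidates(relations, a_term):
    """Return {C: sorted list of linking B terms} for every term C that
    shares at least one intermediate B with a_term but has no direct
    link to a_term. Relations are treated as undirected co-mentions."""
    neighbors = defaultdict(set)
    for left, right in relations:
        neighbors[left].add(right)
        neighbors[right].add(left)

    direct = neighbors[a_term]  # all B terms linked to A
    candidates = defaultdict(set)
    for b in direct:
        for c in neighbors[b]:
            # Skip A itself and anything already directly linked to A.
            if c != a_term and c not in direct:
                candidates[c].add(b)
    return {c: sorted(bs) for c, bs in candidates.items()}


# Toy relations for illustration only (hypothetical knowledge base rows).
toy_relations = [
    ("fish oil", "blood viscosity"),
    ("fish oil", "platelet aggregation"),
    ("blood viscosity", "Raynaud's syndrome"),
    ("platelet aggregation", "Raynaud's syndrome"),
]

hypotheses = abc_candidates(toy_relations, "fish oil")
```

Each candidate C is returned together with the B terms that connect it to A, which matches the post-hoc verification style described in the abstract: a researcher can go back to the source paragraphs for just those A-B and B-C entries.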


Subject(s)
Data Mining, Knowledge Discovery, Humans, Data Mining/methods, Knowledge Discovery/methods, Publications
18.
Environ Sci Technol ; 57(22): 8236-8244, 2023 06 06.
Article in English | MEDLINE | ID: mdl-37224396

ABSTRACT

Contemporary environmental health sciences draw on large-scale longitudinal studies to understand the impact of environmental exposures and behavioral factors on the risk of disease and to identify potential underlying mechanisms. In such studies, cohorts of individuals are assembled and followed up over time. Each cohort generates hundreds of publications, which are typically neither coherently organized nor summarized, limiting knowledge-driven dissemination. We therefore propose the Cohort Network, a multilayer knowledge graph approach to extract exposures, outcomes, and their connections. We applied the Cohort Network to 121 peer-reviewed papers published over the past 10 years from the Veterans Affairs (VA) Normative Aging Study (NAS). The Cohort Network visualized connections between exposures and outcomes across different publications and identified key exposures and outcomes, such as air pollution, DNA methylation, and lung function. We demonstrated the utility of the Cohort Network for new hypothesis generation, e.g., identification of potential mediators of exposure-outcome associations. The Cohort Network can be used by investigators to summarize the cohort's research and facilitate knowledge-driven discovery and dissemination.
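
One way to picture the "potential mediators" use case mentioned above is as a two-step path search in a directed graph of extracted associations: a node M is a candidate mediator of an exposure-outcome pair when the graph contains both exposure → M and M → outcome. The sketch below assumes that simple representation. The function name, the edge-list format, and the example edges are hypothetical; they reuse terms from the abstract but are not findings reported by the study, and the abstract does not describe the authors' actual algorithm.

```python
def potential_mediators(edges, exposure, outcome):
    """Return, in sorted order, the nodes M for which both
    exposure -> M and M -> outcome appear in the directed edge list."""
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, set()).add(dst)

    return sorted(
        m
        for m in successors.get(exposure, set())
        if m != outcome and outcome in successors.get(m, set())
    )


# Hypothetical associations for illustration only (not study results).
toy_edges = [
    ("air pollution", "DNA methylation"),
    ("DNA methylation", "lung function"),
    ("air pollution", "lung function"),
    ("air pollution", "blood pressure"),
]

mediators = potential_mediators(toy_edges, "air pollution", "lung function")
```

A path found this way is only a prompt for a hypothesis; whether the intermediate node truly mediates the association would still need to be tested in the cohort data.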


Subject(s)
Air Pollutants, Air Pollution, Humans, Air Pollutants/analysis, Pattern Recognition, Automated, Environmental Exposure/analysis, Air Pollution/analysis, Cohort Studies
19.
Cognition ; 238: 105471, 2023 09.
Article in English | MEDLINE | ID: mdl-37236019

ABSTRACT

A defining aspect of being human is an ability to reason about the world by generating and adapting ideas and hypotheses. Here we explore how this ability develops by comparing children's and adults' active search and explicit hypothesis generation patterns in a task that mimics the open-ended process of scientific induction. In our experiment, 54 children (aged 8.97±1.11) and 50 adults performed inductive inferences about a series of causal rules through active testing. Children were more elaborate in their testing behavior and generated substantially more complex guesses about the hidden rules. We take a 'computational constructivist' perspective to explaining these patterns, arguing that these inferences are driven by a combination of thinking (generating and modifying symbolic concepts) and exploring (discovering and investigating patterns in the physical world). We show how this framework and rich new dataset speak to questions about developmental differences in hypothesis generation, active learning and inductive generalization. In particular, we find children's learning is driven by less fine-tuned construction mechanisms than adults', resulting in a greater diversity of ideas but less reliable discovery of simple explanations.


Subject(s)
Child Development, Generalization, Psychological, Child, Humans, Adult
20.
JMIR Hum Factors ; 10: e44644, 2023 Apr 27.
Article in English | MEDLINE | ID: mdl-37011112

ABSTRACT

BACKGROUND: Visualization can be a powerful tool to comprehend data sets, especially when they can be represented via hierarchical structures. Enhanced comprehension can facilitate the development of scientific hypotheses. However, the inclusion of excessive data can make visualizations overwhelming. OBJECTIVE: We developed a visual interactive analytic tool for filtering and summarizing large health data sets coded with hierarchical terminologies (VIADS). In this study, we evaluated the usability of VIADS for visualizing data sets of patient diagnoses and procedures coded in the International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM). METHODS: We used mixed methods in the study. A group of 12 clinical researchers participated in the generation of data-driven hypotheses using the same data sets and time frame (a 1-hour training session and a 2-hour study session) utilizing VIADS via the think-aloud protocol. The audio and screen activities were recorded remotely. A modified version of the System Usability Scale (SUS) survey and a brief survey with open-ended questions were administered after the study to assess the usability of VIADS and verify their intense usage experience with VIADS. RESULTS: The range of SUS scores was 37.5 to 87.5. The mean SUS score for VIADS was 71.88 (out of a possible 100, SD 14.62), and the median SUS was 75. The participants unanimously agreed that VIADS offers new perspectives on data sets (12/12, 100%), while 75% (8/12) agreed that VIADS facilitates understanding, presentation, and interpretation of underlying data sets. The comments on the utility of VIADS were positive and aligned well with the design objectives of VIADS. The answers to the open-ended questions in the modified SUS provided specific suggestions regarding potential improvements for VIADS, and the identified problems with usability were used to update the tool. 
CONCLUSIONS: This usability study demonstrates that VIADS is a usable tool for analyzing secondary data sets with good average usability, good SUS score, and favorable utility. Currently, VIADS accepts data sets with hierarchical codes and their corresponding frequencies. Consequently, only specific types of use cases are supported by the analytical results. Participants agreed, however, that VIADS provides new perspectives on data sets and is relatively easy to use. The VIADS functionalities most appreciated by participants were the ability to filter, summarize, compare, and visualize data. INTERNATIONAL REGISTERED REPORT IDENTIFIER (IRRID): RR2-10.2196/39414.
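
For readers unfamiliar with how SUS scores such as those above are obtained, the sketch below applies the standard System Usability Scale scoring rule: ten items answered on a 1 to 5 scale, odd-numbered items contributing (response - 1), even-numbered items contributing (5 - response), and the sum multiplied by 2.5 to give a 0 to 100 score. The study used a modified SUS whose changes are not described in the abstract, so this is the conventional rule and may differ from the authors' scoring. The function name and the example responses are assumptions for illustration.

```python
def sus_score(responses):
    """Score one respondent's standard 10-item SUS questionnaire.

    responses: sequence of ten integers in 1..5, in item order.
    Returns a float between 0.0 and 100.0.
    """
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 item responses")
    if any(not 1 <= r <= 5 for r in responses):
        raise ValueError("each response must be between 1 and 5")

    total = 0
    for item_number, response in enumerate(responses, start=1):
        if item_number % 2 == 1:
            total += response - 1   # odd items are positively worded
        else:
            total += 5 - response   # even items are negatively worded
    return total * 2.5


# Hypothetical respondent who answers neutrally (3) on every item.
neutral_score = sus_score([3] * 10)
```

Because each item contributes 0 to 4 points and the sum is scaled by 2.5, individual scores always fall on multiples of 2.5, which is consistent with the reported range of 37.5 to 87.5.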
