1.
Am J Epidemiol; 192(6): 995-1005, 2023 Jun 02.
Article in English | MEDLINE | ID: mdl-36804665

ABSTRACT

Data sharing is essential for the reproducibility of epidemiologic research, replication of findings, pooled analyses in consortia, and maximizing study value to address multiple research questions. However, barriers related to confidentiality, costs, and incentives often limit the extent and speed of data sharing. Epidemiologic practices that follow the Findable, Accessible, Interoperable, Reusable (FAIR) principles can address these barriers by making data resources findable with the necessary metadata, accessible to authorized users, and interoperable with other data, optimizing the reuse of resources with appropriate credit to their creators. We provide an overview of these principles and describe approaches for implementing them in epidemiology. Increasing degrees of FAIRness can be achieved by moving data and code from on-site locations to remote, accessible ("cloud") data servers, using machine-readable and nonproprietary file formats, and developing open-source code. Adoption of these practices will improve daily work and collaborative analyses and facilitate compliance with data-sharing policies from funders and scientific journals. Achieving a high degree of FAIRness will require funding, training, organizational support, recognition, and incentives for sharing research resources, both data and code. However, these costs are outweighed by the benefits of making research more reproducible, impactful, and equitable by facilitating the reuse of precious research resources by the scientific community.
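For illustration, a minimal sketch of the "machine-readable, nonproprietary files" practice mentioned above, assuming a hypothetical cohort extract (file names, fields, and license are invented for the example, not taken from the article):

```python
"""Minimal sketch: exporting a study table in a nonproprietary,
machine-readable form with a metadata sidecar. File names and
fields are hypothetical illustrations."""
import csv
import json

records = [
    {"participant_id": "P001", "age_years": 54, "exposure": 1, "outcome": 0},
    {"participant_id": "P002", "age_years": 61, "exposure": 0, "outcome": 1},
]

# Plain CSV instead of a proprietary binary format.
with open("cohort.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)

# Machine-readable metadata so the file is findable and interpretable.
metadata = {
    "title": "Example cohort extract",
    "license": "CC-BY-4.0",
    "variables": {
        "participant_id": "pseudonymous participant identifier",
        "age_years": "age at baseline, in years",
        "exposure": "binary exposure indicator (0/1)",
        "outcome": "binary outcome indicator (0/1)",
    },
}
with open("cohort.metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```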


Subjects
Confidentiality; Information Dissemination; Humans; Reproducibility of Results; Software; Epidemiologic Studies
2.
J Biomed Inform; 107: 103455, 2020 Jul.
Article in English | MEDLINE | ID: mdl-32497685

ABSTRACT

Modeling factors that influence disease phenotypes from biomarker profiling study datasets is a critical task in biomedicine. Such datasets are typically generated from high-throughput 'omic' technologies, which help examine disease mechanisms at an unprecedented resolution, and they are challenging because they are high-dimensional. The disease mechanisms they probe are also complex because many diseases are multifactorial, resulting from the collective activity of several factors, each with a small effect. Bayesian rule learning (BRL) infers rule models from Bayesian networks learned from data and has been shown to be effective for modeling high-dimensional datasets. However, BRL is not efficient at modeling multifactorial diseases because it suffers from data fragmentation during learning. In this paper, we overcome this limitation by implementing and evaluating three ensemble model combination strategies with BRL: uniform combination (UC; same as bagging), Bayesian model averaging (BMA), and Bayesian model combination (BMC), collectively called Ensemble Bayesian Rule Learning (EBRL). We also introduce a novel method to visualize EBRL models, called the Bayesian Rule Ensemble Visualizing tool (BREVity), which helps extract and interpret the most important rule patterns guiding the ensemble model's predictions. Our results on twenty-five public, high-dimensional gene expression datasets of multifactorial diseases suggest that EBRL models using UC and BMC achieve better predictive performance than BMA and other classic machine learning methods. Furthermore, BMC is more reliable than UC when the ensemble includes suboptimal models resulting from the stochasticity of the model search process. Together, EBRL and BREVity provide researchers with a promising and novel tool for modeling multifactorial diseases from high-dimensional datasets, leveraging the strengths of ensemble methods for predictive performance while also providing interpretable explanations for its predictions.
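To make the combination strategies concrete, here is a small sketch of uniform versus weighted combination of member predictions, using generic probabilistic classifiers in place of BRL rule models; the weights are illustrative stand-ins for posterior-derived weights, not the article's BMA/BMC estimates:

```python
"""Sketch of the two ensemble combination ideas named in the abstract."""
import numpy as np

def uniform_combination(member_probs):
    """UC / bagging-style: average class probabilities with equal weight."""
    return np.mean(member_probs, axis=0)

def weighted_combination(member_probs, weights):
    """BMA/BMC-style: weight each member, e.g. by approximate posterior mass."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return np.tensordot(w, member_probs, axes=1)

# Three members' predicted class probabilities for four samples, two classes.
member_probs = np.array([
    [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.7, 0.3]],
    [[0.8, 0.2], [0.4, 0.6], [0.5, 0.5], [0.6, 0.4]],
    [[0.7, 0.3], [0.3, 0.7], [0.7, 0.3], [0.8, 0.2]],
])

print(uniform_combination(member_probs))
print(weighted_combination(member_probs, weights=[0.5, 0.3, 0.2]))
```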


Subjects
Machine Learning; Bayes Theorem
4.
BMC Cancer; 16: 184, 2016 Mar 04.
Article in English | MEDLINE | ID: mdl-26944944

ABSTRACT

BACKGROUND: Adenocarcinoma (ADC) and squamous cell carcinoma (SCC) are the most prevalent histological types of lung cancer. Distinguishing between these subtypes is critically important because they have different implications for prognosis and treatment. Normally, histopathological analyses are used to distinguish between the two, with tissue collected from small endoscopic samples or needle aspirations. However, the lack of cell architecture in these small tissue samples hampers the process of distinguishing between the two subtypes. Molecular profiling can also be used to discriminate between the two lung cancer subtypes, on condition that the biopsy is composed of at least 50% tumor cells. However, in some cases the tissue composition of a biopsy might be a mix of tumor and tumor-adjacent histologically normal (TAHN) tissue. When this happens, a new biopsy is required, with associated cost, risks, and discomfort to the patient. To avoid this problem, we hypothesize that a computational method can distinguish between lung cancer subtypes given tumor and TAHN tissue. METHODS: Using publicly available datasets for gene expression and DNA methylation, we performed four classification tasks, depending on the possible combinations of tumor and TAHN tissue. First, we used a feature selector (ReliefF/Limma) to select relevant variables, which were then used to build a simple naïve Bayes classification model. Then, we evaluated the classification performance of our models by measuring the area under the receiver operating characteristic curve (AUC). Finally, we analyzed the relevance of the selected genes using hierarchical clustering and IPA® software for gene functional analysis. RESULTS: All Bayesian models achieved high classification performance (AUC > 0.94), which was confirmed by hierarchical cluster analysis. Of the genes selected, 25 (93%) were found to be related to cancer (19 were associated with ADC or SCC), confirming the biological relevance of our method. CONCLUSIONS: The results of this study confirm that computational methods using tumor and TAHN tissue can serve as a prognostic tool for lung cancer subtype classification. Our study complements results from other studies in which TAHN tissue has been used as a prognostic tool for prostate cancer. The clinical implications of this finding could greatly benefit lung cancer patients.
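A rough sketch of the pipeline shape described in the methods (filter-style feature selection followed by a naïve Bayes classifier, scored by AUC); the ANOVA filter below is only a stand-in for ReliefF/Limma, and the data are synthetic:

```python
"""Feature filter + naive Bayes + cross-validated AUC, on synthetic data."""
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Synthetic high-dimensional "expression" data: 100 samples, 2000 features.
X, y = make_classification(n_samples=100, n_features=2000, n_informative=20,
                           random_state=0)

pipeline = make_pipeline(
    SelectKBest(f_classif, k=25),  # keep a small panel of candidate genes
    GaussianNB(),                  # simple naive Bayes classifier
)

aucs = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")
print("Mean cross-validated AUC:", aucs.mean().round(3))
```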


Subjects
Genomics/methods; Lung Neoplasms/diagnosis; Lung Neoplasms/genetics; Adenocarcinoma/diagnosis; Adenocarcinoma/genetics; Bayes Theorem; Carcinoma, Squamous Cell/diagnosis; Carcinoma, Squamous Cell/genetics; Cluster Analysis; Computational Biology/methods; DNA Methylation; Databases, Nucleic Acid; Datasets as Topic; Gene Expression Profiling/methods; Gene Expression Regulation, Neoplastic; Gene Regulatory Networks; Humans; Prognosis; Reproducibility of Results
5.
JAMIA Open; 7(2): ooae055, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38938691

ABSTRACT

Objectives: Absolute risk models estimate an individual's future disease risk over a specified time interval. Applications that rely on server-side risk tooling, such as the R-based iCARE (R-iCARE), to build, validate, and apply absolute risk models face limitations in portability and privacy because they must circulate user data to remote servers for operation. We overcome this by porting iCARE to the web platform. Materials and Methods: We refactored R-iCARE into a Python package (Py-iCARE) and then compiled it to WebAssembly (Wasm-iCARE), a portable web module that operates within the privacy of the user's device. Results: We showcase the portability and privacy of Wasm-iCARE through two applications: one for researchers to statistically validate risk models and one to deliver them to end users. Both applications run entirely on the client side, require no downloads or installations, and keep user data on-device during risk calculation. Conclusions: Wasm-iCARE fosters accessible and privacy-preserving risk tools, accelerating their validation and delivery.
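For context, a deliberately simplified sketch of the kind of absolute-risk arithmetic such a tool performs (this is not iCARE's actual implementation; competing risks and risk-factor distributions are omitted, and the hazards are invented):

```python
"""Simplified absolute-risk calculation over a fixed time interval."""
import math

def absolute_risk(baseline_hazards, relative_risk):
    """Risk of disease over the interval covered by the yearly baseline
    hazards, for an individual with the given multiplicative relative risk."""
    cumulative_hazard = sum(h * relative_risk for h in baseline_hazards)
    return 1.0 - math.exp(-cumulative_hazard)

# Hypothetical age-specific yearly baseline hazards over a 5-year interval.
baseline = [0.002, 0.0022, 0.0025, 0.0028, 0.003]
print(f"5-year absolute risk: {absolute_risk(baseline, relative_risk=1.8):.3%}")
```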

6.
ArXiv; 2023 Oct 13.
Article in English | MEDLINE | ID: mdl-37873020

ABSTRACT

Objective: Absolute risk models estimate an individual's future disease risk over a specified time interval. Applications that rely on server-side risk tooling, such as the R-based iCARE (R-iCARE), to build, validate, and apply absolute risk models face serious limitations in portability and privacy because they must circulate user data to remote servers for operation. Our objective was to overcome these limitations. Materials and Methods: We refactored R-iCARE into a Python package (Py-iCARE) and then compiled it to WebAssembly (Wasm-iCARE), a portable web module that operates entirely within the privacy of the user's device. Results: We showcase the portability and privacy of Wasm-iCARE through two applications: one for researchers to statistically validate risk models and one to deliver them to end users. Both applications run entirely on the client side, require no downloads or installations, and keep user data on-device during risk calculation. Conclusions: Wasm-iCARE fosters accessible and privacy-preserving risk tools, accelerating their validation and delivery.

7.
Bioinform Adv; 3(1): vbad145, 2023.
Article in English | MEDLINE | ID: mdl-37868335

ABSTRACT

Motivation: The Polygenic Score (PGS) Catalog currently curates over 400 publications on over 500 traits, corresponding to over 3,000 polygenic risk scores (PRSs). To assess the feasibility of privately calculating the underlying multivariate relative risk for individuals with consumer genomics data, we developed an in-browser PRS calculator for genomic data that does not circulate any data or perform any computation outside of the user's personal device. Results: A prototype personal risk score calculator was developed for research purposes to demonstrate how the PGS Catalog can be privately applied to data from readily available direct-to-consumer genetic testing services, such as 23andMe. No software download, installation, or configuration is needed. The PRS web calculator matches individual PGS Catalog entries with an individual's 23andMe genome data, composed of 600,000 to 1.4 million single-nucleotide polymorphisms (SNPs). Beta coefficients provide researchers with a convenient assessment of the risk associated with matched SNPs. This in-browser application was tested on a variety of personal devices, including smartphones, establishing the feasibility of privately calculating personal risk scores with up to a few thousand reference genetic variants and from the full 23andMe SNP data file (compressed or not). Availability and implementation: The PRScalc web application is developed in JavaScript, HTML, and CSS and is available in a GitHub repository (https://episphere.github.io/prs) under an MIT license. The datasets were derived from sources in the public domain: [PGS Catalog, Personal Genome Project].
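The core PRS arithmetic described above (matching scoring-file variants to genotype calls and summing beta × effect-allele dosage) can be sketched as follows; the variants, weights, and genotypes are invented, and strand checks and missing-data handling are omitted:

```python
"""Toy PRS computation from PGS-Catalog-style weights and 23andMe-style calls."""

# Scoring entries: rsID -> (effect allele, beta coefficient). Invented values.
score_weights = {
    "rs429358": ("C", 1.12),
    "rs7412": ("T", -0.47),
    "rs12345": ("A", 0.03),
}

# Genotype calls: rsID -> two observed alleles. Invented values.
genotypes = {
    "rs429358": "CT",
    "rs7412": "TT",
    "rs12345": "AG",
}

prs = 0.0
matched = 0
for rsid, (effect_allele, beta) in score_weights.items():
    call = genotypes.get(rsid)
    if call is None:
        continue  # variant not present on the genotyping array
    dosage = call.count(effect_allele)  # 0, 1, or 2 copies of the effect allele
    prs += beta * dosage
    matched += 1

print(f"Matched {matched} variants; raw polygenic score = {prs:.3f}")
```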

8.
World J Clin Oncol; 9(5): 98-109, 2018 Sep 14.
Article in English | MEDLINE | ID: mdl-30254965

ABSTRACT

AIM: To develop a framework for incorporating background domain knowledge into classification rule learning for knowledge discovery in biomedicine. METHODS: Bayesian rule learning (BRL) is a rule-based classifier that uses a greedy best-first search over a space of Bayesian belief networks (BNs) to find the optimal BN to explain the input dataset, and then infers classification rules from this BN. BRL uses a Bayesian score to evaluate the quality of BNs. In this paper, we extended the Bayesian score to include informative structure priors, which encode our prior domain knowledge about the dataset. We refer to this extension of BRL as BRLp. The structure prior has a λ hyperparameter that allows the user to tune the degree to which prior knowledge is incorporated during model learning. We studied the effect of λ on model learning using a simulated dataset and a real-world lung cancer prognostic biomarker dataset, by measuring the degree of incorporation of our specified prior knowledge, and we also monitored its effect on predictive performance. Finally, we compared BRLp to other state-of-the-art classifiers commonly used in biomedicine. RESULTS: We evaluated the degree of incorporation of prior knowledge into BRLp on simulated data by measuring the graph edit distance between the true data-generating model and the model learned by BRLp, with the true model specified using informative structure priors. We observed that increasing the value of λ increased the influence of the specified structure priors on model learning, and a sufficiently large value of λ caused BRLp to recover the true model. This also led to a gain in predictive performance, measured by the area under the receiver operating characteristic curve (AUC). We then obtained a publicly available real-world lung cancer prognostic biomarker dataset and specified a known biomarker from the literature [the epidermal growth factor receptor (EGFR) gene]. We again observed that larger values of λ led to increased incorporation of EGFR into the final BRLp model. This relevant background knowledge also led to a gain in AUC. CONCLUSION: BRLp enables tunable structure priors to be incorporated during Bayesian classification rule learning, integrating data and knowledge, as demonstrated using lung cancer biomarker data.
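A toy sketch of how a tunable structure prior can enter a network score (this is not BRL's actual Bayesian score; it only shows how λ shifts model selection toward prior-consistent structures such as a known EGFR link, with invented scores and edges):

```python
"""Illustrative structure-prior term scaled by a lambda hyperparameter."""

def structure_prior_bonus(edges, prior_edges, lam):
    """Reward edges that match prior knowledge; penalize omitting prior edges."""
    agree = sum(1 for e in edges if e in prior_edges)
    missing = sum(1 for e in prior_edges if e not in edges)
    return lam * (agree - missing)

def penalized_score(data_score, edges, prior_edges, lam):
    """Total score = data-driven score + prior-knowledge term."""
    return data_score + structure_prior_bonus(edges, prior_edges, lam)

prior_edges = {("EGFR", "outcome")}
candidates = {
    "with_EGFR": (-120.5, {("EGFR", "outcome"), ("GeneX", "outcome")}),
    "without_EGFR": (-119.8, {("GeneX", "outcome")}),
}

# As lambda grows, the prior-consistent structure wins the search.
for lam in (0.0, 1.0, 5.0):
    best = max(candidates,
               key=lambda k: penalized_score(candidates[k][0],
                                             candidates[k][1],
                                             prior_edges, lam))
    print(f"lambda={lam}: selected {best}")
```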

9.
Data (Basel); 2(1), 2017 Mar.
Article in English | MEDLINE | ID: mdl-28331847

ABSTRACT

The comprehensibility of good predictive models learned from high-dimensional gene expression data is attractive because it can lead to biomarker discovery. Several good classifiers provide comparable predictive performance but differ in their ability to summarize the observed data. We extend a Bayesian rule learning algorithm (BRL-GSS), previously shown to be a significantly better predictor than other classical approaches in this domain. It searches a space of Bayesian networks using a decision-tree representation of its parameters with global constraints, and infers a set of IF-THEN rules. The number of parameters, and therefore the number of rules, is combinatorial in the number of predictor variables in the model. We relax these global constraints to a more generalizable local structure (BRL-LSS). BRL-LSS entails a more parsimonious set of rules because it does not have to generate all combinatorial rules. The search space of local structures is much richer than the space of global structures, and we design BRL-LSS with the same worst-case time complexity as BRL-GSS while exploring this richer and more complex model space. We measure predictive performance using the area under the ROC curve (AUC) and accuracy, and we measure model parsimony as the average number of rules and variables needed to describe the observed data. We evaluate the predictive and parsimony performance of BRL-GSS, BRL-LSS, and the state-of-the-art C4.5 decision tree algorithm using 10-fold cross-validation on ten microarray gene-expression diagnostic datasets. In these experiments, we observe that BRL-LSS is similar to BRL-GSS in predictive performance while generating a much more parsimonious set of rules to explain the same observed data. BRL-LSS also needs fewer variables than C4.5 to explain the data with similar predictive performance. We also conduct a feasibility study to demonstrate the general applicability of our BRL methods to newer RNA sequencing gene-expression data.
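A toy illustration of the parsimony comparison: a global, decision-tree-style parameterization enumerates a rule for every combination of its variables' values, whereas a local structure can cover the same predictions with fewer rules; the rules below are ad hoc examples, not BRL output:

```python
"""Counting rules and variables for a global vs. a local rule set."""
from itertools import product

variables = {"GeneA": ["low", "high"], "GeneB": ["low", "high"],
             "GeneC": ["low", "high"]}

# Global structure: one IF-THEN rule per combination of all variable values.
global_rules = [dict(zip(variables, combo))
                for combo in product(*variables.values())]

# Local structure: a few rules, each conditioning on fewer variables.
local_rules = [
    {"GeneA": "high"},                  # IF GeneA=high THEN class=1
    {"GeneA": "low", "GeneB": "high"},  # IF GeneA=low AND GeneB=high THEN class=1
    {"GeneA": "low", "GeneB": "low"},   # IF GeneA=low AND GeneB=low THEN class=0
]

def parsimony(rules):
    n_vars = len({v for rule in rules for v in rule})
    return len(rules), n_vars

print("global rules, variables:", parsimony(global_rules))  # (8, 3)
print("local rules, variables: ", parsimony(local_rules))   # (3, 2)
```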

10.
Article in English | MEDLINE | ID: mdl-24303297

ABSTRACT

Technology is constantly evolving, necessitating the development of workflows for the efficient use of high-dimensional data. We develop and test an empirical workflow for predictive modeling based on single nucleotide polymorphisms (SNPs) from genome-wide association study (GWAS) datasets. As a case study, we use SNP-based prediction of survival in non-small cell lung cancer (NSCLC) with a Bayesian rule learner system (BRL+). Lung cancer is a leading cause of mortality, and standard treatment for early-stage NSCLC is surgery. Adjuvant chemotherapy would benefit patients with early recurrence; consequently, we need models capable of predicting early recurrence. This workflow outlines the challenges involved in processing GWAS datasets from one popular platform (Affymetrix®), from the results files of the hybridization experiment to model construction. Our results show that the workflow is feasible and efficient for processing such data while also yielding SNP-based models with high predictive accuracy under cross-validation.
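A sketch of the workflow shape described above, under explicit assumptions: synthetic genotype calls, a simple call-rate filter with mean imputation, and a generic classifier standing in for BRL+:

```python
"""Encode genotype calls, filter low-call-rate SNPs, cross-validate a model."""
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_samples, n_snps = 120, 500

# Genotype calls per SNP: 'AA', 'AB', 'BB', or 'NC' (no call). Synthetic.
calls = rng.choice(["AA", "AB", "BB", "NC"], size=(n_samples, n_snps),
                   p=[0.4, 0.35, 0.2, 0.05])
y = rng.integers(0, 2, size=n_samples)  # e.g., early recurrence yes/no

# Encode calls as allele counts; mark no-calls as missing.
encoding = {"AA": 0.0, "AB": 1.0, "BB": 2.0, "NC": np.nan}
X = np.vectorize(encoding.get)(calls)

# Drop SNPs with a call rate below 95%, then impute missing calls with column means.
call_rate = 1.0 - np.isnan(X).mean(axis=0)
X = X[:, call_rate >= 0.95]
col_means = np.nanmean(X, axis=0)
X = np.where(np.isnan(X), col_means, X)

scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print(f"{X.shape[1]} SNPs retained; mean CV accuracy = {scores.mean():.2f}")
```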
