Results 1 - 20 of 137
1.
BMC Bioinformatics ; 25(1): 96, 2024 Mar 04.
Article in English | MEDLINE | ID: mdl-38438881

ABSTRACT

BACKGROUND: Bisulfite sequencing detects and quantifies DNA methylation patterns, contributing to our understanding of gene expression regulation, genome stability maintenance, conservation of epigenetic mechanisms across divergent taxa, epigenetic inheritance and, eventually, phenotypic variation. Graphical representation of methylation data is crucial in exploring epigenetic regulation on a genome-wide scale in both plants and animals. This is especially relevant for non-model organisms with poorly annotated genomes and/or organisms whose genome sequences are not yet assembled at the chromosome level. Although bisulfite sequencing has been a technology of choice for profiling DNA methylation for many years, there are surprisingly few lightweight and robust standalone tools available for efficient graphical analysis of data in non-model systems. This significantly limits evolutionary studies and agrigenomics research. BSXplorer is a tool specifically developed to fill this gap and assist researchers in explorative data analysis and in visualising and interpreting bisulfite sequencing data more easily. RESULTS: BSXplorer provides in-depth graphical analysis of sequencing data encompassing (a) profiling of methylation levels in metagenes or in user-defined regions using line plots and heatmaps, together with generation of summary statistics charts, (b) comparative analyses of methylation patterns across experimental samples, methylation contexts and species, and (c) identification of modules sharing similar methylation signatures at functional genomic elements. The tool processes methylation data quickly and offers API and CLI capabilities, along with the ability to create high-quality figures suitable for publication. CONCLUSIONS: BSXplorer facilitates efficient methylation data mining, contrasting and visualization, making it an easy-to-use package that is highly useful for epigenetic research.


Subjects
DNA Methylation, Genetic Epigenesis, Sulfites, Animals, DNA Sequence Analysis, Genomics
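
As an illustration of the metagene profiling described above, the following minimal sketch (not BSXplorer code; the column names and toy data are assumptions) shows how per-cytosine methylation calls can be binned along scaled gene coordinates into a metagene profile:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical per-cytosine calls: a gene identifier, a position scaled to 0..1
# along the gene body, and a methylation level (methylated reads / total reads).
calls = pd.DataFrame({
    "gene": np.repeat([f"g{i}" for i in range(100)], 50),
    "rel_pos": np.tile(np.linspace(0, 1, 50), 100),
    "meth": rng.beta(2, 5, 100 * 50),
})

N_BINS = 20
calls["bin"] = np.minimum((calls["rel_pos"] * N_BINS).astype(int), N_BINS - 1)

# Metagene profile: mean methylation per bin, first within genes, then across genes.
per_gene = calls.groupby(["gene", "bin"])["meth"].mean().reset_index()
profile = per_gene.groupby("bin")["meth"].mean()
print(profile.round(3))
```
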
2.
Am J Hum Genet ; 108(12): 2319-2335, 2021 12 02.
Article in English | MEDLINE | ID: mdl-34861175

ABSTRACT

Modern population-scale biobanks contain simultaneous measurements of many phenotypes, providing unprecedented opportunity to study the relationship between biomarkers and disease. However, inferring causal effects from observational data is notoriously challenging. Mendelian randomization (MR) has recently received increased attention as a class of methods for estimating causal effects using genetic associations. However, standard methods result in pervasive false positives when two traits share a heritable, unobserved common cause. This is the problem of correlated pleiotropy. Here, we introduce a flexible framework for simulating traits with a common genetic confounder that generalizes recently proposed models, as well as a simple approach we call Welch-weighted Egger regression (WWER) for estimating causal effects. We show in comprehensive simulations that our method substantially reduces false positives due to correlated pleiotropy while being fast enough to apply to hundreds of phenotypes. We apply our method first to a subset of the UK Biobank consisting of blood traits and inflammatory disease, and then to a broader set of 411 heritable phenotypes. We detect many effects with strong literature support, as well as numerous behavioral effects that appear to stem from physician advice given to people at high risk for disease. We conclude that WWER is a powerful tool for exploratory data analysis in ever-growing databases of genotypes and phenotypes.


Subjects
False-Positive Reactions, Genetic Pleiotropy, Mendelian Randomization Analysis/methods, Genetic Models, Regression Analysis, Computer Simulation, Female, Humans, Inflammation/blood, Inflammation/genetics, Male, Mendelian Randomization Analysis/standards, Phenotype, Single Nucleotide Polymorphism
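
For context on the regression step underlying methods of this kind, here is a minimal sketch of a standard weighted Egger regression on simulated per-SNP summary statistics; the WWER-specific Welch-based weighting from the paper is not reproduced, and all data are synthetic:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Hypothetical per-SNP summary statistics (effect sizes and standard errors).
n_snps = 200
beta_exposure = rng.normal(0.05, 0.02, n_snps)
beta_outcome = 0.3 * beta_exposure + rng.normal(0, 0.01, n_snps)  # true causal effect 0.3
se_outcome = np.full(n_snps, 0.01)

# Egger regression: regress outcome effects on exposure effects with an intercept,
# weighting by inverse variance. WWER swaps in Welch-test-based weights instead.
X = sm.add_constant(beta_exposure)
fit = sm.WLS(beta_outcome, X, weights=1.0 / se_outcome**2).fit()
print("intercept (directional pleiotropy):", round(fit.params[0], 4))
print("causal effect estimate:", round(fit.params[1], 4))
```
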
3.
Environ Res ; 241: 117581, 2024 Jan 15.
Article in English | MEDLINE | ID: mdl-37967705

ABSTRACT

Plastic consumption and its end-of-life management pose a significant environmental footprint and are energy intensive. Waste-to-resources and prevention strategies have been promoted widely in Europe as countermeasures; however, their effectiveness remains uncertain. This study aims to uncover the environmental footprint patterns of the plastics value chain in the European Union Member States (EU-27) through exploratory data analysis with dimension reduction and grouping. Nine variables are assessed, ranging from socioeconomic and demographic indicators to environmental impacts. Three clusters are formed according to the similarity of these nine characteristics, with environmental impacts identified as the primary variable determining the clusters. Most countries belong to Cluster 0, consisting of 17 countries in 2014 and 18 countries in 2019. This cluster has a relatively low global warming potential (GWP), with an average value of 2.64 t CO2eq/cap in 2014 and 4.01 t CO2eq/cap in 2019. Among all the assessed countries, Denmark showed a significant change when assessed within the EU-27, moving from Cluster 1 (high GWP) in 2014 to Cluster 0 (low GWP) in 2019. The analysis of plastic packaging waste statistics in 2019 (data released in 2022) shows that, despite an increase in the recovery rate within the EU-27, the GWP has not decreased, suggesting a rebound effect. The GWP tends to increase in correlation with the higher plastic waste amount. In contrast, other environmental impacts, such as eutrophication, abiotic depletion and acidification potential, are mitigated effectively via recovery, suppressing the adverse effects of increased plastic waste generation. The five-year-interval data analysis identified distinct clusters within a set of patterns, categorising them based on their similarities. The categorisation and managerial insights serve as a foundation for devising a focused mitigation strategy.


Subjects
Waste Management, Waste Management/methods, Europe (Continent), Product Packaging, Environment, Global Warming, Plastics, Recycling
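
The dimension-reduction-and-grouping workflow described above can be sketched generically as standardisation, PCA, and k-means clustering; the data below are random placeholders, not the study's country indicators:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Hypothetical country-level table: 27 member states x 9 indicators
# (socioeconomic, demographic and environmental-impact variables).
X = rng.normal(size=(27, 9))

X_std = StandardScaler().fit_transform(X)          # put all indicators on one scale
scores = PCA(n_components=2).fit_transform(X_std)  # dimension reduction for plotting
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_std)

for cluster in range(3):
    print(f"Cluster {cluster}: {np.sum(labels == cluster)} countries")
```
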
4.
Article in English | MEDLINE | ID: mdl-39082872

ABSTRACT

Explorative data analysis (EDA) is a critical step in scientific projects, aiming to uncover valuable insights and patterns within data. Traditionally, EDA involves manual inspection, visualization, and various statistical methods. The advent of artificial intelligence (AI) and machine learning (ML) has the potential to improve EDA, offering more sophisticated approaches that enhance its efficacy. This review explores how AI and ML algorithms can improve feature engineering and selection during EDA, leading to more robust predictive models and data-driven decisions. Tree-based models, regularized regression, and clustering algorithms were identified as key techniques. These methods automate feature importance ranking, handle complex interactions, perform feature selection, reveal hidden groupings, and detect anomalies. Real-world applications include risk prediction in total hip arthroplasty and subgroup identification in scoliosis patients. Recent advances in explainable AI and EDA automation show potential for further improvement. The integration of AI and ML into EDA accelerates tasks and uncovers sophisticated insights. However, effective utilization requires a deep understanding of the algorithms, their assumptions, and limitations, along with domain knowledge for proper interpretation. As data continues to grow, AI will play an increasingly pivotal role in EDA when combined with human expertise, driving more informed, data-driven decision-making across various scientific domains. Level of Evidence: Level V - Expert opinion.
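
One of the uses highlighted in this review, automated feature importance ranking with tree-based models, can be illustrated with a minimal sketch on synthetic data (the dataset and feature names are assumptions, not the review's clinical examples):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic clinical-style dataset: 500 patients, 20 candidate features,
# only 5 of which carry signal.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

# Tree-based models rank feature importance automatically, supporting EDA
# and feature selection before building a predictive model.
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
ranking = np.argsort(model.feature_importances_)[::-1]
for idx in ranking[:5]:
    print(f"feature {idx}: importance {model.feature_importances_[idx]:.3f}")
```
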

5.
J Proteome Res ; 2023 Dec 12.
Article in English | MEDLINE | ID: mdl-38085827

ABSTRACT

PMart is a web-based tool for reproducible quality control, exploratory data analysis, statistical analysis, and interactive visualization of 'omics data, based on the functionality of the pmartR R package. The newly improved user interface supports more 'omics data types, additional statistical capabilities, and enhanced options for creating downloadable graphics. PMart supports the analysis of label-free and isobaric-labeled (e.g., TMT, iTRAQ) proteomics, nuclear magnetic resonance (NMR) and mass-spectrometry (MS)-based metabolomics, MS-based lipidomics, and ribonucleic acid sequencing (RNA-seq) transcriptomics data. At the end of a PMart session, a report is available that summarizes the processing steps performed and includes the pmartR R package functions used to execute the data processing. In addition, built-in safeguards in the backend code prevent users from utilizing methods that are inappropriate based on omics data type. PMart is a user-friendly interface for conducting exploratory data analysis and statistical comparisons of omics data without programming.

6.
Genet Epidemiol ; 46(7): 390-394, 2022 10.
Article in English | MEDLINE | ID: mdl-35642557

ABSTRACT

Post hoc power estimates are often requested by reviewers and/or performed by researchers after a study has been conducted. The purpose of this commentary is to provide a heuristic explanation of why post hoc power should not be used. To illustrate our point, we provide a detailed simulation study of two essentially identical research experiments hypothetically conducted in parallel at two separate universities. The simulation demonstrates that post hoc power calculations are misleading and simply not informative for data interpretation. As such, we encourage authors and peer-reviewers to avoid using or requesting post hoc power calculations.


Subjects
Genetic Models, Computer Simulation, Humans
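
The "two parallel experiments" argument can be reproduced in miniature: post hoc (observed) power is computed by plugging the observed effect size back into the power formula, so two identical studies of the same true effect can report very different values. This sketch uses a generic two-sample t-test setup, not the paper's simulation design:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def posthoc_power(observed_d, n, alpha=0.05):
    """'Observed power': power computed as if the observed effect size were the truth."""
    df = 2 * n - 2
    crit = stats.t.ppf(1 - alpha / 2, df)
    nc = observed_d * np.sqrt(n / 2)   # noncentrality implied by the observed d
    return (1 - stats.nct.cdf(crit, df, nc)) + stats.nct.cdf(-crit, df, nc)

# Two hypothetical labs run identical two-group experiments (true d = 0.3, n = 30/group).
for lab in ("Lab A", "Lab B"):
    a = rng.normal(0.0, 1, 30)
    b = rng.normal(0.3, 1, 30)
    d = (b.mean() - a.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    p = stats.ttest_ind(a, b).pvalue
    print(f"{lab}: p = {p:.3f}, post hoc power = {posthoc_power(d, 30):.2f}")
# The same true effect yields different 'post hoc power' values, because the estimate
# merely re-expresses the observed p-value rather than adding information.
```
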
7.
BMC Med ; 21(1): 182, 2023 05 15.
Article in English | MEDLINE | ID: mdl-37189125

ABSTRACT

BACKGROUND: In high-dimensional data (HDD) settings, the number of variables associated with each observation is very large. Prominent examples of HDD in biomedical research include omics data with a large number of variables such as many measurements across the genome, proteome, or metabolome, as well as electronic health records data that have large numbers of variables recorded for each patient. The statistical analysis of such data requires knowledge and experience, sometimes of complex methods adapted to the respective research questions. METHODS: Advances in statistical methodology and machine learning methods offer new opportunities for innovative analyses of HDD, but at the same time require a deeper understanding of some fundamental statistical concepts. Topic group TG9 "High-dimensional data" of the STRATOS (STRengthening Analytical Thinking for Observational Studies) initiative provides guidance for the analysis of observational studies, addressing particular statistical challenges and opportunities for the analysis of studies involving HDD. In this overview, we discuss key aspects of HDD analysis to provide a gentle introduction for non-statisticians and for classically trained statisticians with little experience specific to HDD. RESULTS: The paper is organized with respect to subtopics that are most relevant for the analysis of HDD, in particular initial data analysis, exploratory data analysis, multiple testing, and prediction. For each subtopic, main analytical goals in HDD settings are outlined. For each of these goals, basic explanations for some commonly used analysis methods are provided. Situations are identified where traditional statistical methods cannot, or should not, be used in the HDD setting, or where adequate analytic tools are still lacking. Many key references are provided. CONCLUSIONS: This review aims to provide a solid statistical foundation for researchers, including statisticians and non-statisticians, who are new to research with HDD or simply want to better evaluate and understand the results of HDD analyses.


Subjects
Biomedical Research, Goals, Humans, Research Design
8.
Adv Exp Med Biol ; 1423: 251-256, 2023.
Article in English | MEDLINE | ID: mdl-37525052

ABSTRACT

Developments in the field of biomedical technology have brought significant progress in the diagnosis and prediction of many complex diseases. Part of this development is single-cell RNA sequencing analysis, which allows the study of a complex disease in great depth at the cellular level. Such analyses can decipher the mechanisms that cause complex diseases, such as Alzheimer's disease (AD). However, the increasing depth of single-cell RNA sequencing data collection brings, in addition to greater challenges, a large amount of information that needs careful analysis. In this direction, we examine single-cell RNA sequencing data through the development of an exploratory data analysis methodology. For this purpose, a combination of various tools is presented for their effective and efficient processing. At the same time, reference is made to the relevant biological concepts, the goals and challenges of the studies, and the workflows of sequencing, preprocessing, and analysis of the data. Our framework is applied to Alzheimer's disease data, providing evidence that such data are quite complex and that an appropriate preprocessing step can boost the machine learning processes for identifying AD signatures.


Subjects
Alzheimer Disease, Humans, Alzheimer Disease/diagnosis, Alzheimer Disease/genetics, Single-Cell Gene Expression Analysis, Sequence Analysis
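
A typical single-cell preprocessing and exploratory chain (filtering, normalisation, variable-gene selection, dimensionality reduction) might look like the sketch below; the chapter does not prescribe specific tools, so scanpy is an assumption here and the count matrix is a random placeholder:

```python
import numpy as np
import anndata as ad
import scanpy as sc

# Toy count matrix standing in for a single-cell RNA-seq dataset (cells x genes).
rng = np.random.default_rng(0)
adata = ad.AnnData(rng.poisson(1.0, size=(500, 2000)).astype(float))

# Filter low-quality cells/genes, normalise depth, log-transform,
# keep the most variable genes, then reduce dimensionality with PCA.
sc.pp.filter_cells(adata, min_genes=50)
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=500)
sc.pp.pca(adata, n_comps=30)
print(adata.obsm["X_pca"].shape)   # cells x components, ready for clustering/UMAP and ML
```
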
9.
Sensors (Basel) ; 23(8)2023 Apr 20.
Article in English | MEDLINE | ID: mdl-37112489

ABSTRACT

Both paper-based and computerized exams have a high level of cheating. It is, therefore, desirable to be able to detect cheating accurately. Keeping the academic integrity of student evaluations intact is one of the biggest issues in online education. There is a substantial possibility of academic dishonesty during final exams since teachers are not directly monitoring students. We suggest a novel method in this study for identifying possible exam-cheating incidents using Machine Learning (ML) approaches. The 7WiseUp behavior dataset compiles data from surveys, sensor data, and institutional records to improve student well-being and academic performance. It offers information on academic achievement, student attendance, and behavior in general. The dataset is designed for use in research on student behavior and performance, to build models for predicting academic accomplishment, identifying at-risk students, and detecting problematic behavior. Our model surpassed all three prior reference efforts with an accuracy of 90%, using a long short-term memory (LSTM) architecture with a dropout layer, dense layers, and the Adam optimizer. The increased accuracy is credited to implementing a more intricate and optimized architecture and hyperparameters. In addition, the increased accuracy could also reflect how we cleaned and prepared our data. More investigation and analysis are required to determine the precise elements that led to our model's superior performance.
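
The architecture named above (LSTM followed by dropout and dense layers, trained with Adam) can be sketched in a few lines of Keras; the input shape, feature set, and labels below are hypothetical stand-ins, not the 7WiseUp data:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical behaviour sequences: 200 students x 50 time steps x 8 features,
# with a binary label for suspected cheating.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50, 8)).astype("float32")
y = rng.integers(0, 2, 200).astype("float32")

model = keras.Sequential([
    keras.Input(shape=(50, 8)),
    layers.LSTM(32),            # sequence encoder
    layers.Dropout(0.3),        # dropout layer, as described in the abstract
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=3, batch_size=32, verbose=0)
print(model.evaluate(X, y, verbose=0))
```
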

10.
Sensors (Basel) ; 23(21)2023 Nov 04.
Article in English | MEDLINE | ID: mdl-37960666

ABSTRACT

In this paper, we propose a data classification and analysis method to estimate fire risk using facility data of thermal power plants. To estimate fire risk based on facility data, we divided facilities into three states (Steady, Transient, and Anomaly), categorized by their purposes and operational conditions. This method is designed to satisfy three requirements of fire protection systems for thermal power plants. For example, areas with fire risk must be identified, and fire risks should be classified and integrated into existing systems. We classified thermal power plants into turbine, boiler, and indoor coal shed zones. Each zone was subdivided into small pieces of equipment. The turbine, generator, oil-related equipment, hydrogen (H2), and boiler feed pump (BFP) were selected for the turbine zone, while the pulverizer and ignition oil were chosen for the boiler zone. We selected fire-related tags from Supervisory Control and Data Acquisition (SCADA) data and acquired sample data during a specific period for two thermal power plants, based on inspection of fire and explosion scenarios in thermal power plants over many years. We focused on crucial fire cases such as pool fires, 3D fires, and jet fires and organized three fire hazard levels for each zone. Experimental analysis was conducted on these data sets using the proposed method for 500 MW and 100 MW thermal power plants. The data classification and analysis methods presented in this paper can provide indirect experience for data analysts who do not have domain knowledge about power plant fires and can also offer good inspiration for data analysts who need to understand power plant facilities.

11.
Int J Mol Sci ; 24(14)2023 Jul 14.
Article in English | MEDLINE | ID: mdl-37511224

ABSTRACT

Multivariate data analysis is of extraordinary importance in catalysis research. The aim of the MIRA21 (MIskolc RAnking 21) model is to characterize heterogeneous catalysts with bias-free quantifiable data from 15 different variables, to standardize catalyst characterization, and to provide an easy tool to compare, rank, and classify catalysts. The present work introduces and mathematically validates the MIRA21 model by identifying fundamentals affecting catalyst comparison and provides support for catalyst design. Literature data on 2,4-dinitrotoluene hydrogenation catalysts for toluene diamine synthesis were analyzed using the descriptor system of MIRA21. In this study, exploratory data analysis (EDA) has been used to understand the relationships between individual variables such as catalyst performance, reaction conditions, catalyst compositions, and sustainability parameters. The results will be applicable to catalyst design and will also make the use of machine learning tools possible.


Subjects
Hydrogenation, Catalysis
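
An exploratory step of the kind described, relating catalyst descriptors to one another and deriving a simple rank score, could look like this sketch; the descriptor columns, toy values, and the percentile-rank scoring are illustrative assumptions and much simpler than the actual MIRA21 scoring:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical catalyst table: rows are literature catalysts, columns stand in for
# a few descriptor types (performance, conditions, composition, sustainability).
df = pd.DataFrame(rng.normal(size=(60, 5)),
                  columns=["conversion", "selectivity", "temperature",
                           "metal_loading", "E_factor"])

# Pairwise correlations between descriptors, then a naive percentile-rank score.
print(df.corr().round(2))
score = df.rank(pct=True).mean(axis=1)
print(score.sort_values(ascending=False).head())
```
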
12.
Resour Conserv Recycl ; 196: 1-13, 2023 Sep.
Article in English | MEDLINE | ID: mdl-37476199

ABSTRACT

Chemical flow analysis (CFA) can be used for collecting life-cycle inventory (LCI), estimating environmental releases, and identifying potential exposure scenarios for chemicals of concern at the end-of-life (EoL) stage. Nonetheless, the demand for comprehensive data and the epistemic uncertainties about the pathway taken by the chemical flows make CFA, LCI, and exposure assessment time-consuming and challenging tasks. Due to the continuous growth of computer power and the appearance of more robust algorithms, data-driven modelling represents an attractive tool for streamlining these tasks. However, a data ingestion pipeline is required to deploy and serve data-driven models in the real world. Hence, this work moves forward by contributing a chemical-centric and data-centric approach to extract, transform, and load comprehensive data for CFA at the EoL, integrating cross-year and cross-country data and their provenance as part of the data lifecycle. The framework is scalable and adaptable to production-level machine learning operations. The framework can supply data at an annual rate, making it possible to deal with changes in the statistical distributions of model predictors like transferred amount and target variables (e.g., EoL activity identification) to avoid potential data-driven model performance decay over time. For instance, it can detect that recycling transfers of 643 chemicals over the reporting years (1988 to 2020) amount to 29.87%, 17.79%, and 20.56% for Canada, Australia, and the U.S., respectively. The developed approach also enables research advancements on data-driven modelling to easily connect with other data sources for economic information on industry sectors, the economic value of chemicals, and the environmental regulatory implications that may affect the occurrence of an EoL transfer class or activity like recycling of a chemical over years and countries. Finally, stakeholders gain more context about environmental regulation stringency and economic affairs that could affect environmental decision-making and EoL chemical exposure predictions.
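
The transform/aggregate stage of such a pipeline can be illustrated with a tiny pandas example; the record layout, country codes, and figures below are invented placeholders, not the framework's actual schema or results:

```python
import pandas as pd

# Hypothetical transfer records: one row per chemical transfer, with country,
# year, an end-of-life activity label, and a transferred amount.
records = pd.DataFrame({
    "country": ["CA", "CA", "AU", "US", "US", "US"],
    "year": [2019, 2019, 2019, 2019, 2020, 2020],
    "chemical": ["toluene", "styrene", "toluene", "lead", "lead", "styrene"],
    "eol_activity": ["recycling", "landfill", "recycling", "recycling", "landfill", "recycling"],
    "amount_kg": [120.0, 40.0, 15.0, 300.0, 90.0, 60.0],
})

# Share of transferred mass going to recycling, per country.
share = (records.assign(recycled=records["eol_activity"].eq("recycling") * records["amount_kg"])
                .groupby("country")[["recycled", "amount_kg"]].sum())
share["recycling_share_%"] = 100 * share["recycled"] / share["amount_kg"]
print(share.round(2))
```
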

13.
Int J Environ Sci Technol (Tehran) ; 20(5): 5333-5348, 2023.
Article in English | MEDLINE | ID: mdl-35603096

ABSTRACT

The survival of mankind cannot be imagined without air. Consistent developments in almost all realms of modern human society have adversely affected the quality of the air. Daily industrial, transport, and domestic activities release hazardous pollutants into our environment. Monitoring and predicting air quality have become essentially important in this era, especially in developing countries like India. In contrast to traditional methods, prediction technologies based on machine learning techniques have proved to be among the most efficient tools for studying such modern hazards. The present work investigates six years of air pollution data from 23 Indian cities for air quality analysis and prediction. The dataset is preprocessed and key features are selected through correlation analysis. An exploratory data analysis is performed to develop insights into hidden patterns in the dataset, and the pollutants directly affecting the air quality index are identified. A significant fall in almost all pollutants is observed in the pandemic year, 2020. The data imbalance problem is solved with a resampling technique, and five machine learning models are employed to predict air quality. The results of these models are compared using standard metrics. The Gaussian Naive Bayes model achieves the highest accuracy, while the Support Vector Machine model exhibits the lowest accuracy. The performances of these models are evaluated and compared through established performance parameters. The XGBoost model performed best overall and showed the highest linearity between the predicted and actual data.
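
Comparing several classifiers on the same split with a common metric, as the study does, follows a standard pattern; this sketch uses synthetic features and labels in place of the pollutant data and includes only three of the five models:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# Synthetic stand-in for pollutant features and air-quality categories.
X, y = make_classification(n_samples=2000, n_features=8, n_informative=5,
                           n_classes=3, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "GaussianNB": GaussianNB(),
    "SVM": SVC(),
    "XGBoost": XGBClassifier(n_estimators=100),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(f"{name}: accuracy = {accuracy_score(y_te, model.predict(X_te)):.3f}")
```
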

14.
BMC Bioinformatics ; 23(1): 496, 2022 Nov 18.
Article in English | MEDLINE | ID: mdl-36401182

ABSTRACT

Classification of different cancer types is an essential step in designing a decision support model for early cancer predictions. Using various machine learning (ML) techniques with ensemble learning is one such method used for classifications. In the present study, various ML algorithms were explored on twenty exome datasets belonging to 5 cancer types. Initially, a data clean-up was carried out on 4181 cancer variants with 88 features, and a derivative dataset was obtained using natural language processing and probabilistic distribution. An exploratory data analysis using principal component analysis was then performed in one and two dimensions to reduce the high dimensionality of the data. To significantly reduce the imbalance in the derivative dataset, oversampling was carried out using SMOTE. Further, classification algorithms such as K-nearest neighbour and support vector machine were used initially on the oversampled dataset. A 4-layer artificial neural network model with 1D batch normalization was also designed to improve the model accuracy. Ensemble ML techniques such as bagging, with KNN, SVM, and MLPs as base classifiers, were then applied to improve the weighted average performance metrics of the model. However, due to the small sample size, model improvement was challenging. Therefore, a novel method to augment the sample size using a generative adversarial network (GAN) and a triplet-based variational autoencoder (TVAE) was employed to reconstruct the features and labels and generate new data. On initial scrutiny, KNN showed a weighted average of 0.74 and SVM 0.76. Oversampling ensured that the accuracy of the derivative dataset improved significantly, and the ensemble classifier raised the accuracy to 82.91% when the data were divided in a 70:15:15 ratio (training, test and holdout datasets). When GAN and TVAE were used to increase the sample size, the overall evaluation metric reached 0.92, compared with 0.66 for the comparison model. Therefore, the present study designed an effective model for classifying cancers which, when applied to real-world samples, will play a major role in early cancer diagnosis.


Subjects
Exome, Neoplasms, Humans, Exome/genetics, Early Detection of Cancer, Machine Learning, Algorithms, Neoplasms/diagnosis, Neoplasms/genetics
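
The oversample-then-ensemble step described above can be sketched with SMOTE plus a bagging classifier over KNN base learners; the dataset is synthetic and the class weights, sizes, and scores are placeholders (requires imbalanced-learn and scikit-learn >= 1.2 for the `estimator` argument):

```python
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from imblearn.over_sampling import SMOTE

# Imbalanced toy stand-in for a variant-derived feature table with 5 cancer types.
X, y = make_classification(n_samples=1500, n_features=30, n_informative=10,
                           n_classes=5, n_clusters_per_class=1,
                           weights=[0.5, 0.2, 0.15, 0.1, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Oversample minority classes, then train a bagging ensemble with KNN base learners.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
print("class counts after SMOTE:", Counter(y_res))
ensemble = BaggingClassifier(estimator=KNeighborsClassifier(), n_estimators=25,
                             random_state=0).fit(X_res, y_res)
print("weighted F1:", round(f1_score(y_te, ensemble.predict(X_te), average="weighted"), 3))
```
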
15.
Cytometry A ; 101(2): 177-184, 2022 02.
Article in English | MEDLINE | ID: mdl-34559446

ABSTRACT

We introduce a new cell population score called SpecEnr (specific enrichment) and describe a method that discovers robust and accurate candidate biomarkers from flow cytometry data. Our approach identifies a new class of candidate biomarkers we define as driver cell populations, whose abundance is associated with a sample class (e.g., disease), but not as a result of a change in a related population. We show that the driver cell populations we find are also easily interpretable using a lattice-based visualization tool. Our method is implemented in the R package flowGraph, freely available on GitHub (github.com/aya49/flowGraph) and on BioConductor.


Subjects
Software, Biomarkers, Flow Cytometry/methods
16.
FASEB J ; 35(12): e22024, 2021 12.
Article in English | MEDLINE | ID: mdl-34751984

ABSTRACT

Alterations in mitochondrial dynamics, including their intracellular trafficking, are common early manifestations of neuronal degeneration. However, current methodologies used to study mitochondrial trafficking events rely on parameters that are primarily altered in later stages of neurodegeneration. Our objective was to establish a reliable applied statistical analysis to detect early alterations in neuronal mitochondrial trafficking. We propose a novel quantitative analysis of mitochondrial trajectories based on innovative movement descriptors, including straightness, efficiency, anisotropy, and kurtosis. We evaluated time- and dose-dependent alterations in trajectory descriptors using biological data from differentiated SH-SY5Y cells treated with the mitochondrial toxicants 6-hydroxydopamine and rotenone. The movement of MitoTracker Red CMXRos-labelled mitochondria was analyzed by total internal reflection fluorescence microscopy, followed by computational modelling to describe the process. Based on the aforementioned trajectory descriptors, this innovative analysis of mitochondrial trajectories provides insights into mitochondrial movement characteristics and can be a consistent and sensitive method to detect alterations in mitochondrial trafficking occurring at the earliest time points of neurodegeneration.


Subjects
Mitochondria/pathology, Mitochondrial Dynamics, Neuroblastoma/pathology, Neurons/pathology, Oxidopamine/adverse effects, Rotenone/adverse effects, Adrenergic Agents/adverse effects, Cell Differentiation, Humans, Mitochondria/drug effects, Neuroblastoma/chemically induced, Neurons/drug effects, Uncoupling Agents/adverse effects
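
Trajectory descriptors of the kind named above can be computed from tracked coordinates with a few lines of numpy; the definitions below (net displacement over path length for straightness, a displacement-to-step-energy ratio for efficiency, Pearson kurtosis of step lengths) are common generic choices and may differ from the paper's exact formulations, and the trajectory is a random walk placeholder:

```python
import numpy as np

def trajectory_descriptors(xy):
    """Simple movement descriptors for one 2-D trajectory (frames x 2 array)."""
    steps = np.diff(xy, axis=0)
    step_len = np.linalg.norm(steps, axis=1)
    path_len = step_len.sum()
    net_disp = np.linalg.norm(xy[-1] - xy[0])
    straightness = net_disp / path_len if path_len > 0 else 0.0
    efficiency = net_disp**2 / (len(step_len) * np.sum(step_len**2))
    centered = step_len - step_len.mean()
    kurt = np.mean(centered**4) / (np.mean(centered**2) ** 2)
    return {"straightness": straightness, "efficiency": efficiency, "kurtosis": kurt}

# Toy trajectory standing in for one tracked mitochondrion.
rng = np.random.default_rng(0)
track = np.cumsum(rng.normal(0, 0.1, size=(100, 2)), axis=0)
print({k: round(v, 3) for k, v in trajectory_descriptors(track).items()})
```
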
17.
Popul Stud (Camb) ; 76(1): 157-168, 2022 03.
Article in English | MEDLINE | ID: mdl-35184683

ABSTRACT

This study aimed to explore the phenomenon of birthdate misregistration, using birth data from 45,226,875 Polish citizens, that is, all those born 1900-2000 and registered in Poland's Universal Electronic System for Registration of the Population (PESEL). I transformed the data into a daily series of births, detrended by dividing each value by the daily average for the relevant year. Next, I selected the dates with the highest deviations based on the coefficients of the linear regression model with dummy variables. Finally, I estimated the size of the phenomenon in subsequent years by comparing the numbers of births on selected dates to their expected values. This paper is the first to document the specificity, scale, duration, and probable causes of birthdate misregistration in Poland in the twentieth century.


Subjects
Family, Humans, Poland
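
The detrending step described (dividing each day's births by that year's daily average, then flagging calendar dates with unusual deviations) can be mimicked on toy data; the dates, counts, and the "heaping" on 1 January below are fabricated for illustration only:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy daily birth counts for a few years (the real analysis covers 1900-2000 PESEL records).
dates = pd.date_range("1950-01-01", "1954-12-31", freq="D")
births = pd.Series(rng.poisson(1000, len(dates)), index=dates)
births[births.index.strftime("%m-%d") == "01-01"] *= 3   # simulate heaping on 1 January

# Detrend: divide each day by the mean of its year, then rank calendar dates by deviation.
detrended = births / births.groupby(births.index.year).transform("mean")
by_day = detrended.groupby(detrended.index.strftime("%m-%d")).mean()
print(by_day.sort_values(ascending=False).head(3))   # suspicious dates rise to the top
```
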
18.
Int J Mol Sci ; 23(6)2022 Mar 10.
Article in English | MEDLINE | ID: mdl-35328430

ABSTRACT

With the increase in life expectancy and consequent aging of the world's population, the prevalence of many neurodegenerative diseases is increasing, without concomitant improvement in diagnostics and therapeutics. These diseases share neuropathological hallmarks, including mitochondrial dysfunction. In fact, as mitochondrial alterations appear prior to neuronal cell death at an early phase of a disease's onset, the study and modulation of mitochondrial alterations have emerged as promising strategies to predict and prevent neurotoxicity and neuronal cell death before the onset of cell viability alterations. In this work, differentiated SH-SY5Y cells were treated with the mitochondrial-targeted neurotoxicants 6-hydroxydopamine and rotenone. These compounds were used at different concentrations and for different time points to understand the similarities and differences in their mechanisms of action. To accomplish this, data on mitochondrial parameters were acquired and analyzed using unsupervised (hierarchical clustering) and supervised (decision tree) machine learning methods. Both biochemical and computational analyses resulted in an evident distinction between the neurotoxic effects of 6-hydroxydopamine and rotenone, specifically for the highest concentrations of both compounds.


Subjects
Neuroprotective Agents, Neurotoxicity Syndromes, Apoptosis, Cell Death, Tumor Cell Line, Cell Survival, Humans, Neuroprotective Agents/pharmacology, Neurotoxicity Syndromes/etiology, Oxidopamine/toxicity, Rotenone/toxicity
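
The combination of an unsupervised view (hierarchical clustering) and a supervised view (a decision tree) on the same feature table can be sketched generically; the feature matrix and group separation below are synthetic, not the study's mitochondrial measurements:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)

# Hypothetical mitochondrial parameters for cells treated with two toxicants
# at a high concentration (30 observations per treatment, 4 parameters).
X = np.vstack([rng.normal(0.0, 1, (30, 4)), rng.normal(1.5, 1, (30, 4))])
treatment = np.array(["6-OHDA"] * 30 + ["rotenone"] * 30)

# Unsupervised view: hierarchical (Ward) clustering of the parameter profiles.
clusters = fcluster(linkage(X, method="ward"), t=2, criterion="maxclust")
print("cluster sizes:", np.bincount(clusters)[1:])

# Supervised view: a shallow decision tree separating the two treatments.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, treatment)
print(export_text(tree, feature_names=[f"param_{i}" for i in range(4)]))
```
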
19.
Entropy (Basel) ; 24(10)2022 Sep 28.
Article in English | MEDLINE | ID: mdl-37420402

ABSTRACT

We reformulate and reframe a series of increasingly complex parametric statistical topics into a framework of response-vs.-covariate (Re-Co) dynamics that is described without any explicit functional structures. We then resolve these topics' data analysis tasks by discovering major factors underlying such Re-Co dynamics, making use only of the data's categorical nature. The major factor selection protocol at the heart of the Categorical Exploratory Data Analysis (CEDA) paradigm is illustrated and carried out by employing Shannon's conditional entropy (CE) and mutual information (I[Re;Co]) as the two key information-theoretical measurements. Through the process of evaluating these two entropy-based measurements and resolving statistical tasks, we acquire several computational guidelines for carrying out the major factor selection protocol in a do-and-learn fashion. Specifically, practical guidelines are established for evaluating CE and I[Re;Co] in accordance with the criterion called [C1:confirmable]. Following the [C1:confirmable] criterion, we make no attempt to acquire consistent estimates of these theoretical information measurements. All evaluations are carried out on a contingency table platform, upon which the practical guidelines also provide ways of lessening the effects of the curse of dimensionality. We explicitly carry out six examples of Re-Co dynamics, within each of which several widely extended scenarios are also explored and discussed.
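
The two measurements named above can be evaluated directly from a contingency table; this sketch uses a made-up 3x3 table and standard definitions (H[Re|Co] = H[Re,Co] - H[Co], I[Re;Co] = H[Re] - H[Re|Co]) rather than the paper's worked examples:

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Toy contingency table: rows = categories of the response (Re),
# columns = categories of a candidate covariate (Co).
table = np.array([[30, 5, 5],
                  [4, 28, 8],
                  [6, 7, 27]], dtype=float)
joint = table / table.sum()
p_re, p_co = joint.sum(axis=1), joint.sum(axis=0)

H_re = entropy(p_re)
H_re_given_co = entropy(joint.ravel()) - entropy(p_co)   # Shannon conditional entropy CE
mutual_info = H_re - H_re_given_co                        # I[Re;Co]
print(f"H[Re|Co] = {H_re_given_co:.3f} bits, I[Re;Co] = {mutual_info:.3f} bits")
```
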

20.
Multivariate Behav Res ; 56(2): 314-328, 2021.
Article in English | MEDLINE | ID: mdl-30463456

ABSTRACT

Steinley, Hoffman, Brusco, and Sher (2017) proposed a new method for evaluating the performance of psychological network models: fixed-margin sampling. The authors investigated LASSO regularized Ising models (eLasso) by generating random datasets with the same margins as the original binary dataset, and concluded that many estimated eLasso parameters are not distinguishable from those that would be expected if the data were generated by chance. We argue that fixed-margin sampling cannot be used for this purpose, as it generates data under a particular null-hypothesis: a unidimensional factor model with interchangeable indicators (i.e., the Rasch model). We show this by discussing relevant psychometric literature and by performing simulation studies. Results indicate that while eLasso correctly estimated network models and estimated almost no edges due to chance, fixed-margin sampling performed poorly in classifying true effects as "interesting" (Steinley et al. 2017, p. 1004). Further simulation studies indicate that fixed-margin sampling offers a powerful method for highlighting local misfit from the Rasch model, but performs only moderately in identifying global departures from the Rasch model. We conclude that fixed-margin sampling is not up to the task of assessing if results from estimated Ising models or other multivariate psychometric models are due to chance.


Subjects
Statistical Models, Research Design, Computer Simulation, Probability, Psychometrics
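
Fixed-margin sampling of a binary matrix, the procedure under debate above, is commonly implemented with checkerboard swaps that preserve all row and column sums; this minimal sketch is a generic implementation on toy data, not the code used by either set of authors:

```python
import numpy as np

def fixed_margin_sample(mat, n_swaps=20000, seed=0):
    """Random binary matrix with the same row and column sums as `mat`,
    obtained by repeated 2x2 checkerboard swaps (one standard approach)."""
    rng = np.random.default_rng(seed)
    m = mat.copy()
    n_rows, n_cols = m.shape
    for _ in range(n_swaps):
        r = rng.choice(n_rows, 2, replace=False)
        c = rng.choice(n_cols, 2, replace=False)
        sub = m[np.ix_(r, c)]
        # Only 2x2 checkerboard patterns can be flipped; flipping preserves all margins.
        if sub[0, 0] == sub[1, 1] and sub[0, 1] == sub[1, 0] and sub[0, 0] != sub[0, 1]:
            m[np.ix_(r, c)] = 1 - sub
    return m

rng = np.random.default_rng(1)
data = (rng.random((50, 10)) < 0.4).astype(int)   # toy binary item-response matrix
null = fixed_margin_sample(data)
print(np.array_equal(data.sum(0), null.sum(0)), np.array_equal(data.sum(1), null.sum(1)))
```
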