Results 1 - 20 of 20
1.
BMC Bioinformatics ; 24(1): 342, 2023 Sep 14.
Article in English | MEDLINE | ID: mdl-37710192

ABSTRACT

BACKGROUND: Partitioning around medoids (PAM) is one of the most widely used and successful clustering methods in many fields. One of its key advantages is that it requires only a distance or dissimilarity between individuals, and because cluster centers are actual points in the data set, they can be taken as reliable representatives of their classes. However, its wider application is hampered by the large amount of memory needed to store the distance matrix (quadratic in the number of individuals), by the high computational cost of computing that matrix and, less importantly, by the cost of the clustering algorithm itself. RESULTS: New software has been developed to address these issues. This software, provided under the GPL license and usable as either an R package or a C++ library, calculates the distance matrix in parallel for different distances/dissimilarities ([Formula: see text], [Formula: see text], Pearson, cosine and weighted Euclidean) and also implements a parallel fast version of PAM (FASTPAM1), using any data type to reduce memory usage. Moreover, the parallel implementation uses all the cores available in modern computers, which greatly reduces the execution time. Besides its general application, the software is especially useful for processing data from single cell experiments. It has been tested on problems including the clustering of single cell experiments with up to 289,000 cells and the expression of about 29,000 genes per cell. CONCLUSIONS: Comparisons with other current packages in terms of execution time have been made. The method greatly outperforms the available R packages for distance matrix calculation and also improves on the packages that implement PAM itself. The software is available as an R package at https://CRAN.R-project.org/package=scellpam and as C++ libraries at https://github.com/JdMDE/jmatlib and https://github.com/JdMDE/ppamlib. The package is useful for single cell RNA-seq studies, but it is also applicable in other contexts where clustering of large data sets is required.


Subject(s)
Single-Cell Gene Expression Analysis , Software , Humans , Gene Library , Algorithms , Cluster Analysis
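
The abstract above stresses that PAM needs only a dissimilarity matrix and returns actual data points as cluster centers. A minimal sketch of that idea follows, assuming a small precomputed matrix held in memory. It is a naive BUILD + SWAP version of PAM written for illustration; it is not FASTPAM1 and does not reproduce the scellpam or ppamlib interfaces, and the names `pam` and `total_cost` are hypothetical.

```python
def total_cost(dist, medoids):
    """Sum over all points of the dissimilarity to the closest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))


def pam(dist, k):
    """Naive PAM on a precomputed n x n dissimilarity matrix (list of lists)."""
    n = len(dist)

    # BUILD: greedily add the point that lowers the total cost the most.
    medoids = []
    while len(medoids) < k:
        candidates = [c for c in range(n) if c not in medoids]
        best = min(candidates, key=lambda c: total_cost(dist, medoids + [c]))
        medoids.append(best)

    # SWAP: replace a medoid by a non-medoid whenever that strictly lowers the cost.
    improved = True
    while improved:
        improved = False
        current = total_cost(dist, medoids)
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                cost = total_cost(dist, trial)
                if cost < current:
                    medoids, current, improved = trial, cost, True

    # Assign each point to the index (0..k-1) of its closest medoid.
    labels = [min(range(k), key=lambda j: dist[i][medoids[j]]) for i in range(n)]
    return medoids, labels


# Toy example: six points on a line, absolute difference as dissimilarity.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
medoids, labels = pam(dist, k=2)
print(medoids, labels)  # → [1, 4] [0, 0, 0, 1, 1, 1]
```

Every cost evaluation here rescans the full matrix, so the sketch is only suitable for toy inputs; the memory and run-time concerns raised in the abstract are exactly what such a naive version does not address.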
2.
BMC Public Health ; 23(1): 1615, 2023 Aug 24.
Article in English | MEDLINE | ID: mdl-37620800

ABSTRACT

BACKGROUND: Widely published findings from the COVID-19 pandemic show adverse effects on body mass index (BMI) and behavioral health in both adults and children, due to factors such as illness, job loss, and limited opportunity for physical and social activity. This study investigated whether these adverse effects were mitigated in adolescents from military families, who are universally insured with consistent access to healthcare, and who generally have at least one parent who must adhere to physical and mental fitness as a condition of employment. METHODS: We conducted a cohort study of two groups of adolescents receiving care in the U.S. Military Health System during the COVID-19 pandemic, one for changes in BMI and the second for changes in behavioral health diagnoses, using TRICARE claims data. Beneficiaries (160,037) aged 13 to 15 years in fiscal years 2017-2018 were followed up from October 2020 to June 2021. RESULTS: Among the BMI cohort, 44.32% of underweight adolescents moved to healthy weight, 28.48% moved from overweight to obese, and 3.7% moved from healthy weight to underweight. Prevalence of behavioral disorders showed an overall 29.01% increase during the study period, which included increases in mood (86.75%) and anxiety (86.49%) disorders, suicide ideation (42.69%), and suicide attempts (77.23%). Decreases were observed in conduct disorders (-15.93%) and ADD/ADHD (-8.61%). CONCLUSIONS: Adolescents in military families experienced adverse health outcomes during the pandemic at approximately the same rates as those in non-military families, suggesting that universal insurance and military culture were not significantly mitigating factors. Obesity and underweight present significant opportunities to intervene in areas such as exercise and food access. Decreased conduct disorders and ADD/ADHD may reflect lower prevalence due to a favorable home environment, or lower rates of diagnosis and referral; however, increased rates of anxiety, mood disorders, suicide ideation and suicide attempts are especially concerning. Care should be taken to ensure that adolescents receive consistent opportunity for physical activity and social interaction, and those at risk for suicide should receive active monitoring and appropriate referral to behavioral healthcare providers.


Subject(s)
COVID-19 , Drug-Related Side Effects and Adverse Reactions , Adult , Child , Humans , Adolescent , COVID-19/epidemiology , Body Mass Index , Pandemics , Cohort Studies , Retrospective Studies , Thinness
3.
Int J Mol Sci ; 23(7), 2022 Mar 30.
Article in English | MEDLINE | ID: mdl-35409169

ABSTRACT

Behavioral neuroscience underwent a technology-driven revolution with the emergence of machine-vision and machine-learning technologies. These technological advances facilitated the generation of high-resolution, high-throughput capture and analysis of complex behaviors. Therefore, behavioral neuroscience is becoming a data-rich field. While behavioral researchers use advanced computational tools to analyze the resulting datasets, the search for robust and standardized analysis tools is still ongoing. At the same time, the field of genomics exploded with a plethora of technologies which enabled the generation of massive datasets. This growth of genomics data drove the emergence of powerful computational approaches to analyze these data. Here, we discuss the composition of a large behavioral dataset, and the differences and similarities between behavioral and genomics data. We then give examples of genomics-related tools that might be of use for behavioral analysis and discuss concepts that might emerge when considering the two fields together.


Subject(s)
Genomics , Genomics/methods
4.
Molecules ; 27(5), 2022 Feb 28.
Article in English | MEDLINE | ID: mdl-35268689

ABSTRACT

Dengue is a neglected disease, present mainly in tropical countries, with more than 5.2 million cases reported in 2019. Vector control remains the most effective protective measure against dengue and other arboviruses. Synthetic insecticides based on organophosphates, pyrethroids, carbamates, neonicotinoids and oxadiazines are unattractive due to their high degree of toxicity to humans, animals and the environment. Conversely, natural-product-based larvicides/insecticides, such as essential oils, present high efficiency, low environmental toxicity and can be easily scaled up for industrial processes. However, essential oils are highly complex and require modern analytical and computational approaches to streamline the identification of bioactive substances. This study combined the GC-MS spectral similarity network approach with larvicidal assays as a new strategy for the discovery of potential bioactive substances in complex biological samples, enabling the systematic and simultaneous annotation of substances in 20 essential oils through LC50 larvicidal assays. This strategy allowed rapid intuitive discovery of distribution patterns between families and metabolic classes in clusters, and the prediction of larvicidal properties of acyclic monoterpene derivatives, including citral, neral, citronellal and citronellol, and their acetate forms (LC50 < 50 µg/mL).


Subject(s)
Aedes , Insecticides , Oils, Volatile , Animals , Gas Chromatography-Mass Spectrometry , Humans , Insecticides/pharmacology , Larva , Mosquito Vectors , Oils, Volatile/pharmacology
5.
J Sleep Res ; 29(5): e12994, 2020 Oct.
Article in English | MEDLINE | ID: mdl-32067298

ABSTRACT

Sleep studies face new challenges in terms of data, objectives and metrics. This requires reappraising the adequacy of existing analysis methods, including scoring methods. Visual and automatic sleep scoring of healthy individuals were compared in terms of reliability (i.e., accuracy and stability) to find a scoring method capable of giving access to the actual data variability without adding exogenous variability. A first dataset (DS1, four recordings) scored by six experts plus an autoscoring algorithm was used to characterize inter-scoring variability. A second dataset (DS2, 88 recordings) scored a few weeks later was used to explore intra-expert variability. Percentage agreements and Conger's kappa were derived from epoch-by-epoch comparisons on pairwise and consensus scorings. On DS1 the number of epochs of agreement decreased when the number of experts increased, ranging from 86% (pairwise) to 69% (all experts). Adding autoscoring to visual scorings changed the kappa value from 0.81 to 0.79. Agreement between expert consensus and autoscoring was 93%. On DS2 the hypothesis of intra-expert variability was supported by a systematic decrease in kappa scores between autoscoring used as reference and each single expert between datasets (.75-.70). Although visual scoring induces inter- and intra-expert variability, autoscoring methods can cope with intra-scorer variability, making them a sensible option to reduce exogenous variability and give access to the endogenous variability in the data.


Subject(s)
Polysomnography/methods , Research Design/standards , Sleep/physiology , Algorithms , Healthy Volunteers , Humans , Male , Observer Variation , Reproducibility of Results , Retrospective Studies
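
The epoch-by-epoch comparison described in the abstract above can be illustrated for a single pair of scorers. The sketch below computes the percentage agreement and Cohen's kappa for two label sequences. Cohen's kappa is the two-rater case and is used here as a stand-in, since the study reports Conger's kappa, a multi-rater generalization that is not implemented here. The function names and the toy stage labels are assumptions for illustration.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of epochs on which two scorers assign the same stage."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Chance-corrected agreement between two scorers (Cohen's kappa).

    Raises ZeroDivisionError when expected agreement is 1, i.e. when both
    scorers use one and the same label for every epoch.
    """
    n = len(a)
    observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Agreement expected by chance from each scorer's marginal label frequencies.
    expected = sum(counts_a[s] * counts_b[s] for s in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Toy example: four epochs scored by an expert and by an autoscoring algorithm.
expert = ["W", "W", "N2", "N2"]
auto = ["W", "W", "N2", "W"]
print(percent_agreement(expert, auto))  # → 0.75
print(cohen_kappa(expert, auto))        # → 0.5
```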
6.
Entropy (Basel) ; 21(2), 2019 Jan 22.
Article in English | MEDLINE | ID: mdl-33266816

ABSTRACT

In this study, we present a novel method for quantifying dependencies in multivariate datasets, based on estimating the Rényi mutual information by minimum spanning trees (MSTs). The extent to which random variables are dependent is an important question, e.g., for uncertainty quantification and sensitivity analysis. The latter is closely related to the question of how strongly the output of, e.g., a computer simulation depends on the individual random input variables. To estimate the Rényi mutual information from data, we use a method due to Hero et al. that relies on computing MSTs of the data and uses the length of the MST in an estimator for the entropy. To reduce the computational cost of constructing the exact MST for large datasets, we explore methods to compute approximations to the exact MST, and find the multilevel approach introduced recently by Zhong et al. (2015) to be the most accurate. Because the MST computation does not require knowledge (or estimation) of the distributions, our methodology is well-suited for situations where only data are available. Furthermore, we show that, in the case where only the ranking of several dependencies is required rather than their exact value, it is not necessary to compute the Rényi divergence, but only an estimator derived from it. The main contributions of this paper are the introduction of this quantifier of dependency, as well as the novel combination of using approximate methods for MSTs with estimating the Rényi mutual information via MSTs. We applied our proposed method to an artificial test case based on the Ishigami function, as well as to a real-world test case involving an El Niño dataset.
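
The role of the MST length in an entropy estimator can be sketched as follows. The code builds an exact Euclidean MST with Prim's algorithm and sums the edge lengths raised to a power gamma; the second function plugs that length into the commonly cited form of the Hero et al. estimator, with gamma = d(1 - alpha), while leaving out the additive constant that depends on an unknown quantity usually written as beta. Both function names are hypothetical, the formula is stated from memory of the general literature rather than from this paper, and the approximate multilevel MST mentioned in the abstract is not implemented.

```python
import math


def mst_power_length(points, gamma=1.0):
    """Sum of (edge length ** gamma) over the Euclidean MST, via O(n^2) Prim."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # shortest known edge linking each point to the tree
    best[0] = 0.0           # start the tree at point 0 (contributes no edge)
    total = 0.0
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        total += best[u] ** gamma
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
    return total


def renyi_entropy_up_to_constant(points, alpha=0.5):
    """MST-based Renyi entropy estimate, omitting the unknown additive constant.

    Assumes 0 < alpha < 1, so that gamma = d * (1 - alpha) lies in (0, d).
    """
    n, d = len(points), len(points[0])
    gamma = d * (1 - alpha)
    return math.log(mst_power_length(points, gamma) / n ** alpha) / (1 - alpha)


pts = [(0, 0), (1, 0), (1, 1), (5, 1)]
print(mst_power_length(pts))  # MST edges have lengths 1, 1 and 4
```

Because the omitted constant is the same for every dataset of the same dimension and alpha, such an estimate could still be used to compare datasets, which is in the spirit of the ranking remark in the abstract.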

7.
Biom J ; 59(2): 358-376, 2017 Mar.
Article in English | MEDLINE | ID: mdl-27870109

ABSTRACT

Data with a large p (number of covariates) and/or a large n (sample size) are now commonly encountered. For many problems, regularization, especially penalization, is adopted for estimation and variable selection. The straightforward application of penalization to large datasets demands a "big computer" with high computational power. To improve computational feasibility, we develop bootstrap penalization, which dissects a big penalized estimation into a set of small ones that can be executed in a highly parallel manner, each of which demands only a "small computer". The proposed approach takes different strategies for data with different characteristics. For data with a large p but a small to moderate n, covariates are first clustered into relatively homogeneous blocks. The proposed approach consists of two sequential steps. In each step and for each bootstrap sample, we select blocks of covariates and run penalization. The results from multiple bootstrap samples are pooled to generate the final estimate. For data with a large n but a small to moderate p, we bootstrap a small number of subjects, apply penalized estimation, and then conduct a weighted average over multiple bootstrap samples. For data with a large p and a large n, the natural marriage of the previous two methods is applied. Numerical studies, including simulations and data analysis, show that the proposed approach has computational and numerical advantages over the straightforward application of penalization. An R package has been developed to implement the proposed methods.


Subject(s)
Computer Simulation , Data Interpretation, Statistical , Humans , Models, Statistical , Sample Size , Software
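
The large-n strategy in the abstract above (bootstrap a small number of subjects, fit a penalized estimate, average over bootstrap samples) can be sketched with a deliberately simple stand-in. The code below uses a one-covariate ridge estimate without intercept and equal weights in the final average. Both choices are assumptions made for illustration: the abstract does not specify the penalty or the weights, and the function names and defaults are hypothetical rather than taken from the authors' R package.

```python
import random


def ridge_slope(xs, ys, lam):
    """No-intercept ridge fit for one covariate: argmin sum (y - b*x)^2 + lam * b^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def bootstrap_ridge(xs, ys, lam=1.0, subset_size=50, n_boot=100, seed=0):
    """Average of ridge fits on small bootstrap subsets (equal weights assumed)."""
    rng = random.Random(seed)
    n = len(xs)
    estimates = []
    for _ in range(n_boot):
        # Draw a small bootstrap sample of subjects, with replacement.
        idx = [rng.randrange(n) for _ in range(subset_size)]
        estimates.append(ridge_slope([xs[i] for i in idx], [ys[i] for i in idx], lam))
    return sum(estimates) / n_boot


# Toy data with true slope 2; each small fit touches only 50 of the 1000 subjects.
xs = [i / 100 for i in range(1, 1001)]
ys = [2 * x for x in xs]
print(bootstrap_ridge(xs, ys))
```

Each bootstrap fit is independent of the others, so the loop could be distributed across workers, which is the "small computer" point made in the abstract.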
8.
Behav Res Methods ; 49(4): 1227-1240, 2017 Aug.
Article in English | MEDLINE | ID: mdl-27586138

ABSTRACT

The game of chess has often been used for psychological investigations, particularly in cognitive science. The clear-cut rules and well-defined environment of chess provide a model for investigations of basic cognitive processes, such as perception, memory, and problem solving, while the precise rating system for the measurement of skill has enabled investigations of individual differences and expertise-related effects. In the present study, we focus on another appealing feature of chess, namely the large archive databases associated with the game. The German national chess database presented in this study represents fruitful ground for the investigation of multiple longitudinal research questions, since it collects the data of over 130,000 players and spans over 25 years. The German chess database collects the data of all players, including hobby players, and all tournaments played. This results in a rich and complete collection of the skill, age, and activity of the whole population of chess players in Germany. The database therefore complements the commonly used expertise approach in cognitive science by opening up new possibilities for the investigation of multiple factors that underlie expertise and skill acquisition. Since large datasets are not common in psychology, their introduction also raises the question of optimal and efficient statistical analysis. We offer the database for download and illustrate how it can be used by providing concrete examples and a step-by-step tutorial using different statistical analyses on a range of topics, including skill development over the lifetime, birth cohort effects, effects of activity and inactivity on skill, and gender differences.


Subject(s)
Cognitive Science/methods , Databases, Factual , Games, Recreational/psychology , Models, Psychological , Germany , Humans , Memory
9.
Gigascience ; 13, 2024 Jan 02.
Article in English | MEDLINE | ID: mdl-39115958

ABSTRACT

BACKGROUND: Phylogenies play a crucial role in biological research. Unfortunately, the search for the optimal phylogenetic tree incurs significant computational costs, and most of the existing state-of-the-art tools cannot deal with extremely large datasets in reasonable times. RESULTS: In this work, we introduce the new VeryFastTree code (version 4.0), which is able to construct a tree on 1 server using single-precision arithmetic from a massive 1 million alignment dataset in only 36 hours, which is 3 times and 3.2 times faster than its previous version and FastTree-2, respectively. This new version further boosts performance by parallelizing all tree traversal operations during the tree construction process, including subtree pruning and regrafting moves. Additionally, it introduces significant new features such as support for new and compressed file formats, enhanced compatibility across a broader range of operating systems, and the integration of disk computing functionality. The latter feature is particularly advantageous for users without access to high-end servers, as it allows them to manage very large datasets, albeit with an increase in computing time. CONCLUSIONS: Experimental results establish VeryFastTree as the fastest tool in the state-of-the-art for maximum likelihood phylogeny estimation. It is publicly available at https://github.com/citiususc/veryfasttree. In addition, VeryFastTree is included as a package in Bioconda, MacPorts, and all Debian-based Linux distributions.


Subject(s)
Phylogeny , Software , Algorithms , Computational Biology/methods , Classification/methods , Databases, Genetic
10.
J Affect Disord ; 365: 527-533, 2024 Nov 15.
Article in English | MEDLINE | ID: mdl-39182518

ABSTRACT

BACKGROUND: There is limited evaluation of approaches to identify patients with new onset bipolar affective disorder (BPAD) when using administrative datasets. METHODS: Using the Massachusetts All-Payer Claims Database (APCD), we identified individuals with a 2016 diagnosis of bipolar disorder with mania and examined patterns of psychiatric and medical care over the preceding 48 months. RESULTS: Among 4806 individuals aged 15-35 years with a 2016 BPAD with mania diagnosis, 3066 had 48 months of historical APCD data, and of those, 75 % involved information from ≥2 payors. After excluding individuals with historical BPAD or mania diagnoses, there were 583 individuals whose 2016 BPAD with mania diagnosis appeared to be new (i.e., 34 new diagnoses per 100,000 individuals aged 15-35 years). Most individuals received medical care, e.g., 98 % had outpatient visits, 76 % had Emergency Department (ED) visits, and 50 % had mental health-related ED visits during the 48 months prior to their first mania diagnosis. One-third (37.2 %) had a depressive episode before their initial BPAD with mania diagnosis. LIMITATIONS: Study was conducted in one state among insured individuals. We used administrative data, which permits evaluation of large populations but lacks rigorous, well-validated claims-based definitions for BPAD. There could be diagnostic uncertainty during illness course, and clinicians may differ in their diagnostic thresholds. CONCLUSIONS: Careful examination of multiple years of patient history spanning all payors is essential for identifying new onset BPAD diagnoses presenting with mania, which in turn is critical to estimating population rates of new disease and understanding the early course of disease.


Subject(s)
Bipolar Disorder , Mania , Patient Acceptance of Health Care , Humans , Adult , Female , Male , Adolescent , Bipolar Disorder/diagnosis , Bipolar Disorder/epidemiology , Young Adult , Patient Acceptance of Health Care/statistics & numerical data , Massachusetts/epidemiology , Mania/diagnosis , Emergency Service, Hospital/statistics & numerical data , Databases, Factual , Mental Health Services/statistics & numerical data
11.
Membranes (Basel) ; 14(7), 2024 Jul 14.
Article in English | MEDLINE | ID: mdl-39057665

ABSTRACT

The ability to predict the rate of permeation of new compounds across biological membranes is of high importance for their success as drugs, as it determines their efficacy, pharmacokinetics, and safety profile. In vitro permeability assays using Caco-2 monolayers are commonly employed to assess permeability across the intestinal epithelium, with an extensive number of apparent permeability coefficient (Papp) values available in the literature and a significant fraction collected in databases. The compilation of these Papp values for large datasets allows for the application of artificial intelligence tools for establishing quantitative structure-permeability relationships (QSPRs) to predict the permeability of new compounds from their structural properties. One of the main challenges that hinders the development of accurate predictions is the existence of multiple Papp values for the same compound, mostly caused by differences in the experimental protocols employed. This review addresses the magnitude of the variability within and between laboratories to interpret its impact on QSPR modelling, systematically and quantitatively assessing the most common sources of variability. This review emphasizes the importance of compiling consistent Papp data and suggests strategies that may be used to obtain such data, contributing to the establishment of robust QSPRs with enhanced predictive power.

12.
LGBT Health ; 9(1): 54-62, 2022 Jan.
Article in English | MEDLINE | ID: mdl-34882021

ABSTRACT

Purpose: Sexual minority adults report worse mental health than heterosexual peers, although few empirical studies are large enough to measure variation in these disparities by sexual orientation, age, ethnicity, and socioeconomic status (SES). We investigate chronic mental health problems among sexual minority adults. Methods: Sex-disaggregated logistic regressions examined associations between self-reported chronic mental health problems and sexual orientation, age, ethnicity, and SES in a 2015-2017 dataset from the nationally representative English General Practice Patient Survey data (n = 1,341,339). Results: Bisexual adults, especially young bisexual females, reported the highest rates of chronic mental health problems. Sexual minority females 18-24 years of age had five times the odds of reporting chronic mental health problems of their heterosexual peers, with 32% of sexual minority females 18-24 years of age reporting the outcome. Sexual minority identity was also strongly associated with chronic mental health problems for adults who were White and lived in more affluent areas. Conclusion: The very high odds of chronic mental health problems among bisexual adults, especially younger bisexual females, may reflect simultaneous isolation from sexual minority and heterosexual communities. Elevated odds at younger ages may reflect disproportionate social media use and bullying. It is plausible that those who are subject to minority stress associated with SES and ethnicity may develop resilience strategies that they then apply to sexual minority stress. The results suggest that sexual minority identity is a source of minority stress, even for those who are affluent. Clinicians should be alert to the need to support the specific mental health concerns of their sexual minority patients.


Subject(s)
Mental Health , Sexual and Gender Minorities , Adolescent , Adult , Ethnicity , Female , Humans , Male , Sexual Behavior/psychology , Social Class , Young Adult
13.
MethodsX ; 9: 101660, 2022.
Article in English | MEDLINE | ID: mdl-35345788

ABSTRACT

Large sets of autocorrelated data are common in fields such as remote sensing and genomics. For example, remote sensing can produce maps of information for millions of pixels, and the information from nearby pixels will likely be spatially autocorrelated. Although there are well-established statistical methods for testing hypotheses using autocorrelated data, these methods become computationally impractical for large datasets.
• The method developed here makes it feasible to perform F-tests, likelihood ratio tests, and t-tests for large autocorrelated datasets. The method involves subsetting the dataset into partitions, analyzing each partition separately, and then combining the separate tests to give an overall test.
• The separate statistical tests on partitions are non-independent, because the points in different partitions are not independent. Therefore, combining separate analyses of partitions requires accounting for the non-independence of the test statistics among partitions.
• The methods can be applied to a wide range of data, including not only purely spatial data but also spatiotemporal data. For spatiotemporal data, it is possible to estimate coefficients from time-series models at different spatial locations and then analyze the spatial distribution of the estimates. The spatial analysis can be simplified by estimating spatial autocorrelation directly from the spatial autocorrelation among time series.

14.
J Biol Phys ; 37(3): 263-83, 2011 Jun.
Article in English | MEDLINE | ID: mdl-22654177

ABSTRACT

A half-center oscillator (HCO) is a common circuit building block of central pattern generator networks that produce rhythmic motor patterns in animals. Here we constructed an efficient relational database table with the resulting characteristics of Hill et al.'s (J Comput Neurosci 10:281-302, 2001) simple conductance-based HCO model. The model consists of two reciprocally inhibitory neurons and replicates the electrical activity of the oscillator interneurons of the leech heartbeat central pattern generator under a variety of experimental conditions. Our long-range goal is to understand how this basic circuit building block produces functional activity under a variety of parameter regimes and how different parameter regimes influence stability and modulatability. By using the latest developments in computer technology, we simulated and stored large amounts of data (on the order of terabytes). We systematically explored the parameter space of the HCO and corresponding isolated neuron models using a brute-force approach. We varied a set of selected parameters (maximal conductance of intrinsic and synaptic currents) in all combinations, resulting in about 10 million simulations. We classified these HCO and isolated neuron model simulations by their activity characteristics into identifiable groups and quantified their prevalence. By querying the database, we compared the activity characteristics of the identified groups of our simulated HCO models with those of our simulated isolated neuron models and found that regularly bursting neurons compose only a small minority of functional HCO models; the vast majority were composed of spiking neurons.

15.
J Soc Psychol ; 161(5): 627-631, 2021 Sep 03.
Article in English | MEDLINE | ID: mdl-33682612

ABSTRACT

This work examines relationships between friendships and implicit preferences across two large samples. There is considerable evidence in the contact literature suggesting that friendships relate to more favorable attitudes toward outgroups; however, most evidence reflects explicit self-report measures. Results from samples of 235,543 participants who completed the Disability IAT and 533,220 participants who completed the Sexuality IAT on the Project Implicit website indicate that participants reporting either a disabled friend or close acquaintance demonstrated weaker implicit preferences for abled over disabled people. Similarly, those with gay friends demonstrated weaker implicit preference for "straight" over gay. The sizes of these relationships were considerably smaller than those found for explicit evaluations. These effect size estimates should be useful to researchers studying contact-implicit preference relationships, as they inform power analyses and sample size planning decisions.


Subject(s)
Attitude , Sexual Behavior , Humans
16.
F1000Res ; 9: 1380, 2020.
Article in English | MEDLINE | ID: mdl-33976878

ABSTRACT

The number of grey values that can be displayed on monitors and be processed by the human eye is smaller than the dynamic range of image-based sensors. This makes the visualization of such data a challenge, especially with specimens where small dim structures are as important as large bright ones, or whenever variations in intensity, such as non-homogeneous staining efficiencies or light depth penetration, become an issue. While simple intensity display mappings are easily possible, these fail to provide a one-shot observation that can display objects of varying intensities. In order to facilitate the visualization-based analysis of large volumetric datasets, we developed an easy-to-use ImageJ plugin enabling the compressed display of features within several magnitudes of intensities. The Display Enhancement for Visual Inspection of Large Stacks plugin (DEVILS) homogenizes the intensities by using a combination of local and global pixel operations to allow for high and low intensities to be visible simultaneously to the human eye. The plugin is based on a single, intuitively understandable parameter, features a preview mode, and uses parallelization to process multiple image planes. As output, the plugin is capable of producing a BigDataViewer-compatible dataset for fast visualization. We demonstrate the utility of the plugin for large volumetric image data.


Subject(s)
Image Processing, Computer-Assisted , Light , Humans
17.
MethodsX ; 7: 100600, 2020.
Article in English | MEDLINE | ID: mdl-32021810

ABSTRACT

We provide more technical details about the HLIBCov package, which uses parallel hierarchical (H-) matrices to:
• Approximate large dense inhomogeneous covariance matrices with a log-linear computational cost and storage requirement.
• Compute matrix-vector products, Cholesky factorizations and inverses with log-linear complexity.
• Identify unknown parameters of the covariance function (variance, smoothness, and covariance length).
These unknown parameters are estimated by maximizing the joint Gaussian log-likelihood function. To demonstrate the numerical performance, we identify three unknown parameters in an example with 2,000,000 locations on a desktop PC.
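
For orientation, the joint Gaussian log-likelihood that is maximized has the standard form below. The notation is an assumption introduced here and not taken from the package: z is a zero-mean data vector of length n, C(θ) is the covariance matrix, and θ collects the variance, smoothness and covariance length.

```latex
\mathcal{L}(\theta)
  = -\frac{n}{2}\log(2\pi)
    - \frac{1}{2}\log\det C(\theta)
    - \frac{1}{2}\, z^{\top} C(\theta)^{-1} z,
\qquad \theta = (\sigma^{2}, \nu, \ell).

% With a Cholesky factorization C(\theta) = L L^{\top}:
\log\det C(\theta) = 2\sum_{i=1}^{n} \log L_{ii},
\qquad
z^{\top} C(\theta)^{-1} z = \lVert L^{-1} z \rVert_{2}^{2}.
```

Both terms depend only on a Cholesky factor of C(θ), which suggests why a log-linear approximate factorization makes evaluating the likelihood feasible at 2,000,000 locations.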

18.
Article in English | MEDLINE | ID: mdl-32165781

ABSTRACT

ROC analysis involving two large datasets is an important method for analyzing statistics of interest for decision making of a classifier in many disciplines. Data dependency due to multiple use of the same subjects, which is done to generate more samples when resources are limited, is ubiquitous. Hence, a two-layer data structure is constructed and the nonparametric two-sample two-layer bootstrap is employed to estimate standard errors of statistics of interest derived from two sets of data, such as a weighted sum of two probabilities. In this article, to reduce the bootstrap variance and ensure the accuracy of computation, Monte Carlo studies of bootstrap variability were carried out to determine the appropriate number of bootstrap replications in ROC analysis with data dependency. It is suggested that, with a tolerance of 0.02 for the coefficient of variation, 2,000 bootstrap replications are appropriate under such circumstances.
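
The idea of choosing the number of bootstrap replications from the variability of the bootstrap standard error itself can be sketched on a much simpler problem. The code below repeats a plain one-sample bootstrap of the mean several times and reports the coefficient of variation of the resulting standard-error estimates. It is not the two-sample two-layer bootstrap of the study, and the function names, the toy data and the number of repeats are assumptions for illustration.

```python
import random
import statistics


def bootstrap_se(data, n_boot, rng):
    """Bootstrap standard error of the sample mean from n_boot resamples."""
    means = []
    for _ in range(n_boot):
        sample = [rng.choice(data) for _ in data]
        means.append(statistics.fmean(sample))
    return statistics.stdev(means)


def cv_of_bootstrap_se(data, n_boot, n_repeats=30, seed=0):
    """Coefficient of variation of the bootstrap SE across repeated runs."""
    rng = random.Random(seed)
    ses = [bootstrap_se(data, n_boot, rng) for _ in range(n_repeats)]
    return statistics.stdev(ses) / statistics.fmean(ses)


data = [float(i % 7) for i in range(40)]
for n_boot in (25, 100, 400):
    # The CV is expected to shrink as the number of replications grows.
    print(n_boot, round(cv_of_bootstrap_se(data, n_boot), 3))
```

In this simplified setting one would increase `n_boot` until the printed value falls below a chosen tolerance; the 0.02 tolerance and the 2,000 replications quoted in the abstract refer to the authors' ROC setting, not to this toy example.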

19.
Front Neuroinform ; 11: 21, 2017.
Article in English | MEDLINE | ID: mdl-28381997

ABSTRACT

OpenMOLE is a scientific workflow engine with a strong emphasis on workload distribution. Workflows are designed using a high level Domain Specific Language (DSL) built on top of Scala. It exposes natural parallelism constructs to easily delegate the workload resulting from a workflow to a wide range of distributed computing environments. OpenMOLE hides the complexity of designing complex experiments thanks to its DSL. Users can embed their own applications and scale their pipelines from a small prototype running on their desktop computer to a large-scale study harnessing distributed computing infrastructures, simply by changing a single line in the pipeline definition. The construction of the pipeline itself is decoupled from the execution context. The high-level DSL abstracts the underlying execution environment, contrary to classic shell-script based pipelines. These two aspects allow pipelines to be shared and studies to be replicated across different computing environments. Workflows can be run as traditional batch pipelines or coupled with OpenMOLE's advanced exploration methods in order to study the behavior of an application, or perform automatic parameter tuning. In this work, we briefly present the strong assets of OpenMOLE and detail recent improvements targeting re-executability of workflows across various Linux platforms. We have tightly coupled OpenMOLE with CARE, a standalone containerization solution that allows re-executing on a Linux host any application that has been packaged on another Linux host previously. The solution is evaluated against a Python-based pipeline involving packages such as scikit-learn as well as binary dependencies. All were packaged and re-executed successfully on various HPC environments, with identical numerical results (here prediction scores) obtained on each environment. Our results show that the pair formed by OpenMOLE and CARE is a reliable solution to generate reproducible results and re-executable pipelines. 
A demonstration of the flexibility of our solution showcases three neuroimaging pipelines harnessing distributed computing environments as heterogeneous as local clusters or the European Grid Infrastructure (EGI).

20.
Commun Stat Simul Comput ; 45(5): 1689-1703, 2016.
Article in English | MEDLINE | ID: mdl-27499571

ABSTRACT

The nonparametric two-sample bootstrap is applied to computing uncertainties of measures in ROC analysis on large datasets in areas such as biometrics, speaker recognition, etc., when the analytical method cannot be used. Its validation was studied by computing the SE of the area under ROC curve using the well-established analytical Mann-Whitney-statistic method and also using the bootstrap. The analytical result is unique. The bootstrap results are expressed as a probability distribution due to its stochastic nature. The comparisons were carried out using relative errors and hypothesis testing. They match very well. This validation provides a sound foundation for such computations.
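
A minimal sketch of the nonparametric two-sample bootstrap for the standard error of the area under the ROC curve follows. The AUC is computed through its Mann-Whitney form (the fraction of positive/negative score pairs ranked correctly, with ties counted as one half), and the two score sets are resampled independently with replacement. The function names, the toy scores and the default number of replications are assumptions; the analytical Mann-Whitney standard error used for validation in the study is not reproduced here.

```python
import random
import statistics


def auc(pos, neg):
    """AUC as the Mann-Whitney statistic: P(pos score > neg score), ties count 1/2."""
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_se_auc(pos, neg, n_boot=1000, seed=0):
    """Two-sample bootstrap SE: resample each score set separately, recompute AUC."""
    rng = random.Random(seed)
    replicates = []
    for _ in range(n_boot):
        boot_pos = rng.choices(pos, k=len(pos))
        boot_neg = rng.choices(neg, k=len(neg))
        replicates.append(auc(boot_pos, boot_neg))
    return statistics.stdev(replicates)


pos_scores = [0.9, 0.8, 0.7, 0.4]
neg_scores = [0.6, 0.5, 0.3, 0.2]
print(auc(pos_scores, neg_scores))  # → 0.875
print(bootstrap_se_auc(pos_scores, neg_scores))
```

The pairwise AUC computation costs time proportional to the product of the two sample sizes, so for the large datasets discussed in the abstract a rank-based computation would be preferable.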
