ABSTRACT
In clinical research, the analysis of patient cohorts is a widely employed method for investigating relevant healthcare questions. The ability to automatically extract large-scale patient cohorts from hospital systems is vital to unlock the potential of real-world clinical data and to answer pivotal medical questions through retrospective research studies. However, existing medical data are often dispersed across various systems and databases, hindering systematic access and interoperability. Even when the data are readily accessible, clinical researchers need to sift through Electronic Medical Records, confirm ethical approval, verify the status of patient consent, check the availability of imaging data, and filter the data based on disease-specific image biomarkers. We present Cohort Builder, a software pipeline designed to facilitate the creation of patient cohorts with predefined baseline characteristics from real-world ophthalmic imaging data and electronic medical records. The applicability of our approach extends beyond ophthalmology to other medical domains with similar requirements, such as neurology, cardiology, and orthopedics.
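To make the filtering step concrete, here is a minimal, hypothetical sketch of the kind of cohort selection such a pipeline automates; the column names (consent_status, ethics_approved, has_oct_imaging, biomarker_crt_um) and the biomarker threshold are illustrative assumptions, not the actual Cohort Builder schema.

```python
import pandas as pd

# Hypothetical illustration of cohort filtering; the column names and the
# threshold are assumptions for illustration, not the Cohort Builder schema.
def build_cohort(emr: pd.DataFrame, max_central_retinal_thickness_um: float = 400.0) -> pd.DataFrame:
    """Return patients meeting the predefined baseline criteria."""
    mask = (
        (emr["consent_status"] == "granted")
        & (emr["ethics_approved"])
        & (emr["has_oct_imaging"])
        & (emr["biomarker_crt_um"] <= max_central_retinal_thickness_um)
    )
    return emr.loc[mask]

if __name__ == "__main__":
    records = pd.DataFrame({
        "patient_id": [1, 2, 3],
        "consent_status": ["granted", "withdrawn", "granted"],
        "ethics_approved": [True, True, True],
        "has_oct_imaging": [True, True, False],
        "biomarker_crt_um": [350.0, 290.0, 410.0],
    })
    print(build_cohort(records))  # only patient 1 satisfies all criteria
```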
Subject(s)
Electronic Health Records, Software, Humans, Diagnostic Imaging, Cohort Studies, Eye Diseases/diagnostic imaging
ABSTRACT
The increasing use of MAUDE (Manufacturer and User Facility Device Experience) reports in patient safety research highlights the importance of understanding the processing and dissemination of open-access MAUDE data. However, the absence of a structured data pipeline undermines the reproducibility and transparency of studies relying on MAUDE data. In response, we conducted a comprehensive analysis of a recent study on endoscopic clips, assessing its methodology and results. We advocate for implementing an extract, transform, and load (ETL) pipeline built on openFDA and integrating keyword search strategies and data visualization techniques. This approach aims to enhance the quality of MAUDE-based studies, ensuring their reproducibility and transparency. Moreover, ETL serves as a cornerstone in data engineering, enabling real-time data management and quality assurance and thus promoting the sustainability of, and collaboration in, MAUDE-based patient safety research.
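As an illustration of the proposed ETL approach, the sketch below queries the openFDA device-event endpoint and retains a few fields for keyword screening; the search field and retained keys are assumptions about the openFDA schema and should be verified against its documentation before use.

```python
import requests

# Illustrative extract/transform steps against the openFDA device-event API;
# the field names used here are assumptions about the schema, not verified.
OPENFDA_URL = "https://api.fda.gov/device/event.json"

def extract_reports(keyword: str, limit: int = 100) -> list[dict]:
    params = {"search": f'device.generic_name:"{keyword}"', "limit": limit}
    response = requests.get(OPENFDA_URL, params=params, timeout=30)
    response.raise_for_status()
    return response.json().get("results", [])

def transform(reports: list[dict]) -> list[dict]:
    # Keep only the fields needed for downstream keyword screening.
    return [
        {
            "report_number": r.get("report_number"),
            "event_type": r.get("event_type"),
            "date_received": r.get("date_received"),
        }
        for r in reports
    ]

if __name__ == "__main__":
    rows = transform(extract_reports("clip", limit=10))
    print(f"Fetched {len(rows)} reports")  # the load step would write these to a database
```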
Subject(s)
Patient Safety, Humans, Reproducibility of Results
ABSTRACT
With the increasing utilization of data in various industries and applications, constructing an efficient data pipeline has become crucial. In this study, we propose a machine learning operations-centric data pipeline specifically designed for an energy consumption management system. This pipeline seamlessly integrates the machine learning model with real-time data management and prediction capabilities. The overall architecture of our proposed pipeline comprises several key components, including Kafka, InfluxDB, Telegraf, Zookeeper, and Grafana. To enable accurate energy consumption predictions, we adopt two time-series prediction models: long short-term memory (LSTM) and seasonal autoregressive integrated moving average (SARIMA). Our analysis reveals a clear trade-off between speed and accuracy: SARIMA exhibits faster model learning time, while LSTM outperforms SARIMA in prediction accuracy. To validate the effectiveness of our pipeline, we measure the overall processing time by optimizing the configuration of Telegraf, which directly impacts the load on the pipeline. The results are promising: our pipeline achieves an average end-to-end processing time of only 0.39 s for handling 10,000 data records and 1.26 s when scaling up to 100,000 records, 30.69-90.88 times faster than the existing Python-based approach. Additionally, when the number of records increases tenfold, the added overhead is reduced by a factor of 3.07. This verifies that the proposed pipeline has an efficient and scalable structure suitable for real-time environments.
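For the SARIMA side of the speed/accuracy trade-off, a minimal forecasting sketch with statsmodels is shown below; the synthetic hourly load and the (p, d, q)(P, D, Q, s) orders are placeholders, not the configuration used in the paper's pipeline.

```python
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Minimal SARIMA forecasting sketch on a synthetic daily-seasonal load series;
# model orders are illustrative placeholders.
rng = np.random.default_rng(0)
hours = np.arange(24 * 14)
hourly_load = 100 + 10 * np.sin(hours * 2 * np.pi / 24) + rng.normal(0, 2, hours.size)

model = SARIMAX(hourly_load, order=(1, 0, 1), seasonal_order=(1, 1, 1, 24))
fitted = model.fit(disp=False)
forecast = fitted.forecast(steps=24)  # predict the next day of consumption
print(forecast[:5])
```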
ABSTRACT
Digital transformation for the water sector has gained momentum in recent years, and many water resource recovery facility modelers have already started transitioning from developing traditional models to digital twin (DT) applications. DTs simulate the operation of treatment plants in near real time and provide a powerful tool to operators and process engineers for real-time scenario analysis and calamity mitigation, online process optimization, predictive maintenance, model-based control, and so forth. So far, only a few mature examples of full-scale DT implementations can be found in the literature, and they address only some of the key requirements of a DT. This paper presents the development of a full-scale operational DT for the Eindhoven water resource recovery facility in The Netherlands, which includes a fully automated data pipeline combined with a detailed mechanistic full-plant process model and a user interface co-created with the plant's operators. The automated data preprocessing pipeline provides continuous access to validated data, an influent generator provides dynamic predictions of influent composition data and allows forecasting 48 h into the future, and an advanced compartmental model of the aeration and anoxic bioreactors ensures high predictive power. The DT runs near real-time simulations every 2 h. Visualization and interaction with the DT are facilitated by the cloud-based TwinPlant technology, which was developed in close interaction with the plant's operators. A set of predefined handles is made available, allowing users to simulate hypothetical scenarios such as process and equipment failures and changes in controller settings. The combination of the advanced data pipeline and process model development used in the Eindhoven DT and the active involvement of the operators, process engineers, and managers in the development process makes the twin a valuable asset for decision making with long-term reliability. PRACTITIONER POINTS:
• A full-scale digital twin (DT) has been developed for the Eindhoven WRRF.
• The Eindhoven DT includes an automated continuous data preprocessing and reconciliation pipeline.
• A full-plant mechanistic compartmental process model of the plant has been developed based on hydrodynamic studies.
• The interactive user interface of the Eindhoven DT allows operators to perform what-if scenarios on various operational settings and process inputs.
• Plant operators were actively involved in the DT development process to make a reliable and relevant tool with the expected added value.
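A toy example of the kind of validation step an automated data preprocessing pipeline performs is sketched below; the thresholds, interpolation limit, and variable name are illustrative assumptions, not the Eindhoven implementation.

```python
import pandas as pd

# Simplified sensor-data validation: flag out-of-range values as missing and
# fill short gaps by interpolation. Thresholds and names are illustrative.
def validate_sensor_series(raw: pd.Series, lower: float, upper: float) -> pd.Series:
    cleaned = raw.where((raw >= lower) & (raw <= upper))
    return cleaned.interpolate(limit=4)  # fill gaps of up to 4 consecutive samples

if __name__ == "__main__":
    ammonium = pd.Series([2.1, 2.3, -9.0, 2.6, 2.8], name="NH4_mg_per_L")
    print(validate_sensor_series(ammonium, lower=0.0, upper=50.0))
```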
Subject(s)
Bioreactors, Water Resources, Reproducibility of Results
ABSTRACT
Advancements in phenotyping technology have enabled plant science researchers to gather large volumes of information from their experiments, especially those that evaluate multiple genotypes. To fully leverage these complex and often heterogeneous data sets (i.e., those that differ in format and structure), scientists must invest considerable time in data processing, and data management has emerged as a considerable barrier to downstream application. Here, we propose a pipeline to enhance data collection, processing, and management from plant science studies, comprising two newly developed open-source programs. The first, called AgTC, is a series of programming functions that generates comma-separated values file templates to collect data in a standard format using either a lab-based computer or a mobile device. The second series of functions, AgETL, executes steps for an Extract-Transform-Load (ETL) data integration process in which data are extracted from heterogeneously formatted files, transformed to meet standard criteria, and loaded into a database. There, data are stored and can be accessed for data analysis-related processes, including dynamic data visualization through web-based tools. Both AgTC and AgETL are flexible for application across plant science experiments without programming knowledge on the part of the domain scientist, and their functions are executed on Jupyter Notebook, a browser-based interactive development environment. Additionally, all parameters are easily customized from central configuration files written in the human-readable YAML format. Using three experiments from research laboratories in university and non-government organization (NGO) settings as test cases, we demonstrate the utility of AgTC and AgETL for streamlining critical steps from data collection to analysis in the plant sciences.
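The sketch below illustrates, under assumed file names, keys, and column mappings, how a YAML-driven Extract-Transform-Load step in the spirit of AgETL might look; it is not the AgETL code itself.

```python
import sqlite3
import pandas as pd
import yaml

# Sketch of a YAML-configured ETL step; file names, keys, and column mappings
# are assumptions for illustration, not the AgETL configuration format.
CONFIG = yaml.safe_load("""
source_file: trial_plot_data.csv
rename_columns:
  PlotID: plot_id
  Yield_kg_ha: yield_kg_per_ha
table_name: observations
""")

def run_etl(config: dict, db_path: str = "experiments.db") -> None:
    frame = pd.read_csv(config["source_file"])               # extract
    frame = frame.rename(columns=config["rename_columns"])   # transform to standard names
    with sqlite3.connect(db_path) as conn:                    # load
        frame.to_sql(config["table_name"], conn, if_exists="append", index=False)

# run_etl(CONFIG)  # executed from a notebook once the source file exists
```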
ABSTRACT
Capturing the breadth of chemical exposures in utero is critical to understanding their long-term health effects for mother and child. We explored methodological adaptations in a Non-Targeted Analysis (NTA) pipeline and evaluated their effects on chemical annotation and discovery for maternal and infant exposure. We focus on lesser-known/underreported chemicals in maternal and umbilical cord serum analyzed with liquid chromatography-quadrupole time-of-flight mass spectrometry (LC-QTOF/MS). The samples were collected from a demographically diverse cohort of 296 maternal-cord pairs (n = 592) recruited in the San Francisco Bay Area. We developed and evaluated two data processing pipelines, differing primarily in detection frequency cut-off, to extract chemical features from NTA. We annotated the detected chemical features by matching against the EPA CompTox Chemicals Dashboard (n = 860,000 chemicals) and the Human Metabolome Database (n = 3140 chemicals) and applied a Kendrick Mass Defect filter to detect homologous series. We collected fragmentation spectra (MS/MS) on a subset of serum samples and matched them to an experimental MS/MS database within the MS-Dial website and to other experimental MS/MS spectra collected from standards in our lab. We annotated ~72% of the features (total features = 32,197, levels 1-4). We confirmed 22 compounds with analytical standards, tentatively identified 88 compounds with MS/MS spectra, and annotated 4862 exogenous chemicals with an in-house developed annotation algorithm. We detected 36 chemicals that do not appear to have been previously reported in human blood and 9 chemicals that have been reported in fewer than five studies. Our findings underline the importance of NTA in the discovery of lesser-known/unreported chemicals that are important for characterizing human exposures.
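The Kendrick Mass Defect filter groups features into CH2-based homologous series because series members share (approximately) the same mass defect after rescaling by the exact CH2 mass. A minimal sketch is shown below; the grouping tolerance is an illustrative choice, not the threshold used in the study.

```python
# Kendrick Mass Defect (KMD) for a CH2-based homologous series: members of the
# same series share (approximately) the same KMD. Tolerance is illustrative.
CH2_EXACT = 14.01565
CH2_NOMINAL = 14.0

def kendrick_mass_defect(mz: float) -> float:
    kendrick_mass = mz * (CH2_NOMINAL / CH2_EXACT)
    return round(kendrick_mass) - kendrick_mass

def same_series(mz_a: float, mz_b: float, tolerance: float = 0.002) -> bool:
    return abs(kendrick_mass_defect(mz_a) - kendrick_mass_defect(mz_b)) < tolerance

if __name__ == "__main__":
    # Two features separated by exactly one CH2 unit fall into the same series.
    print(same_series(445.1200, 445.1200 + CH2_EXACT))  # True
```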
Subject(s)
Exposome, Tandem Mass Spectrometry, Female, Pregnancy, Child, Humans, Liquid Chromatography, Liquid Chromatography-Mass Spectrometry, San Francisco
ABSTRACT
Introduction: The US Environmental Protection Agency Toxicity Forecaster (ToxCast) program makes in vitro medium- and high-throughput screening assay data publicly available for prioritization and hazard characterization of thousands of chemicals. The assays employ a variety of technologies to evaluate the effects of chemical exposure on diverse biological targets, from distinct proteins to more complex cellular processes like mitochondrial toxicity, nuclear receptor signaling, immune responses, and developmental toxicity. The ToxCast data pipeline (tcpl) is an open-source R package that stores, manages, curve-fits, and visualizes ToxCast data and populates the linked MySQL Database, invitrodb. Methods: Herein we describe major updates to tcpl and invitrodb to accommodate a new curve-fitting approach. The original tcpl curve-fitting models (constant, Hill, and gain-loss models) have been expanded to include Polynomial 1 (Linear), Polynomial 2 (Quadratic), Power, Exponential 2, Exponential 3, Exponential 4, and Exponential 5 based on BMDExpress and encoded by the R package dependency, tcplfit2. Inclusion of these models impacted invitrodb (beta version v4.0) and tcpl v3 in several ways: (1) long-format storage of generic modeling parameters to permit additional curve-fitting models; (2) updated logic for winning model selection; (3) continuous hit calling logic; and (4) removal of redundant endpoints as a result of bidirectional fitting. Results and discussion: Overall, the hit call and potency estimates were largely consistent between invitrodb v3.5 and 4.0. Tcpl and invitrodb provide a standard for consistent and reproducible curve-fitting and data management for diverse, targeted in vitro assay data with readily available documentation, thus enabling sharing and use of these data in myriad toxicology applications. The software and database updates described herein promote comparability across multiple tiers of data within the US Environmental Protection Agency CompTox Blueprint.
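For readers unfamiliar with the curve shapes involved, the sketch below fits a generic Hill model to synthetic concentration-response data with scipy; it illustrates the concept only and is not the tcpl/tcplfit2 implementation or its parameterization.

```python
import numpy as np
from scipy.optimize import curve_fit

# Generic Hill-model concentration-response fit; a conceptual sketch only,
# not the tcpl/tcplfit2 model forms or winning-model logic.
def hill(conc, top, ac50, hill_coef):
    return top / (1.0 + (ac50 / conc) ** hill_coef)

conc = np.array([0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0, 30.0])
resp = np.array([0.2, 0.5, 1.8, 5.1, 12.0, 19.5, 23.8, 24.9])

params, _ = curve_fit(hill, conc, resp, p0=[25.0, 1.0, 1.0], maxfev=10000)
top, ac50, hill_coef = params
print(f"top={top:.2f}, AC50={ac50:.3f}, Hill coefficient={hill_coef:.2f}")
```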
ABSTRACT
The complexity of analysing data from IoT sensors requires the use of Big Data technologies, posing challenges such as data curation and data quality assessment. Failing to address both aspects can lead to erroneous decision-making (e.g., processing incorrectly treated data, introducing errors into processes, causing damage, or increasing costs). This article presents ELI, an IoT-based Big Data pipeline for developing a data curation process and assessing the usability of data collected by IoT sensors in both offline and online scenarios. We propose the use of a pipeline that integrates data transformation and integration tools with a customisable decision model based on the Decision Model and Notation (DMN) standard to evaluate data quality. Our study emphasises the importance of data curation and quality when integrating IoT information, by identifying and discarding low-quality data that obstruct meaningful insights and introduce errors into decision making. We evaluated our approach in a smart farm scenario using agricultural humidity and temperature data collected from various types of sensors. Moreover, the proposed model exhibited consistent results in both offline and online (stream data) scenarios. In addition, a performance evaluation demonstrates its effectiveness. In summary, this article contributes a usable and effective IoT-based Big Data pipeline with data curation capabilities that assesses data usability in both online and offline scenarios. Additionally, it introduces customisable decision models for measuring data quality across multiple dimensions.
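The decision model can be thought of as a small rule table over quality dimensions such as completeness, plausibility, and timeliness. The sketch below mimics that idea in plain Python; the dimensions, thresholds, and outcome labels are illustrative assumptions, not the DMN tables used in ELI.

```python
from dataclasses import dataclass

# Toy usability decision over sensor readings; rules and thresholds are
# illustrative assumptions, not ELI's actual DMN decision model.
@dataclass
class Reading:
    temperature_c: float | None
    humidity_pct: float | None
    age_seconds: float

def usability(reading: Reading) -> str:
    complete = reading.temperature_c is not None and reading.humidity_pct is not None
    in_range = complete and -20 <= reading.temperature_c <= 60 and 0 <= reading.humidity_pct <= 100
    timely = reading.age_seconds <= 300
    if complete and in_range and timely:
        return "usable"
    if complete and in_range:
        return "usable-with-warning"  # plausible but stale
    return "discard"

print(usability(Reading(temperature_c=24.5, humidity_pct=61.0, age_seconds=42)))
```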
ABSTRACT
The Paris Agreement was signed by 192 Parties, who committed to reducing emissions. Reaching such commitments by developing national decarbonisation strategies requires significant analyses and investment. Analyses for such strategies are often delayed due to a lack of accurate and up-to-date data for creating energy transition models. The Starter Data Kits address this issue by providing open-source, zero-level country datasets to accelerate the energy planning process. There is a strong demand for replicating the process of creating Starter Data Kits because they are currently only available for 69 countries in Africa, Asia, and South America. Using an African country as an example, this paper presents the methodology to create a Starter Data Kit made of tool-agnostic data repositories and OSeMOSYS-specific data files. The paper illustrates the steps involved, provides additional information for conducting similar work in Asia and South America, and highlights the limitations of the current version of the Starter Data Kits. Future development is proposed to expand the datasets, including new and more accurate data and new energy sectors. Therefore, this document provides instructions on the steps and materials required to develop a Starter Data Kit.
• The methodology presented here is intended to encourage practitioners to apply it to new countries and expand the current Starter Data Kits library.
• It is a novel process that creates data pipelines that feed into a single Data Collection and Manipulation Tool (DaCoMaTool).
• It allows for tool-agnostic data creation in a consistent format ready for a modelling analysis using one of the available tools.
ABSTRACT
Introduction: Hemagglutination inhibition (HAI) antibody titers to seasonal influenza strains are important surrogates for vaccine-elicited protection. However, HAI assays can be variable across labs, with low sensitivity across diverse viruses due to a lack of standardization. Performing qualification of these assays at a strain-specific level enables precise and accurate quantification of HAI titers. Influenza A (H3N2) has been a predominant circulating subtype in most countries in Europe and North America since 1968 and is thus a focus of influenza vaccine research. Methods: As part of the National Institutes of Health (NIH)-funded Collaborative Influenza Vaccine Innovation Centers (CIVICs) program, we report on the identification of a robust assay design, rigorous statistical analysis, and complete qualification of an HAI assay using A/Texas/71/2017 as a representative H3N2 strain, guinea pig red blood cells, and the neuraminidase (NA) inhibitor oseltamivir to prevent NA-mediated agglutination. Results: The qualified HAI assay is precise for intermediate precision and intra-operator variability (quantified by the geometric coefficient of variation, GCV), accurate (quantified by relative error), linear (slope of -1, R-square of 1), and robust (<25% GCV), with high specificity and sensitivity. The HAI method was successfully qualified for another H3N2 influenza strain, A/Singapore/INFIMH-16-0019/2016, meeting all pre-specified acceptance criteria. Discussion: These results demonstrate that HAI qualification and data generation for new influenza strains can be achieved efficiently with minimal extra testing and development. We report a qualified and adaptable influenza serology method and analysis strategy to measure quantifiable HAI titers and define correlates of vaccine-mediated protection in human clinical trials.
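A common way to compute the geometric coefficient of variation used in such precision assessments is on the natural-log scale, as sketched below; the replicate titers are made-up values for illustration.

```python
import numpy as np

# Geometric coefficient of variation: GCV% = 100 * sqrt(exp(s^2) - 1),
# where s is the sample SD of ln(titer). Example titers are illustrative.
def geometric_cv_percent(titers) -> float:
    log_titers = np.log(np.asarray(titers, dtype=float))
    s = np.std(log_titers, ddof=1)
    return 100.0 * np.sqrt(np.exp(s ** 2) - 1.0)

replicate_titers = [160, 160, 320, 160, 320, 160]
print(f"GCV = {geometric_cv_percent(replicate_titers):.1f}%")
```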
Subject(s)
Influenza Vaccines, Human Influenza, United States, Humans, Animals, Guinea Pigs, Influenza A Virus H3N2 Subtype, Hemagglutination, Antiviral Antibodies
ABSTRACT
As agricultural and environmental research projects become more complex, increasingly with multiple outcomes, the demand for technical support with experiment management and data handling has also increased. Interactive visualisation solutions are user-friendly and provide direct information to facilitate decision making with timely data interpretation. Existing off-the-shelf tools can be expensive and require a specialist to conduct the development of visualisation solutions. We used open-source software to develop a customised near real-time interactive dashboard system to support science experiment decision making. Our customisation allowed for:
• Digitalised domain knowledge via open-source solutions to develop decision support systems.
• Automated workflow that only executed the necessary components.
• Modularised solutions for low maintenance cost and upgrades.
ABSTRACT
ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) has gained wide popularity as a fast, straightforward, and efficient way of generating genome-wide maps of open chromatin and guiding identification of active regulatory elements and inference of DNA protein binding locations. Given the ubiquity of this method, uniform and standardized methods for processing and assessing the quality of ATAC-seq datasets are needed. Here, we describe the data processing pipeline used by the ENCODE (Encyclopedia of DNA Elements) consortium to process ATAC-seq data into peak call sets and signal tracks and to assess the quality of these datasets.
Subject(s)
Chromatin Immunoprecipitation Sequencing, High-Throughput Nucleotide Sequencing, DNA Sequence Analysis/methods, High-Throughput Nucleotide Sequencing/methods, Chromatin, DNA/genetics
ABSTRACT
The presence of chemicals causing significant adverse human health and environmental effects during end-of-life (EoL) stages is a challenge for implementing sustainable management efforts and transitioning towards a safer circular life cycle. Conducting chemical risk evaluation and exposure assessment of potential EoL scenarios can help understand the chemical EoL management chain for its safer utilization in a circular life-cycle environment. However, the first step is to track the chemical flows, estimate releases, and identify potential exposure pathways. Hence, this work proposes an EoL data engineering approach to perform chemical flow analysis and screening to support risk evaluation and exposure assessment for designing a safer circular life cycle of chemicals. This work uses publicly available data to identify potential post-recycling scenarios (e.g., industrial processing/use operations), estimate inter-industry chemical transfers, and identify exposure pathways to chemicals of interest. A case study demonstration shows how the data engineering framework identifies, estimates, and tracks chemical flow transfers from EoL stage facilities (e.g., recycling and recovery) to upstream chemical life cycle stage facilities (e.g., manufacturing). Also, the proposed framework considers current regulatory constraints on closing the recycling loop and provides a range of values for the flow allocated to post-recycling uses associated with occupational exposure and fugitive air releases from EoL operations.
ABSTRACT
Current research on Digital Twin (DT) is largely focused on the performance of built assets in their operational phases as well as on the urban environment. The construction phase, however, has received far less attention. This paper therefore proposes a Digital Twin framework for the construction phase, develops a DT prototype, and tests it on the use case of measuring productivity and monitoring earthwork operations. The DT framework and its prototype are underpinned by the principles of versatility, scalability, usability, and automation to enable the DT to fulfil the requirements of large-sized earthwork projects and the dynamic nature of their operation. Cloud computing and dashboard visualisation were deployed to enable automated and repeatable data pipelines and data analytics at scale and to provide insights in near-real time. The testing of the DT prototype in a motorway project in the Northeast of England successfully demonstrated its ability to produce key insights by using the following approaches: (i) predicting equipment utilisation ratios and productivities; (ii) detecting the percentage of time spent on different tasks (i.e., loading, hauling, dumping, returning, or idling), the distance travelled by equipment over time, and the speed distribution; and (iii) visualising certain earthwork operations.
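As an illustration of approach (i), the sketch below computes an equipment utilisation ratio from hypothetical activity logs; the log schema and activity labels are assumptions, not the prototype's actual data model.

```python
import pandas as pd

# Utilisation ratio = productive time / total logged time per equipment item;
# the schema and labels below are illustrative assumptions only.
logs = pd.DataFrame({
    "equipment_id": ["EX01", "EX01", "EX01", "DT07", "DT07"],
    "activity": ["loading", "idling", "loading", "hauling", "idling"],
    "duration_min": [42.0, 18.0, 35.0, 55.0, 25.0],
})

PRODUCTIVE = {"loading", "hauling", "dumping", "returning"}

logs["productive_min"] = logs["duration_min"].where(logs["activity"].isin(PRODUCTIVE), 0.0)
totals = logs.groupby("equipment_id")[["productive_min", "duration_min"]].sum()
utilisation = totals["productive_min"] / totals["duration_min"]
print(utilisation)  # fraction of logged time spent on productive tasks
```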
ABSTRACT
Educators seek to harness knowledge from educational corpora to improve student performance outcomes. Although prior studies have compared the efficacy of data mining methods (DMMs) in pipelines for forecasting student success, less work has focused on identifying a set of relevant features prior to model development and quantifying the stability of feature selection techniques. Pinpointing a subset of pertinent features can (1) reduce the number of variables that need to be managed by stakeholders, (2) make "black-box" algorithms more interpretable, and (3) provide greater guidance for faculty to implement targeted interventions. To that end, we introduce a methodology integrating feature selection with cross-validation and rank each feature on subsets of the training corpus. This modified pipeline was applied to forecast the performance of 3225 students in a baccalaureate science course using a set of 57 features, four DMMs, and four filter feature selection techniques. Correlation Attribute Evaluation (CAE) and Fisher's Scoring Algorithm (FSA) achieved significantly higher Area Under the Curve (AUC) values for logistic regression (LR) and elastic net regression (GLMNET), compared to when this pipeline step was omitted. Relief Attribute Evaluation (RAE) was highly unstable and produced models with the poorest prediction performance. Borda's method identified grade point average, number of credits taken, and performance on concept inventory assessments as the primary factors impacting predictions of student performance. We discuss the benefits of this approach when developing data pipelines for predictive modeling in undergraduate settings that are more interpretable and actionable for faculty and stakeholders. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s41239-021-00279-6.
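A minimal sketch of the general pattern, feature selection nested inside cross-validation so the selector is refit on every training fold, is shown below using scikit-learn; the synthetic data, value of k, and classifier are illustrative and do not reproduce the study's 57-feature course data or its four DMMs.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Feature selection inside the cross-validation pipeline, so the selector is
# refit on each training fold; the dataset and k are illustrative only.
X, y = make_classification(n_samples=500, n_features=57, n_informative=8, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])

auc_scores = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")
print(f"Mean AUC: {auc_scores.mean():.3f}")
```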
ABSTRACT
INTRODUCTION: This study proposes a novel data pipeline based on micro-computed tomographic (micro-CT) data for training the U-Net network to realize automatic and accurate segmentation of the pulp cavity and tooth on cone-beam computed tomographic (CBCT) images. METHODS: We collected CBCT data and micro-CT data of 30 teeth. CBCT data were processed and transformed into small field-of-view, high-resolution CBCT images of each tooth. Twenty-five sets were randomly assigned to the training set and the remaining 5 sets to the test set. We used 2 data pipelines for U-Net network training: one manually labeled by an endodontic specialist as the control group and one processed from the micro-CT data as the experimental group. The 3-dimensional models constructed using micro-CT data in the test set were taken as the ground truth. The Dice similarity coefficient, precision rate, recall rate, average symmetric surface distance, Hausdorff distance, and morphologic analysis were used for performance evaluation. RESULTS: For the experimental group, the Dice similarity coefficient, precision rate, recall rate, average symmetric surface distance, and Hausdorff distance were 96.20% ± 0.58%, 97.31% ± 0.38%, 95.11% ± 0.97%, 0.09 ± 0.01 mm, and 1.54 ± 0.51 mm for the tooth and 86.75% ± 2.42%, 84.45% ± 7.77%, 89.94% ± 4.56%, 0.08 ± 0.02 mm, and 1.99 ± 0.67 mm for the pulp cavity, respectively, all better than the control group. Morphologic analysis also suggested that the segmentation results of the experimental group were better than those of the control group. CONCLUSIONS: This study proposed an automatic and accurate approach for tooth and pulp cavity segmentation on CBCT images, which can be applied in research and clinical tasks.
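For reference, the voxel-wise Dice similarity coefficient, precision, and recall used in such evaluations can be computed as sketched below; this is a generic illustration, not the evaluation code used in the study.

```python
import numpy as np

# Voxel-wise Dice, precision, and recall for binary segmentation masks;
# a generic illustration of the metrics, not the study's evaluation code.
def segmentation_metrics(pred: np.ndarray, truth: np.ndarray) -> dict:
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.logical_and(pred, truth).sum()
    return {
        "dice": 2 * tp / (pred.sum() + truth.sum()),
        "precision": tp / pred.sum(),
        "recall": tp / truth.sum(),
    }

if __name__ == "__main__":
    prediction = np.array([[0, 1, 1], [0, 1, 0]])
    ground_truth = np.array([[0, 1, 1], [1, 1, 0]])
    print(segmentation_metrics(prediction, ground_truth))
```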
Subject(s)
Artificial Intelligence, Tooth, Cone-Beam Computed Tomography, Dental Pulp Cavity, Computer-Assisted Image Processing, Tooth/diagnostic imaging, X-Ray Microtomography
ABSTRACT
Transverse cracks on bridge decks provide a path for chloride penetration and are a major cause of deck deterioration. For this reason, collecting information on the widths and spacing of transverse cracks is important. In this study, we focused on developing a data pipeline for automated crack detection using non-contact optical sensors. We developed a data acquisition system that is able to acquire data in a fast and simple way without obstructing traffic. Because GPS is not always available and odometer sensor data can only provide relative positions along the direction of traffic, we focused on providing an alternative localization strategy using only optical sensors. In addition, to improve existing crack detection methods, which mostly rely on the low-intensity and localized line-segment characteristics of cracks, we incorporated the direction and shape of the cracks to make our machine learning approach smarter. The proposed system may serve as a useful inspection tool for big data analytics because it is easy to deploy and provides multiple properties of cracks. Progression of crack deterioration, if any, in both spatial and temporal scales can be checked and compared if the system is deployed multiple times.
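To illustrate how crack direction can serve as a filter, the sketch below detects line segments with OpenCV and keeps those roughly perpendicular to an assumed direction of traffic; the thresholds and angle window are illustrative assumptions, not the paper's trained detection model.

```python
import math
import cv2
import numpy as np

# Orientation-aware crack candidate filtering; thresholds and the "transverse"
# angle window are illustrative assumptions, not the paper's pipeline.
def transverse_crack_segments(gray: np.ndarray, max_deviation_deg: float = 20.0):
    edges = cv2.Canny(gray, 50, 150)
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=40,
                            minLineLength=30, maxLineGap=5)
    if lines is None:
        return []
    transverse = []
    for x1, y1, x2, y2 in lines[:, 0]:
        angle = abs(math.degrees(math.atan2(y2 - y1, x2 - x1)))
        # Keep segments roughly perpendicular to the direction of traffic,
        # assumed here to run along the image x-axis.
        if abs(angle - 90.0) <= max_deviation_deg:
            transverse.append((x1, y1, x2, y2))
    return transverse

# image = cv2.imread("deck_scan.png", cv2.IMREAD_GRAYSCALE)
# print(len(transverse_crack_segments(image)))
```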