Results 1 - 20 of 40
1.
BMC Bioinformatics ; 24(1): 354, 2023 Sep 21.
Article in English | MEDLINE | ID: mdl-37735350

ABSTRACT

BACKGROUND: Plummeting DNA sequencing costs in recent years have enabled genome sequencing projects to scale up by several orders of magnitude, transforming genomics into a highly data-intensive field of research. This development provides the much-needed statistical power required for genotype-phenotype predictions in complex diseases. METHODS: In order to leverage this wealth of information efficiently, we assessed several genomic data science tools. We focus on on-premise installations to cover situations where data confidentiality and compliance regulations rule out cloud-based solutions. We established a comprehensive qualitative and quantitative comparison of BCFtools, SnpSift, Hail, GEMINI, and OpenCGA. The tools were compared in terms of data storage technology, query speed, scalability, annotation, data manipulation, visualization, data output representation, and availability. RESULTS: Tools that leverage sophisticated data structures are, to varying degrees of scalability, the most suitable for large-scale projects, in comparison to flat-file manipulation (e.g., BCFtools and SnpSift). Remarkably, for small to mid-size projects, even a lightweight relational database can be adequate. CONCLUSION: The assessment criteria provide insights into the typical questions posed in scalable genomics and serve as guidance for the development of scalable computational infrastructure in genomics.
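
As a concrete illustration of the query-speed criterion assessed above, the following minimal Python sketch times a region-plus-filter lookup with BCFtools, one of the compared flat-file tools. The VCF path, region and allele-frequency threshold are hypothetical, and the file is assumed to be bgzipped and tabix-indexed.

    import subprocess, time

    # Hypothetical inputs: a bgzipped, indexed VCF and a target region.
    vcf = "cohort.vcf.gz"
    region = "chr1:1000000-2000000"

    start = time.time()
    # Region query plus an INFO-based filter, the kind of lookup a query-speed benchmark measures.
    result = subprocess.run(
        ["bcftools", "view", "-r", region, "-i", "INFO/AF<0.01", vcf],
        capture_output=True, text=True, check=True)
    elapsed = time.time() - start

    n_records = sum(1 for line in result.stdout.splitlines() if not line.startswith("#"))
    print(f"{n_records} rare variants in {region} retrieved in {elapsed:.2f} s")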


Subjects
Data Science; Genomics; Chromosome Mapping; Databases, Factual; Sequence Analysis, DNA
2.
Cluster Comput ; : 1-18, 2023 Apr 08.
Article in English | MEDLINE | ID: mdl-37359058

ABSTRACT

With the advent of ICT-based healthcare applications, various formats of health data are generated every day in huge volumes. Such data, consisting of unstructured, semi-structured and structured data, have every characteristic of Big Data. NoSQL databases are generally preferred for storing such health data, with the objective of improving query performance. However, for efficient retrieval and processing of Big Health Data and for resource optimization, suitable data models and NoSQL database designs are important requirements. Unlike for relational databases, no standard methods or tools exist for NoSQL database design. In this work, we adopt an ontology-based schema design approach and propose that an ontology, which captures the domain knowledge, be used for developing a health data model. An ontology for primary healthcare is described in this paper. We also propose an algorithm for designing the schema of a NoSQL database, keeping in mind the characteristics of the target NoSQL store and using a related ontology, a sample query set, statistical information about the queries, and the performance requirements of the query set. The proposed primary-healthcare ontology and the above-mentioned algorithm, together with a set of queries, are used to generate a schema targeting the MongoDB datastore. The performance of the proposed design is compared with that of a relational model developed for the same primary healthcare data, and the effectiveness of our approach is demonstrated. The entire experiment was carried out on the MongoDB cloud platform.
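
The sketch below illustrates the general idea of ontology-informed document design for MongoDB described in this abstract: concepts that the sample queries access together are embedded in one document, and indexes follow the most frequent filter fields. The collection layout, field names and index choices are illustrative assumptions, not the schema produced by the paper's algorithm.

    from pymongo import MongoClient, ASCENDING

    # Hypothetical layout derived from ontology classes Patient -> Visit -> Diagnosis:
    # concepts queried together are embedded in one document to avoid joins.
    client = MongoClient("mongodb://localhost:27017")
    db = client["primary_care"]

    patient = {
        "patient_id": "P001",
        "demographics": {"age": 54, "sex": "F"},
        "visits": [  # one-to-many relation embedded, per the access pattern of the sample queries
            {"date": "2023-01-12", "diagnosis": "hypertension", "prescriptions": ["amlodipine"]},
        ],
    }
    db.patients.insert_one(patient)

    # Indexes chosen from the query set and its statistics (most frequent filter fields).
    db.patients.create_index([("visits.diagnosis", ASCENDING)])
    db.patients.create_index([("demographics.age", ASCENDING)])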

3.
J Biomed Inform ; 114: 103670, 2021 02.
Article in English | MEDLINE | ID: mdl-33359548

ABSTRACT

With the extensive adoption of electronic health records (EHRs) by healthcare organizations, more effort is needed to manage and utilize such massive, varied, and complex healthcare data. A database's performance and suitability for healthcare tasks are strongly affected by how well its data storage model and query capabilities are adapted to the use-case scenario. At the same time, standardized healthcare data modeling is one of the most promising paths toward semantic interoperability, facilitating the integration of patient data from different healthcare systems. This paper compares the state of the art of the most important database management systems used for storing standardized EHR data. It discusses the appropriateness of different database models for meeting different EHR functions under different database specifications and workload scenarios. Insights from the relevant literature show how flexible NoSQL databases (document, column, and graph) deal effectively with the distinctive features of standardized EHR data, especially in distributed healthcare systems, leading to better EHR management.


Subjects
Database Management Systems; Electronic Health Records; Databases, Factual; Delivery of Health Care; Humans; Information Storage and Retrieval
4.
J Biomed Inform ; 110: 103549, 2020 10.
Article in English | MEDLINE | ID: mdl-32871286

ABSTRACT

In healthcare applications, developing a data model for storing patient-doctor relationships is important. Although relational models are popular for many commercial and business applications, they may not be appropriate for modeling patient-doctor relationships, owing to the inherently irregular nature and complexity of such relationships. In this paper, as a case study, we propose to build a doctor recommendation system for patients. The recommendation system is built on top of a multilayer graph data model. Contemporary research has already shown that multilayer graph data models can be used efficiently in many applications where large, heterogeneous data must be modeled. As part of the recommendation system, the paper also introduces a concept of trust, an important ingredient of any kind of recommendation. The trust factor introduced in the paper exploits certain characteristics of the multilayer graph model. The paper also presents an analysis demonstrating the efficiency of the graph data model in comparison with a relational data model.
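
A minimal sketch of a multilayer patient-doctor graph with trust-weighted edges, using networkx; the layer names, trust values and ranking rule are hypothetical and stand in for the paper's actual model.

    import networkx as nx

    # Hypothetical multilayer patient-doctor graph: the layer is stored as an edge attribute,
    # and trust is a numeric edge weight derived from past interactions.
    G = nx.MultiGraph()
    G.add_edge("patient:alice", "doctor:rao", layer="consultation", trust=0.9)
    G.add_edge("patient:alice", "doctor:kim", layer="consultation", trust=0.4)
    G.add_edge("patient:bob", "doctor:rao", layer="referral", trust=0.7)

    def recommend(graph, patient, layer):
        """Rank doctors connected to a patient in a given layer by accumulated trust."""
        scores = {}
        for _, doctor, data in graph.edges(patient, data=True):
            if data["layer"] == layer and doctor.startswith("doctor:"):
                scores[doctor] = scores.get(doctor, 0.0) + data["trust"]
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    print(recommend(G, "patient:alice", "consultation"))  # doctor:rao ranked first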


Subjects
Physician-Patient Relations; Trust; Databases, Factual; Humans
5.
BMC Bioinformatics ; 20(Suppl 19): 658, 2019 Dec 24.
Article in English | MEDLINE | ID: mdl-31870297

ABSTRACT

BACKGROUND: Studying the structural and functional morphology of small organisms such as monogeneans is difficult due to the lack of three-dimensional visualization. One possible way to resolve this issue is to create digital 3D models, which may aid researchers in studying the morphology and function of monogeneans. However, developing 3D models is a tedious procedure, as the entire complicated modelling process must be repeated for every new target 3D shape in comprehensive 3D modelling software. This study was designed to develop an alternative 3D modelling approach for building 3D models of monogenean anchors, which can be used to understand these morphological structures in three dimensions while avoiding the need to repeat the tedious modelling procedure from scratch for every single target model. RESULTS: An automated 3D modelling pipeline empowered by an artificial neural network (ANN) was developed. The pipeline enables the automated deformation of a generic 3D model of a monogenean anchor into another target 3D anchor. Using it, the 8 target 3D models of monogenean anchors (representing 8 species: Dactylogyrus primarius, Pellucidhaptor merus, Dactylogyrus falcatus, Dactylogyrus vastator, Dactylogyrus pterocleidus, Dactylogyrus falciunguis, Chauhanellus auriculatum and Chauhanellus caelatus) were generated automatically from the respective 2D illustration inputs without repeating the tedious modelling procedure. CONCLUSIONS: Despite some constraints and limitations, the automated 3D modelling pipeline developed in this study demonstrates a working application of a machine learning approach to 3D modelling. This study has not only developed an automated 3D modelling pipeline but has also demonstrated a cross-disciplinary research design that integrates machine learning into a specific domain of study such as 3D modelling of biological structures.
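
A toy sketch of the underlying idea, learning a mapping from 2D illustration landmarks to deformations of a generic 3D anchor mesh with a small neural network; the data are random placeholders and the network layout is an assumption, not the pipeline's actual architecture.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    # Toy stand-in: learn a mapping from flattened 2D landmark coordinates digitised on an
    # illustration to per-vertex 3D offsets applied to a generic anchor mesh.
    rng = np.random.default_rng(0)
    n_specimens, n_landmarks, n_vertices = 8, 10, 50

    X = rng.normal(size=(n_specimens, n_landmarks * 2))        # 2D landmarks per illustration
    Y = rng.normal(size=(n_specimens, n_vertices * 3)) * 0.05  # target vertex displacements

    model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
    model.fit(X, Y)

    generic_mesh = rng.normal(size=(n_vertices, 3))            # generic anchor model
    offsets = model.predict(X[:1]).reshape(n_vertices, 3)      # predicted deformation
    deformed_mesh = generic_mesh + offsets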


Subjects
Machine Learning; Automation, Laboratory; Imaging, Three-Dimensional; Software
6.
BMC Genomics ; 20(Suppl 11): 948, 2019 Dec 20.
Article in English | MEDLINE | ID: mdl-31856721

ABSTRACT

BACKGROUND: Long-read sequencing has shown promise in overcoming the short-length limitations of second-generation sequencing by providing more complete assemblies. However, computation on long sequencing reads is challenged by their higher error rates (e.g., 13% vs. 1%) and higher cost ($0.3 vs. $0.03 per Mbp) compared to short reads. METHODS: In this paper, we present a new hybrid error correction tool called ParLECH (Parallel Long-read Error Correction using Hybrid methodology). ParLECH's error correction algorithm is distributed in nature and efficiently utilizes the k-mer coverage information of high-throughput Illumina short-read sequences to rectify PacBio long-read sequences. ParLECH first constructs a de Bruijn graph from the short reads, and then replaces the indel error regions of the long reads with their corresponding widest path (i.e., maximum min-coverage path) in the short-read-based de Bruijn graph. ParLECH then utilizes the k-mer coverage information of the short reads to divide each long read into a sequence of low- and high-coverage regions, followed by majority voting to rectify each substitution error base. RESULTS: ParLECH outperforms the latest state-of-the-art hybrid error correction methods on real PacBio datasets. Our experimental evaluation demonstrates that ParLECH can correct large-scale real-world datasets accurately and scalably. ParLECH can correct the indel errors of human-genome PacBio long reads (312 GB) with Illumina short reads (452 GB) in less than 29 h using 128 compute nodes, and it can align more than 92% of the bases of an E. coli PacBio dataset to the reference genome, demonstrating its accuracy. CONCLUSION: ParLECH can scale to terabytes of sequencing data using hundreds of computing nodes. The proposed hybrid error correction methodology is novel and rectifies both indel and substitution errors present in the original long reads or newly introduced by the short reads.
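
The core graph primitive described here, the widest (maximum min-coverage) path, can be computed with a Dijkstra-like search. The sketch below is a single-machine illustration with a toy k-mer graph and made-up coverage values, not ParLECH's distributed implementation.

    import heapq

    def widest_path(graph, src, dst):
        """Maximum min-coverage (widest) path in a weighted de Bruijn-style graph.

        graph: dict node -> list of (neighbour, coverage) edges.
        Returns (bottleneck coverage, path) or (0, []) if dst is unreachable.
        """
        best = {src: float("inf")}
        prev = {}
        heap = [(-float("inf"), src)]          # max-heap on the path bottleneck
        while heap:
            width, node = heapq.heappop(heap)
            width = -width
            if node == dst:
                path = [dst]
                while path[-1] != src:
                    path.append(prev[path[-1]])
                return width, path[::-1]
            for nbr, cov in graph.get(node, []):
                w = min(width, cov)            # bottleneck so far through this edge
                if w > best.get(nbr, 0):
                    best[nbr] = w
                    prev[nbr] = node
                    heapq.heappush(heap, (-w, nbr))
        return 0, []

    # Toy k-mer graph with short-read coverage on the edges (hypothetical values).
    g = {"ACG": [("CGT", 30), ("CGA", 5)], "CGT": [("GTT", 25)], "CGA": [("GAT", 4)],
         "GTT": [("TTC", 28)], "GAT": [("ATC", 3)], "TTC": [], "ATC": []}
    print(widest_path(g, "ACG", "TTC"))        # the high-coverage branch wins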


Subjects
Algorithms; Computational Biology/methods; High-Throughput Nucleotide Sequencing/methods; Sequence Analysis, DNA/methods; Databases, Genetic; Genome/genetics; Mutation; Software
7.
BMC Med Inform Decis Mak ; 19(1): 25, 2019 01 28.
Article in English | MEDLINE | ID: mdl-30691467

ABSTRACT

BACKGROUND: Frailty is a common clinical syndrome in the ageing population that carries an increased risk of adverse health outcomes including falls, hospitalization, disability, and mortality. As these outcomes affect health and social care planning, recent years have seen growing investment in monitoring and prevention strategies. Although a number of electronic health record (EHR) systems have been developed, including personalized virtual patient models, few systems are oriented toward the ageing population. METHODS: We exploit the openEHR framework for the representation of frailty in the ageing population in order to attain semantic interoperability, and we present the methodology for adopting or developing archetypes. We also propose a framework for a one-to-one mapping between openEHR archetypes and a column-family NoSQL database (HBase), aiming at the integration of existing and newly developed archetypes into it. RESULTS: The requirement analysis of our study resulted in the definition of 22 coherent and clinically meaningful parameters for the description of frailty in older adults. The implemented openEHR methodology led to the direct use of 22 archetypes, the modification and reuse of two archetypes, and the development of 28 new archetypes. Additionally, the mapping procedure led to two different HBase tables for the storage of the data. CONCLUSIONS: In this work, an openEHR-based virtual patient model has been designed and integrated into an HBase storage system, exploiting the advantages of the underlying technologies. This framework can serve as a basis for the development of a decision support system using openEHR's Guideline Definition Language in the future.
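
A minimal sketch of the described one-to-one mapping idea (archetype to column family, archetype element to column qualifier) using the happybase HBase client; the table name, column families and row-key convention are assumptions for illustration.

    import happybase

    # Hypothetical mapping: one archetype -> one column family, one archetype element ->
    # one column qualifier; the row key combines patient id and event time.
    connection = happybase.Connection("localhost")   # assumes an HBase Thrift server
    connection.create_table("frailty", {"assessment": dict(), "demographics": dict()})
    table = connection.table("frailty")

    row_key = b"patient42#2019-01-15T10:30"
    table.put(row_key, {
        b"demographics:age": b"81",
        b"assessment:grip_strength_kg": b"18.5",      # element of a frailty archetype
        b"assessment:gait_speed_m_s": b"0.6",
        b"assessment:weight_loss": b"yes",
    })

    print(dict(table.row(row_key, columns=[b"assessment"])))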


Subjects
Aging; Electronic Health Records; Frailty; Health Information Interoperability; Models, Theoretical; Aged; Frailty/classification; Humans; Semantics
8.
Distrib Parallel Databases ; 37(2): 235-250, 2019 Jun.
Article in English | MEDLINE | ID: mdl-32661457

ABSTRACT

Digital imaging plays a critical role in image-guided diagnosis and clinical trials, and the amount of image data is growing fast. There are two major requirements for image data management: scalability to massive volumes and support for comprehensive queries. Traditional Picture Archiving and Communication Systems (PACS) are based on relational data management systems and suffer from limited scalability and query support; new systems that support fast, scalable and comprehensive queries on image data are therefore in high demand. In this paper, we introduce two alternative approaches: DCMRL/XMLStore (RL/XML for short), a parallel, hybrid relational and XML data management approach, and DCMDocStore (DOC for short), a NoSQL document store approach. DCMRL/XMLStore manages DICOM images as binary large objects and metadata as relational tables and XML documents based on IBM DB2, which is parallelized through data partitioning. DCMDocStore manages DICOM metadata as JSON objects and DICOM images as encoded attachments in MongoDB running on multiple nodes. We have delivered two open-source systems, DCMRL/XMLStore and DCMDocStore, both of which support scalable data management and comprehensive queries, and we evaluated them with nearly one million DICOM images from the National Biomedical Imaging Archive. The results show that DCMDocStore demonstrates high data-loading speed, high scalability and fault tolerance, while DCMRL/XMLStore provides efficient queries but slower data loading. Traditional PACS systems have inherent limitations on flexible queries and scalability for massive amounts of images.
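
A minimal sketch of the DCMDocStore-style approach, storing DICOM metadata as a JSON document and the image as an attachment, here via pydicom, pymongo and GridFS; the field selection, collection names and file path are illustrative and may differ from the actual system.

    import pydicom, gridfs
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    db = client["imaging"]
    fs = gridfs.GridFS(db)                      # pixel data stored as attachments

    path = "slice_0001.dcm"                     # hypothetical DICOM file
    ds = pydicom.dcmread(path)
    with open(path, "rb") as f:
        image_id = fs.put(f, filename=path)

    # A small JSON metadata document; a real schema would carry many more DICOM tags.
    meta = {
        "StudyInstanceUID": str(ds.StudyInstanceUID),
        "SeriesInstanceUID": str(ds.SeriesInstanceUID),
        "Modality": str(ds.Modality),
        "PatientID": str(ds.PatientID),
        "image_file_id": image_id,
    }
    db.instances.insert_one(meta)
    db.instances.create_index("StudyInstanceUID")

    # Comprehensive query example: all CT instances of one study.
    cursor = db.instances.find({"StudyInstanceUID": meta["StudyInstanceUID"], "Modality": "CT"})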

9.
BMC Bioinformatics ; 19(Suppl 11): 359, 2018 Oct 22.
Article in English | MEDLINE | ID: mdl-30343662

ABSTRACT

BACKGROUND: The Epi-Info software suite, built and maintained by the Centers for Disease Control and Prevention (CDC), is widely used by epidemiologists and public health researchers to collect and analyze public health data, especially in the event of outbreaks such as Ebola and Zika. As it exists today, Epi-Info Desktop runs only on the Windows platform, and the larger Epi-Info suite of products consists of separate codebases for several different devices and use cases. Software portability has become increasingly important over the past few years, as it offers a number of obvious benefits, including reduced development time, reduced cost, and simplified system architecture. There is thus a clear need for continued research; in particular, it is critical to fully understand the performance penalties that can arise in platform-agnostic systems, since such understanding should allow for improved designs that substantially mitigate any performance loss. In this paper, we present a viable cross-platform architecture for Epi-Info which addresses many of these problems. RESULTS: We have successfully generated executables for Linux, Mac, and Windows from a single codebase, and we have shown that performance need not be completely sacrificed when building a cross-platform application. This was accomplished by using Electron as a wrapper for an AngularJS app, a Python analytics module, and a local, browser-based NoSQL database. CONCLUSIONS: The promising results warrant future research. Specifically, the design allows for cross-platform form design, data collection, offline/online modes, scalable storage, automatic local-to-remote data synchronization, and fast analytics that rival more traditional approaches.


Subjects
Epidemiologic Studies; Software; Databases, Factual; Disease Outbreaks; Humans; Public Health; User-Computer Interface; Workflow
10.
BMC Bioinformatics ; 19(Suppl 10): 351, 2018 Oct 15.
Article in English | MEDLINE | ID: mdl-30367571

ABSTRACT

BACKGROUND: Nowadays, the increasing availability of omics data, due to advances both in the acquisition of molecular biology results and in systems biology simulation technologies, provides the basis for precision medicine. Success in precision medicine depends on access to healthcare and biomedical data, and to this end the digitization of all clinical exams and medical records is becoming a standard in hospitals. Digitization is essential to collect, share, and aggregate large volumes of heterogeneous data to support the discovery of hidden patterns, with the aim of defining predictive models for biomedical purposes. Sharing patients' data is a critical process, as it raises ethical, social, legal, and technological issues that must be properly addressed. RESULTS: In this work, we present an infrastructure devised to deal with the integration of large volumes of heterogeneous biological data. The infrastructure was applied to the data collected between 2010 and 2016 in one of the major diagnostic analysis laboratories in Italy. Data from three different platforms were collected (laboratory exams, pathological anatomy exams, and biopsy exams). The infrastructure has been designed to allow the extraction and aggregation of both unstructured and semi-structured data, and the data are treated so as to ensure security and privacy. Specialized algorithms have also been implemented to process the aggregated information with the aim of obtaining a precise historical analysis of the clinical activities of one or more patients. Moreover, three Bayesian classifiers have been developed to analyze examinations reported as free text. Experimental results show that the classifiers exhibit good accuracy when used to analyze sentences related to the sample location, the presence of disease, and the status of the illness. CONCLUSIONS: The infrastructure allows the integration of multiple, heterogeneous sources of anonymized data from the different clinical platforms. Both unstructured and semi-structured data are processed to obtain a precise historical analysis of the clinical activities of one or more patients. Data aggregation makes it possible to perform the statistical assessments required to answer complex questions in a variety of fields, such as predictive and precision medicine. In particular, studying the clinical history of patients who have developed similar pathologies can help to predict or identify markers that enable early diagnosis of possible illnesses.
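
A toy sketch of a Bayesian free-text classifier of the kind described, using a bag-of-words naive Bayes model; the example sentences and labels are invented placeholders, not data from the infrastructure.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Tiny, made-up training sentences standing in for free-text report fragments;
    # labels mark whether a disease is reported as present.
    sentences = [
        "fragment shows chronic gastritis with Helicobacter-like organisms",
        "no evidence of malignancy in the examined sample",
        "biopsy of the colon reveals adenocarcinoma",
        "mucosa within normal limits, no dysplasia identified",
    ]
    labels = ["present", "absent", "present", "absent"]

    clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
    clf.fit(sentences, labels)

    print(clf.predict(["sample shows moderate dysplasia"]))
    print(clf.predict_proba(["no evidence of disease"]))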


Subjects
Big Data; Data Analysis; Precision Medicine; Algorithms; Bayes Theorem; Biopsy; Computer Simulation; Humans; Machine Learning
11.
Sensors (Basel) ; 18(9)2018 Sep 10.
Article in English | MEDLINE | ID: mdl-30201942

ABSTRACT

With the rapid development of mobile devices and sensors, effective search methods for big spatial data have recently received a significant amount of attention. Owing to its large size, many applications store recently generated spatial data in NoSQL databases such as HBase. As the HBase index only supports one-dimensional row keys, spatial data are commonly enumerated using linearization techniques. However, linearization cannot completely guarantee the spatial proximity of data, so several studies have attempted to reduce false positives in spatial query processing by implementing a multi-dimensional indexing layer. In this paper, we propose a hierarchical indexing structure called the quadrant-based minimum bounding rectangle (QbMBR) tree for effective spatial query processing in HBase. In our method, spatial objects are grouped more precisely using QbMBRs and indexed based on them. The QbMBR tree not only provides more selective query processing but also reduces the storage space required for indexing. Based on the QbMBR tree index, two query-processing algorithms, for range queries and kNN queries, are also proposed in this paper. The algorithms significantly reduce query execution times by prefetching the necessary index nodes into memory while traversing the QbMBR tree. Experimental analysis demonstrates that our method significantly outperforms existing methods.
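
For context, a common linearization technique of the kind the abstract refers to is the Z-order (Morton) curve. The sketch below shows how it turns 2D coordinates into one-dimensional row keys and why spatially adjacent points can still end up far apart in key space, which is the false-positive problem a multi-dimensional layer such as the QbMBR tree addresses; the bit width and coordinates are arbitrary.

    def morton_key(x, y, bits=16):
        """Interleave the bits of two grid coordinates into one Z-order (Morton) code,
        usable as a one-dimensional HBase row key for a 2D point."""
        key = 0
        for i in range(bits):
            key |= ((x >> i) & 1) << (2 * i)
            key |= ((y >> i) & 1) << (2 * i + 1)
        return key

    # Two points whose Z-order codes are close together...
    print(morton_key(10, 10), morton_key(11, 10))
    # ...and two spatially adjacent points whose codes are far apart, illustrating why
    # pure linearization yields false positives in range scans.
    print(morton_key(31, 31), morton_key(32, 32))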

12.
BMC Med Inform Decis Mak ; 17(1): 123, 2017 Aug 18.
Article in English | MEDLINE | ID: mdl-28821246

ABSTRACT

BACKGROUND: The objective of this research is to compare relational and non-relational (NoSQL) database system approaches for storing, recovering, querying and persisting standardized medical information in the form of ISO/EN 13606 normalized electronic health record XML extracts, both in isolation and concurrently. NoSQL database systems have recently attracted much attention, but few studies in the literature compare them directly with relational databases for building the persistence layer of a standardized medical information system. METHODS: One relational and two NoSQL databases (one document-based and one native XML database), each at three different sizes, were created in order to evaluate and compare the response times (algorithmic complexity) of six queries of growing complexity performed on them. Comparable results available in the literature were also considered. RESULTS: Both relational and non-relational NoSQL database systems show almost linear algorithmic complexity in query execution, but with very different linear slopes, the former being much steeper than the latter two. Document-based NoSQL databases perform better under concurrency than in isolation, and also better than relational databases under concurrency. CONCLUSION: Non-relational NoSQL databases appear more appropriate than standard relational SQL databases when database size is extremely large (secondary use, research applications). Document-based NoSQL databases generally perform better than native XML NoSQL databases, and visualization and editing of EHR extracts are also document-based tasks well suited to NoSQL database systems. However, the appropriate database solution depends strongly on each particular situation and specific problem.
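
A minimal sketch of the document-based persistence path evaluated here: a greatly simplified, hypothetical 13606-style XML extract is parsed and stored as a MongoDB document that can then be queried by archetype; the real extract schema and query set are far richer.

    import xml.etree.ElementTree as ET
    from pymongo import MongoClient

    # Toy stand-in for an ISO/EN 13606 EHR extract; the real XML schema is far more detailed.
    xml_extract = """
    <EHR_Extract subject="patient-77">
      <entry archetype="blood_pressure"><systolic>142</systolic><diastolic>91</diastolic></entry>
    </EHR_Extract>"""

    root = ET.fromstring(xml_extract)
    doc = {
        "subject": root.get("subject"),
        "entries": [
            {"archetype": e.get("archetype"),
             "values": {child.tag: child.text for child in e}}
            for e in root.findall("entry")
        ],
    }

    db = MongoClient("mongodb://localhost:27017")["ehr"]
    db.extracts.insert_one(doc)                                          # document-based persistence
    hits = db.extracts.find({"entries.archetype": "blood_pressure"})     # query by archetype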


Subjects
Database Management Systems/standards; Electronic Health Records/standards; Information Storage and Retrieval/standards; Algorithms; Databases, Factual; Reference Standards
13.
Sensors (Basel) ; 17(5)2017 Apr 27.
Article in English | MEDLINE | ID: mdl-28448469

ABSTRACT

The development of the Internet of Things (IoT) is closely related to a considerable increase in the number and variety of devices connected to the Internet. Sensors have become a regular component of our environment, as have smartphones and other devices that continuously collect data about our lives, even without our intervention. With such connected devices, a broad range of applications has been developed and deployed, including those dealing with massive volumes of data. In this paper, we introduce a Distributed Data Service (DDS) to collect and process data for IoT environments. One central goal of this DDS is to enable multiple, distinct IoT middleware systems to share common data services from a loosely coupled provider. In this context, we propose a new specification of functionalities for a DDS and conceive the corresponding techniques for collecting, filtering and storing data conveniently and efficiently in this environment. Another contribution is a data aggregation component that supports efficient real-time data querying. To validate its data collection and querying functionalities and performance, the proposed DDS is evaluated in two case studies regarding a simulated smart home system: the first devoted to evaluating data collection and aggregation when the DDS interacts with the UIoT middleware, and the second aimed at comparing the DDS data collection with the same functionality implemented within the Kaa middleware.

14.
J Biomed Inform ; 64: 288-295, 2016 12.
Article in English | MEDLINE | ID: mdl-27810480

ABSTRACT

While the adoption of next-generation sequencing has expanded rapidly, the informatics infrastructure used to manage the data generated by this technology has not kept pace. Historically, relational databases have provided much of the framework for data storage and retrieval. Newer technologies based on NoSQL architectures may provide significant advantages in storage and query efficiency, thereby reducing the cost of data management, but their relative advantage when applied to biomedical data sets, such as genetic data, has not been characterized. To this end, we compared the storage, indexing, and query efficiency of a common relational database (MySQL), a document-oriented NoSQL database (MongoDB), and a relational database with NoSQL support (PostgreSQL). When used to store genomic annotations from the dbSNP database, the NoSQL architectures outperformed the traditional relational model in speed of data storage, indexing, and query retrieval in nearly every operation. These findings strongly support the use of novel database technologies to improve the efficiency of data management within the biological sciences.


Subjects
Database Management Systems; Genomics; High-Throughput Nucleotide Sequencing; Information Storage and Retrieval; Databases, Genetic; Humans
15.
Sensors (Basel) ; 16(12)2016 Dec 14.
Article in English | MEDLINE | ID: mdl-27983654

ABSTRACT

In the future, with the advent of the smart factory era, manufacturing and logistics processes will become more complex, and the complexity and criticality of traceability will increase further. This research aims at developing a performance assessment method to verify scalability when implementing traceability systems based on key technologies for smart factories, such as the Internet of Things (IoT) and Big Data. To this end, based on existing research, we analyzed traceability requirements and an event schema for storing traceability data in MongoDB, a document-based Not Only SQL (NoSQL) database. Next, we analyzed the algorithm of the most representative traceability query and defined a query-level performance model composed of response times for the components of the traceability query algorithm. This performance model was then refined into a linear regression model, since a benchmark test showed that the response times increase linearly. Finally, as a case analysis, we applied the performance model to a virtual automobile-parts logistics scenario. As a result of the case study, we verified the scalability of a MongoDB-based traceability system and predicted the point at which data node servers should be expanded in this case. The traceability system performance assessment method proposed in this research can be used as a decision-making tool for hardware capacity planning during the initial stage of construction of traceability systems and during their operational phase.
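
A minimal sketch of the query-level performance model as a fitted linear regression, with made-up benchmark numbers and an assumed response-time requirement; it only illustrates how an expansion point could be read off the fitted line.

    import numpy as np

    # Hypothetical benchmark results: traceability-query response time (ms) measured
    # at growing event-collection sizes (millions of MongoDB documents).
    sizes_m = np.array([1, 2, 4, 8, 16])
    resp_ms = np.array([120, 230, 460, 930, 1850])

    slope, intercept = np.polyfit(sizes_m, resp_ms, 1)   # query-level linear performance model

    sla_ms = 3000                                        # assumed response-time requirement
    limit_m = (sla_ms - intercept) / slope
    print(f"model: t = {slope:.1f}*size + {intercept:.1f}; "
          f"expand data nodes beyond ~{limit_m:.1f} M events")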

16.
J Digit Imaging ; 29(6): 716-729, 2016 12.
Article in English | MEDLINE | ID: mdl-27440183

ABSTRACT

Lung cancer is the leading cause of cancer-related deaths in the world, and its main manifestation is pulmonary nodules. Detection and classification of pulmonary nodules are challenging tasks that must be done by qualified specialists, and image interpretation errors make those tasks difficult. To aid radiologists in these hard tasks, it is important to integrate computer-based tools with the lesion detection, pathology diagnosis, and image interpretation processes. However, computer-aided diagnosis research faces the problem of not having enough shared medical reference data for the development, testing, and evaluation of computational diagnostic methods. To mitigate this problem, this paper presents a public, non-relational, document-oriented, cloud-based database of pulmonary nodules characterized by 3D texture attributes, identified by experienced radiologists and classified by the same specialists according to nine subjective characteristics. Our goal in developing this database is to improve research on computer-aided lung cancer diagnosis and on pulmonary nodule detection and classification by deploying the database in a cloud Database-as-a-Service framework. Pulmonary nodule data were provided by the Lung Image Database Consortium and Image Database Resource Initiative (LIDC-IDRI), image descriptors were acquired by volumetric texture analysis, and the database schema was developed using a document-oriented Not only Structured Query Language (NoSQL) approach. The proposed database currently holds 379 exams, 838 nodules, and 8237 images (4029 CT scans and 4208 manually segmented nodules) and is hosted in a MongoDB instance on a cloud infrastructure.


Subjects
Cloud Computing; Databases, Factual; Diagnosis, Computer-Assisted; Lung Neoplasms/diagnostic imaging; Solitary Pulmonary Nodule/diagnostic imaging; Humans; Radiographic Image Interpretation, Computer-Assisted; Reproducibility of Results; Tomography, X-Ray Computed
17.
Phytochem Anal ; 25(6): 495-507, 2014.
Article in English | MEDLINE | ID: mdl-24737485

ABSTRACT

INTRODUCTION: Sharing traditional knowledge with the scientific community could refine scientific approaches to phytochemical investigation and to the conservation of ethnomedicinal plants. As such, there is a great need to integrate traditional knowledge with scientific data on a single sharing platform. However, ethnomedicinal data are available in heterogeneous formats, which depend on cultural aspects, survey methodology and the focus of the study. Phytochemical and bioassay data are also available from many open sources in various standard and customised formats. OBJECTIVE: To design a flexible data model that could integrate both primary and curated ethnomedicinal plant data from multiple sources. MATERIALS AND METHODS: The current model is based on MongoDB, one of the Not only Structured Query Language (NoSQL) databases. Although it does not enforce a schema, modifications were made so that the model could incorporate both standard and customised ethnomedicinal plant data formats from different sources. RESULTS: The model presented can integrate both primary and secondary data related to ethnomedicinal plants. Disparate data are accommodated through a feature of this database that supports a different set of fields for each document; it also allows storage of similar data having different properties. CONCLUSION: The model presented is scalable to a highly complex level with continuing maturation of the database, and is applicable for storing, retrieving and sharing ethnomedicinal plant data. It can also serve as a flexible alternative to a relational, normalised database.


Subjects
Database Management Systems; Databases, Factual; Medicine, Traditional; Models, Statistical; Plants, Medicinal; Data Interpretation, Statistical; Information Storage and Retrieval; Integrated Advanced Information Management Systems
18.
Data Brief ; 54: 110289, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38586142

ABSTRACT

We present the 'NoSQL Injection Dataset for MongoDB', a comprehensive collection of data obtained from diverse projects focusing on NoSQL attacks on MongoDB databases. Databases today can be classified into three main types: structured, semi-structured, and unstructured. While structured databases have played a prominent role in the past, unstructured databases like MongoDB are currently experiencing remarkable growth, and the vulnerabilities associated with these databases are increasing accordingly. We have therefore gathered a comprehensive dataset comprising 400 NoSQL injection commands, segregated into two categories: 221 malicious commands and 179 benign commands. The dataset was meticulously curated by combining manually authored commands with commands acquired through web scraping from reputable sources, and it includes a blend of complex and simple commands that have been enhanced. The collection serves as a valuable resource for studying and analysing NoSQL injection vulnerabilities, offering insights into potential security threats and aiding the development of robust protection mechanisms against such attacks. It is well suited to machine learning and data analysis: security professionals can use it to train or fine-tune AI models or LLMs to achieve higher attack-detection accuracy, and security enthusiasts can augment it to generate more NoSQL commands and build robust security tools.
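
For illustration, the sketch below shows the flavour of a benign versus a malicious (operator-injecting) MongoDB login filter, together with a simple guard that rejects operator-bearing input; the examples are written for this summary and are not records from the dataset.

    # Illustrative examples in the spirit of the dataset (not copied from it).
    benign = {"username": "alice", "password": "s3cret"}
    malicious = {"username": {"$ne": None}, "password": {"$ne": None}}   # classic auth bypass

    def contains_operator(value):
        """Flag values that smuggle MongoDB query operators ($ne, $gt, $where, ...)."""
        if isinstance(value, dict):
            return any(k.startswith("$") or contains_operator(v) for k, v in value.items())
        if isinstance(value, list):
            return any(contains_operator(v) for v in value)
        return False

    def sanitize_login(params):
        if any(contains_operator(v) for v in params.values()):
            raise ValueError("possible NoSQL injection")
        # Force scalars so untrusted input can never be interpreted as an operator document.
        return {"username": str(params["username"]), "password": str(params["password"])}

    print(sanitize_login(benign))
    print(contains_operator(malicious["username"]))   # True -> would be rejected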

19.
Article in English, Spanish | MEDLINE | ID: mdl-38802055

ABSTRACT

BACKGROUND AND OBJECTIVE: The objective is to develop a model that predicts vital status six months after fracture as accurately as possible. For this purpose we use five different data sources obtained through the National Hip Fracture Registry, the Health Management Unit and the Economic Management Department. MATERIAL AND METHODS: The study population is a cohort of patients over 74 years of age who suffered a hip fracture between May 2020 and December 2022. A warehouse is created from the five data sources with the necessary variables. An analysis of missing values and outliers, as well as of the unbalanced classes of the target variable («vital status»), is performed. Fourteen different algorithmic models are trained on the training data. The best-performing model is selected and fine-tuned, and its performance is then analyzed on the test data. RESULTS: A data warehouse is created with 502 patients and 144 variables. The best-performing model is Linear Regression. Sixteen of the 24 deceased patients are classified as alive, and 14 living patients are classified as deceased. A sensitivity of 31%, an accuracy of 34% and an area under the curve of 0.65 are achieved. CONCLUSIONS: We were not able to generate a model for the prediction of six-month survival in the current cohort. However, we believe that the method used for generating machine-learning-based algorithms can serve as a reference for future work.
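
A generic sketch of the kind of workflow described (imbalanced binary outcome, sensitivity and AUC as metrics), run on synthetic data and with logistic regression as a stand-in model; it does not reproduce the study's variables, algorithms or results.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import recall_score, roc_auc_score
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for the registry warehouse: 502 patients, imbalanced vital status.
    X, y = make_classification(n_samples=502, n_features=30, weights=[0.9, 0.1], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.25, random_state=0)

    # class_weight="balanced" is one common answer to unbalanced target classes.
    clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

    y_pred = clf.predict(X_te)
    y_prob = clf.predict_proba(X_te)[:, 1]
    print("sensitivity:", recall_score(y_te, y_pred))
    print("AUC:", roc_auc_score(y_te, y_prob))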

20.
IEEE Trans Emerg Top Comput ; 11(1): 208-223, 2023.
Article in English | MEDLINE | ID: mdl-37274839

ABSTRACT

NoSQL databases are increasingly used for the efficient management of high volumes of unstructured data in applications such as information retrieval, natural language processing, and social computing. However, unlike traditional databases, data protection measures such as access control for these databases are still in their infancy, which could lead to significant vulnerabilities and security/privacy issues as their adoption grows. Attribute-Based Access Control (ABAC), which provides a flexible and dynamic solution to access control, can be effective for mediating accesses in typical usage scenarios for NoSQL databases. In this paper, we propose a novel methodology for enabling ABAC in NoSQL databases; specifically, we consider MongoDB, one of the most popular NoSQL databases in use today. We present an approach both to specify ABAC access control policies and to enforce them when an actual access request is made, using the MongoDB Wire Protocol to extract and process the relevant information from the requests. We also present a method for supporting dynamic access decisions using environmental attributes and for handling ad hoc access requests through digitally signed user attributes. Results from an extensive set of experiments on the Enron corpus as well as on synthetically generated data demonstrate the scalability of our approach. Finally, we provide details of our implementation on MongoDB and share a GitHub repository so that any organization can download and deploy it to enable ABAC in their own MongoDB installations.
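
A highly simplified sketch of ABAC-style mediation in front of MongoDB: a policy decision over subject, resource and environmental attributes gates a pymongo query. The policy, attributes and gating function are invented for illustration and do not reflect the paper's wire-protocol-based enforcement.

    from datetime import datetime
    from pymongo import MongoClient

    # Hypothetical ABAC policy: clinicians may read records of their own department,
    # and only during working hours (an environmental attribute).
    POLICY = {"action": "find", "collection": "records",
              "subject_role": "clinician", "hours": range(8, 18)}

    def is_permitted(subject, resource, environment, action):
        return (action == POLICY["action"]
                and resource["collection"] == POLICY["collection"]
                and subject["role"] == POLICY["subject_role"]
                and subject["department"] == resource["department"]
                and environment["hour"] in POLICY["hours"])

    def guarded_find(db, subject, department, query):
        env = {"hour": datetime.now().hour}
        resource = {"collection": "records", "department": department}
        if not is_permitted(subject, resource, env, "find"):
            raise PermissionError("ABAC policy denies this request")
        return db.records.find({**query, "department": department})

    db = MongoClient("mongodb://localhost:27017")["hospital"]
    docs = guarded_find(db, {"role": "clinician", "department": "cardiology"},
                        "cardiology", {"status": "open"})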
