Results 1 - 20 of 89
1.
J Synchrotron Radiat ; 31(Pt 3): 635-645, 2024 May 01.
Article in English | MEDLINE | ID: mdl-38656774

ABSTRACT

With the development of synchrotron radiation sources and high-frame-rate detectors, the amount of experimental data collected at synchrotron radiation beamlines has increased exponentially. As a result, data processing for synchrotron radiation experiments has entered the era of big data, and it is becoming increasingly important for beamlines to be able to process large-scale data in parallel to keep up with this rapid growth. Currently, there is no data processing solution for beamlines based on a big data technology framework. Apache Hadoop is a widely used distributed system architecture for massive data storage and computation. This paper presents a distributed data processing scheme for beamline experimental data built on Hadoop. The Hadoop Distributed File System is used as the distributed file storage system, and Hadoop YARN serves as the resource scheduler for the distributed computing cluster. A distributed data processing pipeline that can carry out massively parallel computation is designed and developed using Spark on the Hadoop cluster. The entire data processing platform adopts a distributed microservice architecture, which makes the system easy to expand, reduces module coupling and improves reliability.
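A minimal PySpark sketch of the kind of Spark-on-YARN pipeline the abstract describes, reading detector frames from HDFS and reducing them in parallel; the HDFS paths and the process_frame() reduction are hypothetical placeholders, not code from the paper:

```python
# Sketch only: paths and the per-frame reduction are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("beamline-pipeline")
         .master("yarn")                  # YARN schedules the cluster resources
         .getOrCreate())
sc = spark.sparkContext

def process_frame(path_and_bytes):
    """Placeholder per-frame reduction (e.g. integrate a 2D detector image)."""
    path, raw = path_and_bytes
    return (path, len(raw))               # stand-in for a real reduction result

# binaryFiles() yields (path, bytes) pairs read from HDFS.
frames = sc.binaryFiles("hdfs:///beamline/run_001/*.raw")
frames.map(process_frame).saveAsTextFile("hdfs:///beamline/run_001_reduced")

spark.stop()
```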

2.
Educ Inf Technol (Dordr) ; : 1-26, 2023 Jan 24.
Article in English | MEDLINE | ID: mdl-36714440

ABSTRACT

This article discusses the key elements of the Data Science Technology course offered to postgraduate students enrolled in the Master of Data Science program. This course complements the existing curriculum by providing the skills to handle the Big Data platform and tools, in addition to data science activities. We frame the discussion of this course around three main requirements: the need to develop key skills along two dimensions, namely Data Science and Big Data, and the need for a cluster-based computing platform and for its accessibility. We address these requirements by presenting the course design and its assessments, the configuration of the computing platform, and the strategy for enabling flexible accessibility. In terms of course design, the course introduces several innovative elements and covers multiple key areas of the data science body of knowledge and multiple quadrants of the job and skills matrix. For the computing platform, a stable deployment of a Hadoop cluster with flexible accessibility, prompted by the pandemic, has been established. Furthermore, our experience operating the cluster shows that it can handle computing problems with datasets larger than those used during the semesters within the scope of the study. We also provide some reflections and highlight future improvements.

3.
Am J Hum Genet ; 105(5): 974-986, 2019 11 07.
Article in English | MEDLINE | ID: mdl-31668702

ABSTRACT

The advent of inexpensive, clinical exome sequencing (ES) has led to the accumulation of genetic data from thousands of samples from individuals affected with a wide range of diseases, but for whom the underlying genetic and molecular etiology of their clinical phenotype remains unknown. In many cases, detailed phenotypes are unavailable or poorly recorded and there is little family history to guide study. To accelerate discovery, we integrated ES data from 18,696 individuals referred for suspected Mendelian disease, together with relatives, in an Apache Hadoop data lake (Hadoop Architecture Lake of Exomes [HARLEE]) and implemented a genocentric analysis that rapidly identified 154 genes harboring variants suspected to cause Mendelian disorders. The approach did not rely on case-specific phenotypic classifications but was driven by optimization of gene- and variant-level filter parameters utilizing historical Mendelian disease-gene association discovery data. Variants in 19 of the 154 candidate genes were subsequently reported as causative of a Mendelian trait and additional data support the association of all other candidate genes with disease endpoints.
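A minimal Spark SQL sketch in the spirit of the genocentric filtering described above; the Parquet layout, column names (gene, allele_freq, impact, sample_id) and thresholds are hypothetical, not the HARLEE schema or the paper's optimized filter parameters:

```python
# Sketch only: the Parquet layout, column names and thresholds are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("genocentric-filter").getOrCreate()

variants = spark.read.parquet("hdfs:///exomes/variants.parquet")

rare_damaging = variants.filter(
    (F.col("allele_freq") < 1e-4) & F.col("impact").isin("missense", "lof")
)

gene_burden = (rare_damaging
               .groupBy("gene")
               .agg(F.countDistinct("sample_id").alias("n_carriers"))
               .filter(F.col("n_carriers") >= 5)    # illustrative gene-level cutoff
               .orderBy(F.desc("n_carriers")))

gene_burden.show(20)
```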


Subject(s)
Genetic Diseases, Inborn/genetics; Genetic Predisposition to Disease/genetics; Genetic Variation/genetics; Databases, Genetic; Exome/genetics; Genomics/methods; Humans; Pedigree; Phenotype; Exome Sequencing/methods
4.
Brief Bioinform ; 21(1): 96-105, 2020 Jan 17.
Article in English | MEDLINE | ID: mdl-30462158

ABSTRACT

The paper reviews the use of the Hadoop platform in structural bioinformatics applications. For structural bioinformatics, Hadoop provides a new framework for analysing large fractions of the Protein Data Bank that is key for high-throughput studies of, for example, protein-ligand docking, clustering of protein-ligand complexes and structural alignment. Specifically, we review a number of Hadoop-based implementations of high-throughput analyses reported in the literature, together with their scalability. We find that these deployments for the most part call known executables from MapReduce rather than rewriting the algorithms. Reported scalability is variable, and direct comparisons of Hadoop with other batch schedulers on the same platform are generally absent from the literature, although there is some evidence that Message Passing Interface implementations scale better than Hadoop. A significant barrier to the use of the Hadoop ecosystem is the difficulty of its interfaces and of configuring a computing resource to run Hadoop. This will improve over time as interfaces to Hadoop such as Spark mature, usage of cloud platforms (e.g. Azure and Amazon Web Services (AWS)) increases, and standardised approaches such as workflow languages (e.g. Workflow Definition Language, Common Workflow Language and Nextflow) are taken up.
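A brief sketch of the "wrap an existing executable" pattern the review describes, here expressed with Spark's RDD.pipe(); the dock_tool command and input listing are hypothetical stand-ins for a docking or alignment binary:

```python
# Sketch only: dock_tool is a hypothetical stand-in for an existing binary
# that reads one task per line on stdin and writes one result per line on stdout.
from pyspark import SparkContext

sc = SparkContext(appName="pipe-executable")

tasks = sc.textFile("hdfs:///pdb/task_list.txt")          # one input record per line
scores = tasks.pipe("dock_tool --receptor target.pdbqt")  # run the executable per partition
scores.saveAsTextFile("hdfs:///pdb/docking_scores")

sc.stop()
```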

5.
Sensors (Basel) ; 22(20)2022 Oct 12.
Article in English | MEDLINE | ID: mdl-36298077

ABSTRACT

In recent years, anomaly detection and machine learning for intrusion detection systems have been used to detect anomalies on Internet of Things networks. These systems rely on machine and deep learning to improve detection accuracy. However, the robustness of the model depends on the number of data samples available, the quality of the data, and the distribution of the data classes. In the present paper, we focus specifically on the amount of data and on class imbalance, since both parameters are key in IoT, where network traffic is increasing exponentially. For this reason, we propose a framework that uses a big data methodology with Hadoop-Spark to train and test multi-class and binary classification with a one-vs-rest strategy for intrusion detection, using the entire BoT IoT dataset. We evaluate all the algorithms available in Hadoop-Spark in terms of accuracy and processing time. In addition, since the BoT IoT dataset is highly imbalanced, we also improve the accuracy for detecting minority classes by generating more data samples using a Conditional Tabular Generative Adversarial Network (CTGAN). In general, our proposed model outperforms other published models, including our previous model. Using our proposed methodology, the F1-score of one of the minority classes, the Theft attack, was improved from 42% to 99%.
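A minimal Spark ML sketch of a one-vs-rest multi-class setup of the kind described; the CSV path, feature columns and label column are hypothetical, and the CTGAN resampling step is not shown:

```python
# Sketch only: the CSV path, feature columns and label column are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression, OneVsRest
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("bot-iot-ovr").getOrCreate()

df = spark.read.csv("hdfs:///botiot/flows.csv", header=True, inferSchema=True)
feature_cols = [c for c in df.columns if c != "label"]
df = VectorAssembler(inputCols=feature_cols, outputCol="features").transform(df)

train, test = df.randomSplit([0.8, 0.2], seed=42)

ovr = OneVsRest(classifier=LogisticRegression(featuresCol="features", labelCol="label"),
                featuresCol="features", labelCol="label")
model = ovr.fit(train)

pred = model.transform(test)
f1 = MulticlassClassificationEvaluator(labelCol="label", metricName="f1").evaluate(pred)
print(f"F1 score: {f1:.3f}")
```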

6.
Cluster Comput ; 25(2): 1355-1372, 2022.
Article in English | MEDLINE | ID: mdl-35068996

ABSTRACT

Distributed denial of service (DDoS) is an immense threat to Internet-based applications and their resources. It floods the victim system with a large number of network packets, making the victim's resources unavailable to legitimate users and thus posing a serious danger to Internet-based applications. Several security approaches have been proposed in the literature to protect Internet-based applications from this type of threat; however, the frequency and strength of DDoS attacks are increasing day by day. Further, most DDoS attack detection systems based on traditional or distributed processing frameworks analyze network flows in offline batch mode and therefore cannot classify them in real time. This paper proposes a novel Spark Streaming and Kafka-based distributed classification system, named SSK-DDoS, for classifying different types of DDoS attacks and legitimate network flows. The classification approach is implemented using distributed Spark MLlib machine learning algorithms on a Hadoop cluster and deployed on the Spark Streaming platform to classify streams in real time. The incoming streams are consumed from a Kafka topic and preprocessed, with features extracted and formulated, before being classified into seven groups: Benign, DDoS-DNS, DDoS-LDAP, DDoS-MSSQL, DDoS-NetBIOS, DDoS-UDP, and DDoS-SYN. The SSK-DDoS system stores the formulated features together with their predicted class in HDFS, which helps to retrain the distributed classifier on a new set of samples. The proposed SSK-DDoS classification system has been validated using the recent CICDDoS2019 dataset. The results show that SSK-DDoS efficiently classified network flows into the seven classes and stored the formulated features with the predicted class of each incoming network flow in HDFS.
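A minimal sketch of a Kafka-to-Spark streaming classification stage in the spirit of SSK-DDoS; it uses Structured Streaming for brevity, and the broker, topic name, message layout and pre-trained model path are hypothetical:

```python
# Sketch only: broker, topic, schema and model path are hypothetical; the saved
# pipeline is assumed to contain its own feature-preparation stages.
from pyspark.sql import SparkSession, functions as F
from pyspark.ml import PipelineModel

spark = SparkSession.builder.appName("ssk-ddos-stream").getOrCreate()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")
       .option("subscribe", "netflows")
       .load())

# Assume comma-separated flow features in the Kafka message value.
flows = raw.select(F.split(F.col("value").cast("string"), ",").alias("fields"))

model = PipelineModel.load("hdfs:///models/ddos_classifier")   # trained offline
scored = model.transform(flows)

# Persist features plus predicted class to HDFS for later retraining.
(scored.writeStream
 .format("parquet")
 .option("path", "hdfs:///ddos/scored")
 .option("checkpointLocation", "hdfs:///ddos/checkpoints")
 .start()
 .awaitTermination())
```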

7.
BMC Bioinformatics ; 22(1): 144, 2021 Mar 22.
Article in English | MEDLINE | ID: mdl-33752596

ABSTRACT

BACKGROUND: Storage of genomic data is a major cost for the Life Sciences, effectively addressed via specialized data compression methods. For the same reasons of abundance in data production, the use of Big Data technologies is seen as the future for genomic data storage and processing, with MapReduce-Hadoop as the leader. Somewhat surprisingly, none of the specialized FASTA/Q compressors is available within Hadoop, and their deployment there is not exactly immediate. Such a state of the art is problematic. RESULTS: We provide major advances in two different directions. Methodologically, we propose two general methods, with the corresponding software, that make it very easy to deploy a specialized FASTA/Q compressor within MapReduce-Hadoop for processing files stored on the distributed Hadoop File System, with very little knowledge of Hadoop. Practically, we provide evidence that the deployment of those specialized compressors within Hadoop, not available so far, results in better space savings, and even in better execution times over compressed data, compared with the generic compressors available in Hadoop, in particular for FASTQ files. Finally, we observe that these results also hold for the Apache Spark framework, when used to process FASTA/Q files stored on the Hadoop File System. CONCLUSIONS: Our methods and the corresponding software contribute substantially to achieving space and time savings for the storage and processing of FASTA/Q files in Hadoop and Spark. Since our approach is general, it can very likely also be applied to FASTA/Q compression methods that will appear in the future. AVAILABILITY: The software and the datasets are available at https://github.com/fpalini/fastdoopc.


Subject(s)
Data Compression; Genomics; Software; Algorithms; Big Data
8.
Sensors (Basel) ; 21(11)2021 May 31.
Article in English | MEDLINE | ID: mdl-34072632

ABSTRACT

Hadoop MapReduce reactively detects and recovers from faults after they occur, using static heartbeat detection and re-execution from scratch. However, these techniques lead to excessive response time penalties and inefficient resource consumption during detection and recovery. Existing fault-tolerance solutions aim to mitigate these limitations without considering critical conditions such as fail-slow faults, the impact of faults at various infrastructure levels, and the relationship between the detection and recovery stages. This paper analyses the response time under two main fault conditions, fail-stop and fail-slow, when they manifest at the node, service, and task levels at runtime. In addition, we focus on the relationship between the time to detect and the time to recover from faults. The experimental analysis is conducted on a real Hadoop cluster comprising the MapReduce, YARN and HDFS frameworks. Our analysis shows that recovery from a single fault leads to an average response time penalty of 67.6%. Even when the detection and recovery times are well tuned, data locality and resource availability must also be considered to obtain the optimum tolerance time and the lowest penalties.

9.
Sensors (Basel) ; 21(23)2021 Dec 05.
Article in English | MEDLINE | ID: mdl-34884135

ABSTRACT

Geospatial three-dimensional (3D) raster data have been widely used for simple representations and analysis, such as geological models, spatio-temporal satellite data, hyperspectral images, and climate data. With increasing requirements for resolution and accuracy, the amount of geospatial 3D raster data has grown exponentially. In recent years, the processing of large raster data using Hadoop has gained popularity. However, data uploaded to Hadoop are randomly distributed onto datanodes without consideration of their spatial characteristics. As a result, directly processing geospatial 3D raster data produces massive network data exchange among the datanodes and degrades the performance of the cluster. To address this problem, we propose an efficient group-based replica placement policy for large-scale geospatial 3D raster data, aiming to optimize the locations of the replicas in the cluster to reduce network overhead. An overlapped group scheme was designed for the three replicas of each file. The data in each group were placed on the same datanode, and different colocation patterns for the three replicas were implemented to further reduce communication between groups. The experimental results show that our approach significantly reduces the network overhead during data acquisition for 3D raster data in the Hadoop cluster, while maintaining the Hadoop replica placement requirements.
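A pure-Python sketch of the grouping idea only: blocks of the same spatial group go to one datanode, and each of the three replicas uses a different node assignment; group size, node names and the shift pattern are illustrative, and a real implementation would be an HDFS block placement policy plugin rather than this standalone code:

```python
# Sketch of the grouping idea only; group size, node names and the replica shift
# pattern are illustrative, not the paper's policy or an actual HDFS plugin.
from typing import List, Tuple

DATANODES = [f"dn{i:02d}" for i in range(8)]
GROUP = (4, 4, 4)            # blocks per group along x, y, z

def group_id(block_xyz: Tuple[int, int, int]) -> Tuple[int, int, int]:
    """Map a block's 3D index to its spatial group index."""
    x, y, z = block_xyz
    return (x // GROUP[0], y // GROUP[1], z // GROUP[2])

def replica_nodes(block_xyz: Tuple[int, int, int]) -> List[str]:
    """All blocks of one group land on the same node for a given replica;
    each of the three replicas uses a different node offset."""
    base = hash(group_id(block_xyz)) % len(DATANODES)
    return [DATANODES[(base + shift) % len(DATANODES)] for shift in (0, 3, 5)]

print(replica_nodes((5, 2, 9)))    # blocks (4..7, 0..3, 8..11) share these nodes
```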

10.
Sensors (Basel) ; 20(7)2020 Mar 30.
Article in English | MEDLINE | ID: mdl-32235657

ABSTRACT

Nowadays, the increasing number of patients, together with the emergence of new symptoms and diseases, makes health monitoring and assessment a complicated task for medical staff and hospitals. Indeed, processing the big, heterogeneous data collected by biomedical sensors, along with the need for patient classification and disease diagnosis, has become a major challenge for several health-based sensing applications. Thus, the combination of remote sensing devices and big data technologies has proven to be an efficient and low-cost solution for healthcare applications. In this paper, we propose a robust big data analytics platform for real-time patient monitoring and decision making to help both hospitals and medical staff. The proposed platform relies on big data technologies and data analysis techniques and consists of four layers: real-time patient monitoring, real-time decision and data storage, patient classification and disease diagnosis, and data retrieval and visualization. To evaluate its performance, we implemented the platform on the Hadoop ecosystem and applied the proposed algorithms to real health data. The results show that the platform efficiently performs patient classification and disease diagnosis in healthcare applications.


Subject(s)
Biosensing Techniques; Diagnostic Techniques and Procedures; Monitoring, Physiologic; Remote Sensing Technology; Algorithms; Big Data; Decision Making; Delivery of Health Care; Humans; Information Storage and Retrieval; Patients/classification
11.
Biom J ; 62(2): 270-281, 2020 03.
Article in English | MEDLINE | ID: mdl-31515855

ABSTRACT

The advent of the big data age has changed the landscape for statisticians. Public and private organizations alike these days are interested in capturing and analyzing complex customer data in order to improve their service and drive efficiency gains. However, the large volume of data involved often means that standard statistical methods fail and new ways of thinking are needed. Although great gains can be obtained through the use of more advanced computing environments or through developing sophisticated new statistical algorithms that handle data in a more efficient way, there are also many simpler things that can be done to handle large data sets in an efficient and intuitive manner. These include the use of distributed analysis methodologies, clever subsampling, data coarsening, and clever data reductions that exploit concepts such as sufficiency. These kinds of strategies represent exciting opportunities for statisticians to remain front and center in the data science world.
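A small Python sketch of one of the simple strategies mentioned, a sufficiency-based data reduction: each chunk is summarised by (n, sum, sum of squares) and the summaries are combined to recover the overall mean and variance without holding the full data in memory; the simulated chunks are illustrative:

```python
# Each chunk is reduced to the sufficient statistics (n, sum, sum of squares);
# the small summaries are then combined to recover the overall mean and variance.
import random

def chunk_summary(xs):
    return (len(xs), sum(xs), sum(x * x for x in xs))

def combine(s1, s2):
    return tuple(a + b for a, b in zip(s1, s2))

def mean_var(summary):
    n, s, ss = summary
    mean = s / n
    return mean, ss / n - mean ** 2      # population variance from sufficient statistics

# Simulated "distributed" chunks that never need to be concatenated.
chunks = [[random.gauss(10, 2) for _ in range(1000)] for _ in range(4)]
total = (0, 0.0, 0.0)
for c in chunks:
    total = combine(total, chunk_summary(c))

print(mean_var(total))                   # close to (10, 4) for this simulated data
```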


Subject(s)
Biometry/methods; Algorithms; Software; Time Factors
12.
BMC Bioinformatics ; 20(1): 76, 2019 Feb 14.
Article in English | MEDLINE | ID: mdl-30764760

ABSTRACT

BACKGROUND: The advance of next-generation sequencing enables higher throughput at a lower price, and as the basis of high-throughput sequencing data analysis, variant calling is widely used in disease research, clinical treatment and medical research. However, current mainstream variant calling tools suffer from computation bottlenecks, resulting in long-tail tasks when run on large datasets. This prevents high scalability on multi-node, multi-core clusters and leads to long runtimes and inefficient use of computing resources. Thus, a highly scalable tool that can run in a distributed environment would be very useful for accelerating variant calling on large-scale genome data. RESULTS: In this paper, we present ADS-HCSpark, a scalable variant calling tool based on the Apache Spark framework. ADS-HCSpark accelerates variant calling by parallelizing the mainstream GATK HaplotypeCaller algorithm across multiple cores and nodes. To address the computation skew in HaplotypeCaller, a parallel strategy of adaptive data segmentation is proposed and a variant calling algorithm based on adaptive data segmentation is implemented, achieving good scalability on both single and multiple nodes. To meet the requirement that adjacent data blocks have overlapping boundaries, the Hadoop-BAM library is customized to partition BAM files into overlapped blocks, further improving the accuracy of variant calling. CONCLUSIONS: ADS-HCSpark is a scalable variant calling tool based on the Apache Spark framework that parallelizes the GATK HaplotypeCaller algorithm. In the best-performing configuration achievable on our experimental cluster, ADS-HCSpark is 74% faster than GATK3.8 HaplotypeCaller in single-node experiments, and 57% faster than GATK4.0 HaplotypeCallerSpark and 27% faster than SparkGA in multi-node experiments, with better scalability and an accuracy of over 99%. The source code of ADS-HCSpark is publicly available at https://github.com/SCUT-CCNL/ADS-HCSpark.git .
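A small Python sketch of the overlapped-segmentation idea: a coordinate range is cut into blocks whose boundaries overlap so that records spanning a boundary are seen by both neighbours; block size and overlap are illustrative, and ADS-HCSpark itself applies the idea at the BAM-file level through a customized Hadoop-BAM input format:

```python
# Illustrative block size and overlap; ADS-HCSpark applies the idea at the
# BAM-file level through a customized Hadoop-BAM input format.
def overlapped_blocks(start, end, block_size, overlap):
    blocks = []
    pos = start
    while pos < end:
        block_end = min(pos + block_size, end)
        # Extend each block past its nominal end by `overlap` bases (except the last).
        blocks.append((pos, min(block_end + overlap, end)))
        pos = block_end
    return blocks

print(overlapped_blocks(0, 1_000_000, block_size=250_000, overlap=1_000))
# [(0, 251000), (250000, 501000), (500000, 751000), (750000, 1000000)]
```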


Subject(s)
Algorithms; Genetic Variation; Haplotypes/genetics; Software; Databases, Genetic; Genome; High-Throughput Nucleotide Sequencing/methods; Humans; Sequence Analysis, DNA/methods; Time Factors
13.
BMC Genomics ; 20(Suppl 11): 948, 2019 Dec 20.
Article in English | MEDLINE | ID: mdl-31856721

ABSTRACT

BACKGROUND: Long-read sequencing has shown promise in overcoming the short-length limitations of second-generation sequencing by providing more complete assemblies. However, computation on long sequencing reads is challenged by their higher error rates (e.g., 13% vs. 1%) and higher cost ($0.3 vs. $0.03 per Mbp) compared with short reads. METHODS: In this paper, we present a new hybrid error correction tool, called ParLECH (Parallel Long-read Error Correction using Hybrid methodology). The error correction algorithm of ParLECH is distributed in nature and efficiently utilizes the k-mer coverage information of high-throughput Illumina short-read sequences to rectify PacBio long-read sequences. ParLECH first constructs a de Bruijn graph from the short reads and then replaces the indel error regions of the long reads with their corresponding widest path (or maximum min-coverage path) in the short-read-based de Bruijn graph. ParLECH then utilizes the k-mer coverage information of the short reads to divide each long read into a sequence of low- and high-coverage regions, followed by majority voting to rectify each substituted error base. RESULTS: ParLECH outperforms the latest state-of-the-art hybrid error correction methods on real PacBio datasets. Our experimental evaluation demonstrates that ParLECH can correct large-scale real-world datasets in an accurate and scalable manner. ParLECH can correct the indel errors of human genome PacBio long reads (312 GB) with Illumina short reads (452 GB) in less than 29 h using 128 compute nodes. ParLECH can align more than 92% of the bases of an E. coli PacBio dataset to the reference genome, demonstrating its accuracy. CONCLUSION: ParLECH can scale to terabytes of sequencing data using hundreds of computing nodes. The proposed hybrid error correction methodology is novel and rectifies both indel and substitution errors present in the original long reads or newly introduced by the short reads.
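A small Python sketch of the widest-path (maximum min-coverage) selection on a toy graph; the adjacency dictionary and coverage values are illustrative, and the search is a max-heap variant of Dijkstra rather than the paper's distributed implementation:

```python
# Toy graph and coverage values; the search is a max-heap variant of Dijkstra
# that maximises the minimum edge coverage along the path.
import heapq

def widest_path(graph, source, target):
    best = {source: float("inf")}          # best achievable min-coverage so far
    heap = [(-float("inf"), source, [source])]
    while heap:
        neg_width, node, path = heapq.heappop(heap)
        width = -neg_width
        if node == target:
            return width, path
        for nxt, coverage in graph.get(node, {}).items():
            w = min(width, coverage)
            if w > best.get(nxt, 0):
                best[nxt] = w
                heapq.heappush(heap, (-w, nxt, path + [nxt]))
    return 0, []

toy = {
    "AAG": {"AGT": 12, "AGC": 3},
    "AGT": {"GTT": 9},
    "AGC": {"GCT": 20},
    "GCT": {"GTT": 2},
}
print(widest_path(toy, "AAG", "GTT"))      # (9, ['AAG', 'AGT', 'GTT'])
```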


Subject(s)
Algorithms; Computational Biology/methods; High-Throughput Nucleotide Sequencing/methods; Sequence Analysis, DNA/methods; Databases, Genetic; Genome/genetics; Mutation; Software
14.
Sensors (Basel) ; 19(12)2019 Jun 17.
Article in English | MEDLINE | ID: mdl-31212959

ABSTRACT

Underwater sensor networks have wide application prospects, but large-scale sensing node deployment is severely hindered by problems such as energy constraints, long delays, local disconnections, and heavy energy consumption. These problems can be solved effectively by optimizing sensing node deployment with a genetic algorithm. However, the genetic algorithm (GA) needs many iterations to find the best locations for underwater sensor deployment, which results in long running times and limits practical application when dealing with large-scale data. The classical parallel framework Hadoop can improve GA running efficiency to some extent, while the state-of-the-art parallel framework Spark can unlock far more of the GA's parallel potential by performing crossover, mutation, and other operations in parallel on each computing node. Taking full account of the working environment of the underwater sensor network and the characteristics of the sensors, this paper proposes a Spark-based parallel GA to calculate the extremum of the Shubert multi-peak function, from which the optimal deployment of the underwater sensor network can be obtained. Experimental results show that, for a large-scale underwater sensor network, the Spark-based implementation not only significantly reduces running time compared with a single node and with the Hadoop framework, but also effectively avoids premature convergence thanks to its strong randomness.
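A minimal PySpark sketch of a Spark-parallelised GA over the 2D Shubert function; fitness evaluation is distributed, while selection, crossover and mutation run on the driver for brevity (the paper also parallelises those operators), and the population size, rates and bounds are illustrative:

```python
# Sketch only: population size, rates and bounds are illustrative; selection,
# crossover and mutation run on the driver here, unlike the paper's fully
# parallel operators.
import math
import random
from pyspark import SparkContext

def shubert(ind):
    x, y = ind
    sx = sum(i * math.cos((i + 1) * x + i) for i in range(1, 6))
    sy = sum(i * math.cos((i + 1) * y + i) for i in range(1, 6))
    return sx * sy

sc = SparkContext(appName="spark-ga-shubert")
random.seed(0)
pop = [(random.uniform(-10, 10), random.uniform(-10, 10)) for _ in range(2000)]

for gen in range(50):
    scored = sc.parallelize(pop).map(lambda ind: (shubert(ind), ind)).collect()
    scored.sort(key=lambda t: t[0])                    # minimisation
    elite = [ind for _, ind in scored[:len(pop) // 2]]
    children = []
    while len(children) < len(pop) - len(elite):
        (x1, y1), (x2, y2) = random.sample(elite, 2)
        child = ((x1 + x2) / 2, (y1 + y2) / 2)         # arithmetic crossover
        if random.random() < 0.2:                      # Gaussian mutation
            child = (child[0] + random.gauss(0, 0.5), child[1] + random.gauss(0, 0.5))
        children.append(child)
    pop = elite + children

print(min((shubert(ind), ind) for ind in pop))  # Shubert global minimum is about -186.73
sc.stop()
```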

15.
Sensors (Basel) ; 19(1)2019 Jan 02.
Article in English | MEDLINE | ID: mdl-30609759

ABSTRACT

The amount of programmable logic controller (PLC) sensing data in the manufacturing environment has increased rapidly. Therefore, Big Data platforms require large data stores. In this paper, we propose a Hadoop ecosystem to support many features of the manufacturing industry. In this ecosystem, Apache Hadoop and HBase serve as Big Data storage and handle large-scale data. In addition, Apache Kafka is used as the data streaming pipeline; its configuration options, such as offsets and partitions, are used to build a well-designed, reliable system and to scale the programs. Moreover, Apache Spark works closely with the Kafka consumers to process and analyse the data in real time. Meanwhile, data security is applied during transmission between the Kafka producers and consumers using public-key cryptography: the public key is held by the Kafka producer and the private key is stored by the Kafka consumer. Integrating these technologies enhances the performance and accuracy of data storage, processing, and security in the manufacturing environment.
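A producer-side Python sketch of the public-key step described, encrypting a PLC reading with the consumer's RSA public key before publishing it to Kafka; the broker address, topic, key file and message layout are hypothetical, and a real deployment would more likely encrypt a symmetric session key this way, since RSA-OAEP only fits small payloads:

```python
# Sketch only: broker, topic, key file and message layout are hypothetical; a real
# deployment would typically encrypt a symmetric session key rather than each
# message, since RSA-OAEP only fits small payloads.
import json
from kafka import KafkaProducer                       # kafka-python
from cryptography.hazmat.primitives import serialization, hashes
from cryptography.hazmat.primitives.asymmetric import padding

with open("consumer_public.pem", "rb") as f:
    public_key = serialization.load_pem_public_key(f.read())

reading = json.dumps({"plc_id": "plc-07", "temp_c": 71.4, "ts": 1735689600}).encode()

ciphertext = public_key.encrypt(
    reading,
    padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                 algorithm=hashes.SHA256(), label=None),
)

producer = KafkaProducer(bootstrap_servers="broker1:9092")
producer.send("plc-sensor-readings", value=ciphertext)   # consumer decrypts with its private key
producer.flush()
```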

16.
J Med Syst ; 43(8): 239, 2019 Jun 19.
Article in English | MEDLINE | ID: mdl-31218510

ABSTRACT

Data volumes in real-time environments have risen exponentially over time. Predictability helps present and portray relevant data to users according to their needs, while classifying Big Data is usually a tedious and lengthy task. The MapReduce framework performs data processing in parallel by distributing the data in small chunks across the cluster, and it is employed here to process heterogeneous data items. The issues targeted in this paper include associating climatological and meteorological information with a wide variety of farming decisions, which the well-known MapReduce framework helps to resolve. The paper proposes empirical techniques for climate classification and prediction by adopting a Co-Effective and Adaptive Neuro-Fuzzy System (Co-EANFS) approach for data handling. Furthermore, the paper also examines association rule mining, which is applied to identify the best crop production based on soil and weather conditions. Lastly, a technique is proposed for managing the various stages of preprocessing, clustering, classification and prediction. First, the weather dataset is collected and processed; the proposed model is then applied, producing clustered datasets linked to each season. Performance is evaluated using the prediction accuracy of Co-EANFS, formulated with varying numbers of inputs and variables. The proposed framework achieves the lowest execution time.
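A minimal Spark ML sketch of association rule mining over discretised soil/weather/crop records using FPGrowth; the itemised attribute tokens and thresholds are illustrative, and the Co-EANFS classification stage is not reproduced:

```python
# Sketch only: the itemised soil/weather/crop tokens and thresholds are illustrative.
from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.appName("crop-rules").getOrCreate()

records = spark.createDataFrame(
    [
        (0, ["soil=loamy", "rain=high", "temp=moderate", "crop=rice"]),
        (1, ["soil=sandy", "rain=low", "temp=high", "crop=millet"]),
        (2, ["soil=loamy", "rain=high", "temp=moderate", "crop=rice"]),
        (3, ["soil=clay", "rain=moderate", "temp=moderate", "crop=wheat"]),
        (4, ["soil=loamy", "rain=high", "temp=high", "crop=rice"]),
    ],
    ["id", "items"],
)

model = FPGrowth(itemsCol="items", minSupport=0.4, minConfidence=0.6).fit(records)
model.freqItemsets.show(truncate=False)
model.associationRules.show(truncate=False)   # e.g. [soil=loamy, rain=high] => [crop=rice]
```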


Subject(s)
Big Data; Data Management/methods; Software; Algorithms; Fuzzy Logic; Humans; Time Factors; Weather
17.
Curr Genomics ; 19(7): 603-614, 2018 Nov.
Article in English | MEDLINE | ID: mdl-30386172

ABSTRACT

Systems biology problems such as whole-genome network construction from large-scale gene expression data are sophisticated and time-consuming, so sequential algorithms are not feasible for obtaining a solution in an acceptable amount of time. Today, by using massively parallel computing, it is possible to infer large-scale gene regulatory networks. Recently, establishing gene regulatory networks from large-scale datasets has drawn noticeable attention from researchers in the fields of parallel computing and systems biology. In this paper, we attempt to provide a more detailed overview of recent parallel algorithms for constructing gene regulatory networks. Firstly, the fundamentals of gene regulatory network inference and the challenges posed by large-scale datasets are given. Secondly, four parallel frameworks and libraries, namely CUDA, OpenMP, MPI, and Hadoop, are described in detail. Thirdly, parallel algorithms are reviewed. Finally, some conclusions and guidelines for parallel reverse engineering are described.

18.
Sensors (Basel) ; 18(4)2018 Apr 12.
Article in English | MEDLINE | ID: mdl-29649172

ABSTRACT

A smart city implies a consistent use of technology for the benefit of the community. As the city develops over time, components and subsystems such as smart grids, smart water management, smart traffic and transportation systems, smart waste management systems, smart security systems, or e-governance are added. These components ingest and generate a multitude of structured, semi-structured or unstructured data that may be processed using a variety of algorithms in batches, micro batches or in real-time. The ICT architecture must be able to handle the increased storage and processing needs. When vertical scaling is no longer a viable solution, Hadoop can offer efficient linear horizontal scaling, solving storage, processing, and data analyses problems in many ways. This enables architects and developers to choose a stack according to their needs and skill-levels. In this paper, we propose a Hadoop-based architectural stack that can provide the ICT backbone for efficiently managing a smart city. On the one hand, Hadoop, together with Spark and the plethora of NoSQL databases and accompanying Apache projects, is a mature ecosystem. This is one of the reasons why it is an attractive option for a Smart City architecture. On the other hand, it is also very dynamic; things can change very quickly, and many new frameworks, products and options continue to emerge as others decline. To construct an optimized, modern architecture, we discuss and compare various products and engines based on a process that takes into consideration how the products perform and scale, as well as the reusability of the code, innovations, features, and support and interest in online communities.

19.
Sensors (Basel) ; 18(5)2018 May 03.
Article in English | MEDLINE | ID: mdl-29751580

ABSTRACT

Analyzing huge amounts of data becomes essential in the era of Big Data, where databases are populated with hundreds of Gigabytes that must be processed to extract knowledge. Hence, classical algorithms must be adapted towards distributed computing methodologies that leverage the underlying computational power of these platforms. Here, a parallel, scalable, and optimized design for self-organized maps (SOM) is proposed in order to analyze massive data gathered by the spectrophotometric sensor of the European Space Agency (ESA) Gaia spacecraft, although it could be extrapolated to other domains. The performance comparison between the sequential implementation and the distributed ones based on Apache Hadoop and Apache Spark is an important part of the work, as well as the detailed analysis of the proposed optimizations. Finally, a domain-specific visualization tool to explore astronomical SOMs is presented.
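A minimal PySpark sketch of one batch-SOM training epoch: each sample contributes to every map unit with a Gaussian neighbourhood weight centred on its best-matching unit, the weighted sums are reduced on the cluster, and the driver rebuilds the codebook; the map size, random data and neighbourhood radius are illustrative, not the paper's optimized design:

```python
# Sketch only: map size, random data and neighbourhood radius are illustrative.
import numpy as np
from pyspark import SparkContext

MAP_W, MAP_H, DIM = 10, 10, 8            # 10x10 SOM over 8-dimensional vectors
SIGMA = 1.5                              # Gaussian neighbourhood radius

sc = SparkContext(appName="batch-som")
rng = np.random.default_rng(0)
codebook = rng.normal(size=(MAP_W * MAP_H, DIM))
coords = np.array([(i % MAP_W, i // MAP_W) for i in range(MAP_W * MAP_H)], dtype=float)

data = sc.parallelize(rng.normal(size=(5000, DIM)).tolist(), numSlices=8)

def contributions(sample, cb, grid):
    x = np.asarray(sample)
    bmu = int(np.argmin(np.linalg.norm(cb.value - x, axis=1)))     # best-matching unit
    h = np.exp(-np.sum((grid.value - grid.value[bmu]) ** 2, axis=1) / (2 * SIGMA ** 2))
    return [(j, (h[j] * x, h[j])) for j in range(len(h))]

for epoch in range(5):
    cb_b, grid_b = sc.broadcast(codebook), sc.broadcast(coords)
    sums = (data.flatMap(lambda s: contributions(s, cb_b, grid_b))
                .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
                .collectAsMap())
    codebook = np.array([sums[j][0] / sums[j][1] for j in range(MAP_W * MAP_H)])

print("trained codebook shape:", codebook.shape)
sc.stop()
```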

20.
J Med Syst ; 42(4): 59, 2018 Feb 19.
Article in English | MEDLINE | ID: mdl-29460090

ABSTRACT

The huge increase in medical devices and clinical applications that generate enormous amounts of data has raised a major challenge in managing, processing, and mining this massive amount of data. Indeed, traditional data warehousing frameworks cannot cope with the volume, variety, and velocity of current medical applications. As a result, data warehouses face many issues with medical data, and many challenges need to be addressed. New solutions have emerged, and Hadoop is one of the best examples; it can be used to process these streams of medical data. However, without an efficient system design and architecture, its performance will not be significant or valuable to medical managers. In this paper, we provide a short review of the literature on research issues with traditional data warehouses and present some important Hadoop-based data warehouses. In addition, a Hadoop-based architecture and a conceptual data model for designing a medical Big Data warehouse are given. In our case study, we provide implementation details of a big data warehouse based on the proposed architecture and data model on the Apache Hadoop platform, to ensure an optimal allocation of health resources.


Subject(s)
Decision Support Techniques; Health Resources/organization & administration; Information Storage and Retrieval/methods; Algeria; Health Resources/supply & distribution; Humans; Models, Theoretical