Results 1 - 20 of 20
1.
Molecules; 24(6), 2019 Mar 19.
Article in English | MEDLINE | ID: mdl-30893816

ABSTRACT

Drug-drug interaction (DDI) is becoming a serious issue in clinical pharmacy as the use of multiple medications becomes more common. The PubMed database is one of the largest literature resources for DDI studies. It contains over 150,000 journal articles related to DDI and is still expanding at a rapid pace. Extracting DDI-related information, including compounds and proteins, from PubMed is an essential step for DDI research. In this paper, we introduce a tool, CuDDI (compute unified device architecture-based DDI searching), for the identification of DDI-related terms (including compounds and proteins) from PubMed. The application consists of three modules: automatic retrieval of substances from PubMed, identification of DDI-related terms, and display of relationships among DDI-related terms. For DDI term identification, a speedup of 30-105 times was observed for the compute unified device architecture (CUDA)-based version compared with a CPU-based Python implementation. CuDDI can be used to discover DDI-related terms and the relationships among them, which has the potential to help clinicians and pharmacists better understand the mechanisms of DDIs. CuDDI is available at: https://github.com/chengusf/CuDDI.
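
As a rough illustration of the term-identification step described above, the following Python sketch ranks candidate terms by how strongly they are enriched in DDI-related records relative to a background set. The record format and the enrichment heuristic are assumptions for illustration, not CuDDI's actual implementation.

```python
# Hypothetical sketch (not CuDDI's code): score candidate terms by how much more
# often they appear in DDI-related records than in a background record set.
from collections import Counter

def ddi_term_scores(ddi_records, background_records):
    """Each record is a list of substance/MeSH-like term strings."""
    ddi_counts = Counter(t for rec in ddi_records for t in set(rec))
    bg_counts = Counter(t for rec in background_records for t in set(rec))
    n_ddi, n_bg = len(ddi_records), len(background_records)
    scores = {}
    for term, c in ddi_counts.items():
        ddi_freq = c / n_ddi
        bg_freq = (bg_counts[term] + 1) / (n_bg + 1)  # smoothed background frequency
        scores[term] = ddi_freq / bg_freq              # enrichment ratio
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Example: terms enriched in the DDI set rank first.
ddi = [["warfarin", "CYP2C9", "ibuprofen"], ["warfarin", "CYP2C9"]]
bg = [["aspirin"], ["CYP2C9"], ["ibuprofen", "aspirin"]]
print(ddi_term_scores(ddi, bg)[:3])
```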


Subjects
Drug Interactions, PubMed, Algorithms, Data Mining, Humans, Publications, Software
2.
IEEE Trans Knowl Data Eng; 26(10): 2410-2424, 2014 Oct.
Article in English | MEDLINE | ID: mdl-25264418

ABSTRACT

This paper focuses on an important query in scientific simulation data analysis: the Spatial Distance Histogram (SDH). The computation time of an SDH query using the brute-force method is quadratic. Moreover, such queries are often executed continuously over certain time periods, increasing the total computation time. We propose a highly efficient approximate algorithm to compute SDH over consecutive time periods with provable error bounds. The key idea of our algorithm is to derive the statistical distribution of distances from the spatial and temporal characteristics of particles. Upon organizing the data into a Quad-tree-based structure, the spatiotemporal characteristics of particles in each node of the tree are acquired to determine the particles' spatial distribution as well as their temporal locality in consecutive time periods. We also report our efforts in implementing and optimizing the algorithm on Graphics Processing Units (GPUs) as a means to further improve efficiency. The accuracy and efficiency of the proposed algorithm are backed by mathematical analysis and the results of extensive experiments using data generated from real simulation studies.
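
To make the quadratic baseline concrete, here is a minimal Python sketch of the brute-force SDH computation the abstract contrasts against; the bucket width and input format are assumptions, and the paper's approximate algorithm is not reproduced here.

```python
# Brute-force SDH baseline: inspects all O(N^2) particle pairs, which is exactly
# the cost the approximate algorithm avoids.
import math

def brute_force_sdh(points, w, num_buckets):
    """points: list of (x, y, z); returns a histogram of pairwise distances."""
    hist = [0] * num_buckets
    for i in range(len(points)):
        xi, yi, zi = points[i]
        for j in range(i + 1, len(points)):
            xj, yj, zj = points[j]
            d = math.sqrt((xi - xj) ** 2 + (yi - yj) ** 2 + (zi - zj) ** 2)
            b = min(int(d / w), num_buckets - 1)  # clamp overflow into last bucket
            hist[b] += 1
    return hist

print(brute_force_sdh([(0, 0, 0), (1, 0, 0), (0, 2, 0)], w=1.0, num_buckets=4))
```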

3.
Knowl Inf Syst; 37(1): 219-244, 2013 Oct 01.
Article in English | MEDLINE | ID: mdl-24729652

ABSTRACT

Data uncertainty is inherent in many real-world applications such as sensor monitoring systems, location-based services, and medical diagnostic systems. Moreover, many real-world applications are now capable of producing continuous, unbounded data streams. During the recent years, new methods have been developed to find frequent patterns in uncertain databases; nevertheless, very limited work has been done in discovering frequent patterns in uncertain data streams. The current solutions for frequent pattern mining in uncertain streams take a FP-tree-based approach; however, recent studies have shown that FP-tree-based algorithms do not perform well in the presence of data uncertainty. In this paper, we propose two hyper-structure-based false-positive-oriented algorithms to efficiently mine frequent itemsets from streams of uncertain data. The first algorithm, UHS-Stream, is designed to find all frequent itemsets up to the current moment. The second algorithm, TFUHS-Stream, is designed to find frequent itemsets in an uncertain data stream in a time-fading manner. Experimental results show that the proposed hyper-structure-based algorithms outperform the existing tree-based algorithms in terms of accuracy, runtime, and memory usage.
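
For readers unfamiliar with uncertain frequent itemset mining, the sketch below shows the expected-support notion such algorithms typically build on; the data model (per-item existence probabilities, independence assumption) is a common convention and an assumption here, not the UHS-Stream/TFUHS-Stream code.

```python
# Expected support of an itemset over an uncertain transaction database: sum over
# transactions of the product of the items' existence probabilities.
from math import prod

def expected_support(itemset, uncertain_db):
    """uncertain_db: list of dicts mapping item -> existence probability."""
    total = 0.0
    for txn in uncertain_db:
        if all(item in txn for item in itemset):
            total += prod(txn[item] for item in itemset)
    return total

db = [{"a": 0.9, "b": 0.5}, {"a": 0.4, "b": 0.8, "c": 0.2}, {"b": 0.7}]
print(expected_support({"a", "b"}, db))  # 0.9*0.5 + 0.4*0.8 = 0.77
```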

4.
IEEE Trans Knowl Data Eng; 25(9): 1982-1996, 2012 Sep 01.
Article in English | MEDLINE | ID: mdl-24693210

ABSTRACT

Particle simulation has become an important research tool in many scientific and engineering fields. Data generated by such simulations impose great challenges on database storage and query processing. One of the queries against particle simulation data, the spatial distance histogram (SDH) query, is the building block of many high-level analytics, and it requires quadratic time to compute with a straightforward algorithm. Previous work has developed efficient algorithms that compute exact SDHs. While beating the naive solution, such algorithms are still not practical for processing SDH queries against large-scale simulation data. In this paper, we take a different path to tackle this problem by focusing on approximate algorithms with provable error bounds. We first present a solution derived from the aforementioned exact SDH algorithm whose running time is independent of the system size N. We also develop a mathematical model to analyze the mechanism that leads to errors in this basic approximate algorithm. Our model provides insights into how the algorithm can be improved to achieve higher accuracy and efficiency. These insights give rise to a new approximate algorithm with an improved time/accuracy tradeoff. Experimental results confirm our analysis.

5.
VLDB J; 20(4): 471-494, 2011 Aug 01.
Article in English | MEDLINE | ID: mdl-21804753

ABSTRACT

Many scientific and engineering fields produce large volumes of spatiotemporal data. The storage, retrieval, and analysis of such data impose great challenges on database system design. Analysis of scientific spatiotemporal data often involves computing functions of all point-to-point interactions. One such analytical query, the Spatial Distance Histogram (SDH), is of vital importance to scientific discovery. Recently, algorithms for efficient SDH processing in large-scale scientific databases have been proposed. These algorithms adopt a recursive tree-traversing strategy to process point-to-point distances in the visited tree nodes in batches, and thus require less time than the brute-force approach, in which all pairwise distances have to be computed. Despite the promising experimental results, the complexity of such algorithms has not been thoroughly studied. In this paper, we present an analysis of such algorithms based on a geometric modeling approach. The main technique is to transform the analysis of point counts into the problem of quantifying the area of regions where pairwise distances can be processed in batches by the algorithm. From the analysis, we conclude that the number of pairwise distances left to be processed decreases exponentially as more levels of the tree are visited. This leads to a proof of a time complexity lower than the quadratic time needed by a brute-force algorithm and builds the foundation for a constant-time approximate algorithm. Our model is also general in that it works for a wide range of point spatial distributions, histogram types, and space-partitioning options in building the tree.
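
The batching step analyzed above hinges on deciding when all cross-cell distances between two tree nodes fall into one histogram bucket. The following sketch illustrates that "resolve two cells" test for axis-aligned cells; it is an illustration of the idea under assumed inputs, not the paper's code.

```python
# Two cells are "resolvable" when the minimum and maximum possible distances
# between them land in the same histogram bucket, so all their cross-cell
# pairs can be counted in one batch.
import math

def min_max_cell_distance(a_lo, a_hi, b_lo, b_hi):
    """a_lo/a_hi, b_lo/b_hi: corner coordinate tuples of two axis-aligned cells."""
    min_sq = max_sq = 0.0
    for al, ah, bl, bh in zip(a_lo, a_hi, b_lo, b_hi):
        gap = max(bl - ah, al - bh, 0.0)   # per-axis separation (0 if overlapping)
        span = max(bh - al, ah - bl)       # per-axis farthest separation
        min_sq += gap * gap
        max_sq += span * span
    return math.sqrt(min_sq), math.sqrt(max_sq)

def resolvable(a_lo, a_hi, b_lo, b_hi, bucket_width):
    d_min, d_max = min_max_cell_distance(a_lo, a_hi, b_lo, b_hi)
    return int(d_min // bucket_width) == int(d_max // bucket_width)

# Two small, well-separated cells resolve into a single bucket of width 5.
print(resolvable((0, 0, 0), (0.5, 0.5, 0.5), (12, 0, 0), (12.5, 0.5, 0.5), 5.0))
```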

6.
Data Min Knowl Discov; 34(4): 980-1021, 2020 Jul.
Article in English | MEDLINE | ID: mdl-38390222

ABSTRACT

In recent years, the popularity of graph databases has grown rapidly. This paper focuses on the single-graph setting as an effective model for representing information and on its related graph mining techniques. Frequent pattern mining in a single-graph setting poses two main problems: the support measure and the search scheme. In this paper, we propose a novel framework for designing support measures that brings together existing minimum-image-based and overlap-graph-based support measures. Our framework is built on the concept of occurrence/instance hypergraphs. Based on it, we design a series of new support measures, the minimum instance (MI) measure and the minimum vertex cover (MVC) measure, which combine the advantages of existing measures. More importantly, we show that the existing minimum-image-based support measure is an upper bound of the MI measure, which is also linear-time computable and yields counts that are close to the number of instances of a pattern. We show not only that most major existing support measures and the new measures proposed in this paper can be mapped into the new framework, but also that they occupy different locations of the frequency spectrum. By taking advantage of the new framework, we discover that MVC can be approximated to a constant factor (in terms of the number of pattern nodes) in polynomial time. Contrary to common belief, we demonstrate that the state-of-the-art overlap-graph-based maximum independent set (MIS) measure also has constant-factor approximation algorithms. We further show that, using standard linear programming and semidefinite programming techniques, polynomial-time relaxations can be developed for both the MVC and MIS measures, and their counts stand between MVC and MIS. In addition, we point out that MVC, MIS, and their relaxations are bounded within a constant factor of one another. In summary, all major support measures are unified in the new hypergraph-based framework, which helps reveal their bounding relations and hardness properties.
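
For context on the minimum-image-based measure referenced above, the sketch below computes it from a given set of pattern embeddings: for each pattern node, count the distinct data-graph vertices it maps to, and take the minimum. Embedding enumeration is assumed to be provided; this is an illustration, not the paper's framework.

```python
# Minimum-image-based (MNI-style) support from a list of embeddings.
def minimum_image_support(embeddings):
    """embeddings: list of dicts mapping pattern node -> data-graph vertex."""
    if not embeddings:
        return 0
    images = {}
    for emb in embeddings:
        for p_node, d_vertex in emb.items():
            images.setdefault(p_node, set()).add(d_vertex)
    return min(len(vs) for vs in images.values())

# Three embeddings of a two-node pattern (u, v) in some data graph.
embs = [{"u": 1, "v": 2}, {"u": 1, "v": 3}, {"u": 4, "v": 3}]
print(minimum_image_support(embs))  # u -> {1, 4}, v -> {2, 3} => support 2
```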

7.
Proceedings VLDB Endowment; 14(4): 708-720, 2020 Dec.
Article in English | MEDLINE | ID: mdl-38260211

ABSTRACT

Relational join processing is one of the core functionalities in database management systems. It has been demonstrated that GPUs, as a general-purpose parallel computing platform, are very promising for processing relational joins. However, join algorithms often need to handle very large input data, an issue that was not sufficiently addressed in existing work. Moreover, as more and more desktop and workstation platforms support multi-GPU environments, the combined computing capability of multiple GPUs can easily match that of a computing cluster, so it is worth exploring how join processing would benefit from the adoption of multiple GPUs. We identify the low rate and complex patterns of data transfer between the CPU and GPUs as the main challenges in designing efficient algorithms for large table joins. To overcome these challenges, we propose three distinct designs of multi-GPU join algorithms, namely the nested loop, global sort-merge, and hybrid joins, for large table joins with different join conditions. Extensive experiments running on multiple databases and two different hardware configurations demonstrate the high scalability of our algorithms over data size and the significant performance boost brought by the use of multiple GPUs. Furthermore, our algorithms achieve much better performance than existing join algorithms, with speedups of up to 25X and 2.8X over the best known code developed for multi-core CPUs and GPUs, respectively.
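
As a purely conceptual, CPU-side illustration of how a large nested-loop join can be broken into chunk pairs and spread across devices, consider the sketch below. The chunking scheme and round-robin assignment are assumptions for illustration; the paper's algorithms are CUDA implementations and handle data transfer, which this outline does not model.

```python
# Decompose a nested-loop join into chunk pairs and assign them round-robin to
# simulated devices; each device produces its own output buffer.
def chunk(table, size):
    return [table[i:i + size] for i in range(0, len(table), size)]

def nested_loop_join_chunked(r, s, pred, chunk_size, num_devices=2):
    r_chunks, s_chunks = chunk(r, chunk_size), chunk(s, chunk_size)
    tasks = [(rc, sc) for rc in r_chunks for sc in s_chunks]
    per_device = [[] for _ in range(num_devices)]   # output buffer per (simulated) GPU
    for task_id, (rc, sc) in enumerate(tasks):
        dev = task_id % num_devices                 # round-robin chunk-pair assignment
        per_device[dev] += [(a, b) for a in rc for b in sc if pred(a, b)]
    return [row for buf in per_device for row in buf]

r = [(1, "x"), (2, "y"), (3, "z")]
s = [(2, "p"), (3, "q"), (5, "r")]
print(nested_loop_join_chunked(r, s, lambda a, b: a[0] == b[0], chunk_size=2))
```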

8.
PLoS One; 14(4): e0214720, 2019.
Article in English | MEDLINE | ID: mdl-30990851

ABSTRACT

The unrivaled computing capabilities of modern GPUs meet the demand of processing massive amounts of data seen in many application domains. While traditional HPC systems support applications as standalone entities that occupy entire GPUs, there are GPU-based DBMSs in which multiple tasks are meant to run at the same time on the same device. To that end, system-level resource management mechanisms are needed to fully unleash the computing power of GPUs in large-scale data processing, and some research has focused on this problem. In our previous work, we explored performance modeling of single compute-bound kernels on GPUs under NVIDIA's CUDA framework and provided an in-depth anatomy of NVIDIA's concurrent kernel execution mechanism (CUDA streams). This paper focuses on resource allocation for multiple GPU applications toward optimizing system throughput in the context of such systems. Compared to earlier studies of enabling concurrent task support on GPUs, such as MultiQx-GPU, we take a different approach: we control the launching parameters of multiple GPU kernels, as provided by compile-time performance modeling, as a kernel-level optimization, and we add a more general pre-processing model with batch-level control to further enhance performance. Specifically, we construct a variant of the multi-dimensional knapsack model to maximize concurrency in a multi-kernel environment. We present an in-depth analysis of our model and develop an algorithm based on dynamic programming to solve it. We prove that the algorithm finds optimal solutions (in terms of thread concurrency) to the problem and has pseudo-polynomial time and space complexity. These results are verified by extensive experiments on our microbenchmark, which consists of real-world GPU queries. Furthermore, the solutions identified by our method also significantly reduce the total running time of the workload compared to sequential and MultiQx-GPU executions.
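
To make the knapsack formulation concrete, here is a deliberately simplified, one-dimensional 0/1 knapsack solved by dynamic programming: choose kernel configurations to maximize total thread concurrency under a single resource budget. The paper's model is multi-dimensional and richer; the costs and values below are made-up illustrations.

```python
# 0/1 knapsack DP: maximize total threads subject to one resource budget
# (e.g., registers or shared memory per SM). Pseudo-polynomial in the budget.
def max_concurrency(kernels, budget):
    """kernels: list of (resource_cost, threads). Returns best total threads."""
    best = [0] * (budget + 1)
    for cost, threads in kernels:
        for b in range(budget, cost - 1, -1):   # iterate backwards for 0/1 choice
            best[b] = max(best[b], best[b - cost] + threads)
    return best[budget]

kernels = [(30, 256), (20, 192), (50, 512), (10, 64)]
print(max_concurrency(kernels, budget=60))  # e.g. choose (50, 512) + (10, 64) = 576
```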


Assuntos
Bases de Dados Factuais , Gráficos por Computador , Software
9.
Proc ACM SIGMOD Int Conf Manag Data; 2017: 391-402, 2017 May.
Article in English | MEDLINE | ID: mdl-38425568

ABSTRACT

In recent years, the popularity of graph databases has grown rapidly. This paper focuses on the single-graph setting as an effective model for representing information and on its related graph mining techniques. Frequent pattern mining in a single-graph setting poses two main problems: the support measure and the search scheme. In this paper, we propose a novel framework for constructing support measures that brings together existing minimum-image-based and overlap-graph-based support measures. Our framework is built on the concept of occurrence/instance hypergraphs. Based on it, we present two new support measures, the minimum instance (MI) measure and the minimum vertex cover (MVC) measure, which combine the advantages of existing measures. In particular, we show that the existing minimum-image-based support measure is an upper bound of the MI measure, which is also linear-time computable and yields counts that are close to the number of instances of a pattern. Although the MVC measure is NP-hard, it can be approximated to a constant factor in polynomial time. We also provide polynomial-time relaxations for both measures and bounding theorems for all presented support measures in the hypergraph setting. We further show that the hypergraph-based framework can unify all support measures studied in this paper. The framework is also flexible in that more variants of support measures can be defined and profiled within it.

10.
Article in English | MEDLINE | ID: mdl-38298773

ABSTRACT

Processing relational joins on modern GPUs has attracted much attention in the past few years. With the rapid development of GPU hardware and software environments, existing GPU join algorithms designed for earlier architectures cannot make the most of the latest GPU products. In this paper, we report a new design and implementation of join algorithms with high performance in today's GPGPU environment. This is a key component of our scientific database engine named G-SDMS. In particular, we overhaul the popular radix hash join and redesign sort-merge join algorithms on GPUs by applying a series of novel techniques to utilize the hardware capacity of the latest NVIDIA GPU architectures and new features of the CUDA programming framework. Our algorithms take advantage of the revised hardware arrangement, larger register files and shared memory, native atomic operations, dynamic parallelism, and CUDA streams. Experiments show that our new hash join algorithm is 2.0 to 14.6 times as efficient as existing GPU implementations, while the new sort-merge join achieves a speedup of 4.0X to 4.9X. Compared to the best CPU sort-merge join and hash join known to date, our optimized code achieves speedups of up to 10.5X and 5.5X, respectively. Moreover, we extend our design to scenarios where large data tables cannot fit in GPU memory.
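
The radix hash join mentioned above can be outlined on the CPU as follows: partition both inputs by the low bits of the key, then build and probe a hash table per partition pair. This Python sketch only mirrors the partitioning idea; the paper's versions are CUDA kernels exploiting shared memory, atomics, and streams, and the number of radix bits here is an assumption.

```python
# Radix-partitioned hash join (CPU illustration). Tuples are (key, payload).
def radix_partition(table, bits):
    parts = [[] for _ in range(1 << bits)]
    mask = (1 << bits) - 1
    for key, payload in table:
        parts[key & mask].append((key, payload))
    return parts

def radix_hash_join(r, s, bits=4):
    out = []
    for rp, sp in zip(radix_partition(r, bits), radix_partition(s, bits)):
        ht = {}
        for key, payload in rp:              # build phase on one side
            ht.setdefault(key, []).append(payload)
        for key, payload in sp:              # probe phase on the other side
            for r_payload in ht.get(key, []):
                out.append((key, r_payload, payload))
    return out

print(radix_hash_join([(1, "a"), (2, "b"), (18, "c")], [(2, "x"), (18, "y")]))
```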

11.
PLoS One; 12(4): e0173548, 2017.
Article in English | MEDLINE | ID: mdl-28422961

ABSTRACT

Identifying drug-drug interactions (DDIs) is an important topic for the development of safe pharmaceutical drugs and for the optimization of multidrug regimens for complex diseases such as cancer and HIV. There have been about 150,000 publications on DDIs in PubMed, which makes it a great resource for DDI studies. In this paper, we introduce an automatic computational method for the systematic analysis of the mechanisms of DDIs using MeSH (Medical Subject Headings) terms from the PubMed literature. MeSH is a controlled-vocabulary thesaurus developed by the National Library of Medicine for indexing and annotating articles. Our method can effectively identify DDI-relevant MeSH terms, such as drugs, proteins, and phenomena, with high accuracy. The connections among these MeSH terms were investigated using co-occurrence heatmaps and social network analysis. Our approach can be used to visualize relationships among DDI terms, which has the potential to help users better understand DDIs. As the volume of PubMed records increases, our method for the automatic analysis of DDIs from the PubMed database will become more accurate.
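
The co-occurrence analysis behind the heatmaps and term networks boils down to counting how often two terms appear in the same record. The sketch below shows that counting step on made-up example records; it is not the paper's pipeline.

```python
# Count pairwise co-occurrences of MeSH-like terms across records.
from collections import Counter
from itertools import combinations

def cooccurrence(records):
    counts = Counter()
    for terms in records:
        for a, b in combinations(sorted(set(terms)), 2):
            counts[(a, b)] += 1
    return counts

records = [
    ["Warfarin", "Cytochrome P-450 Enzyme System", "Drug Interactions"],
    ["Warfarin", "Drug Interactions", "Hemorrhage"],
    ["Ibuprofen", "Drug Interactions"],
]
for pair, n in cooccurrence(records).most_common(3):
    print(pair, n)
```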


Assuntos
Interações Medicamentosas , Medical Subject Headings , Medicamentos sob Prescrição/farmacologia , PubMed/estatística & dados numéricos , Algoritmos , Sistema Enzimático do Citocromo P-450/metabolismo , Humanos , Curva ROC
12.
Proc Int Conf Web Inf Syst Eng; 9419: 458-472, 2015 Nov.
Article in English | MEDLINE | ID: mdl-26997936

ABSTRACT

Privacy and usage restriction issues are important when valuable data are exchanged or acquired by different organizations. Standard access control mechanisms either restrict or completely grant access to valuable data. On the other hand, data obfuscation limits overall usability and may result in a loss of total value. There are no standard policy enforcement mechanisms for data acquired through mutual or copyright agreements. In practice, many different types of policies can be enforced to protect data privacy. Hence, there is a need for a unified framework that encapsulates multiple suites of policies to protect the data. We present our vision of an architecture named the security automata model (SAM) to enforce privacy-preserving policies and usage restrictions. SAM analyzes input queries and their outputs to enforce various policies, liberating data owners from the burden of monitoring data access. SAM allows administrators to specify various policies and enforces them to monitor queries and control data access. Our goal is to address the problems of data usage control and protection through privacy policies that can be defined, enforced, and integrated with existing access control mechanisms using SAM. In this paper, we lay out the theoretical foundation of SAM, which is based on an automaton model named Mandatory Result Automata. We also discuss the major challenges of implementing SAM in a real-world database environment, as well as ideas to meet such challenges.

13.
Sci Rep; 5: 17357, 2015 Nov 27.
Article in English | MEDLINE | ID: mdl-26612138

ABSTRACT

Drug-drug interaction (DDI) is becoming a serious clinical safety issue as the use of multiple medications becomes more common. Searching the MEDLINE database for journal articles related to DDI produces over 330,000 results. It is impossible to read and summarize these references manually. As the volume of biomedical references in the MEDLINE database continues to expand at a rapid pace, automatic identification of DDIs from the literature is becoming increasingly important. In this article, we present a random-sampling-based statistical algorithm to identify possible DDIs and the underlying mechanisms from the substances field of MEDLINE records. The substance terms are essentially carriers of compound (including protein) information in a MEDLINE record. Four case studies on warfarin, ibuprofen, furosemide, and sertraline showed that our method was able to rank possible DDIs with high accuracy (90.0% for warfarin, 83.3% for ibuprofen, 70.0% for furosemide, and 100% for sertraline in the top 10% of a list of compounds ranked by p-value). A social network analysis of substance terms was also performed to construct networks between proteins and drug pairs to elucidate how the two drugs could interact.
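
In the spirit of the random-sampling test described above, the sketch below estimates an empirical p-value for a substance term by comparing its frequency in records mentioning both drugs against repeated random samples of the whole collection. The exact statistic, record model, and correction used here are assumptions, not the paper's algorithm.

```python
# Empirical p-value from random resampling: how often does a random sample of the
# same size contain the term at least as often as the drug-pair records do?
import random

def sampling_p_value(term, pair_records, all_records, trials=1000, seed=0):
    rng = random.Random(seed)
    observed = sum(term in rec for rec in pair_records)
    n = len(pair_records)
    hits = 0
    for _ in range(trials):
        sample = rng.sample(all_records, n)
        if sum(term in rec for rec in sample) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)   # add-one correction for an empirical p-value

all_recs = [{"warfarin"}, {"ibuprofen", "CYP2C9"}, {"CYP2C9"}, {"aspirin"}] * 50
pair_recs = [{"warfarin", "ibuprofen", "CYP2C9"}, {"warfarin", "ibuprofen", "CYP2C9"}]
print(sampling_p_value("CYP2C9", pair_recs, all_recs))
```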


Assuntos
Algoritmos , Furosemida/uso terapêutico , Ibuprofeno/uso terapêutico , Sertralina/uso terapêutico , Varfarina/uso terapêutico , Contraindicações , Interações Medicamentosas , Enzimas/metabolismo , Humanos , Hidrólise , MEDLINE/estatística & dados numéricos
14.
J Big Data; 2(1): 9, 2015.
Article in English | MEDLINE | ID: mdl-26069879

ABSTRACT

Molecular Simulation (MS) is a powerful tool for studying the physical/chemical features of large systems and has seen applications in many scientific and engineering domains. During the simulation process, the experiments generate data on a very large number of atoms, whose spatial and temporal relationships need to be observed for scientific analysis. The sheer data volumes and their intensive interactions impose significant challenges for data access, management, and analysis. To date, existing MS software systems fall short on the storage and handling of MS data, mainly because they lack a platform to support applications that involve intensive data access and analytical processing. In this paper, we present the database-centric molecular simulation (DCMS) system our team has developed over the past few years. The main idea behind DCMS is to store MS data in a relational database management system (DBMS) to take advantage of the declarative query interface (i.e., SQL), data access methods, query processing, and optimization mechanisms of modern DBMSs. A unique challenge is handling analytical queries that are often compute-intensive. For that, we developed novel indexing and query processing strategies (including algorithms running on modern co-processors) as integrated components of the DBMS. As a result, researchers can upload and analyze their data using efficient functions implemented inside the DBMS. Index structures are generated to store analysis results that may be of interest to other users, so that the results are readily available without duplicating the analysis. We have developed a prototype of DCMS based on the PostgreSQL system, and experiments using real MS data and workloads show that DCMS significantly outperforms existing MS software systems. We have also used it as a platform to study other data management issues such as security and compression.
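
A minimal illustration of the database-centric idea, using SQLite only because it ships with Python: particle frames are stored relationally and analyzed with declarative SQL. DCMS itself is built on PostgreSQL with in-engine analytics, and the table and column names below are assumptions for the example.

```python
# Store toy trajectory frames in a relational table and run an analytical query.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE atoms (frame INT, atom_id INT, x REAL, y REAL, z REAL)")
conn.executemany(
    "INSERT INTO atoms VALUES (?, ?, ?, ?, ?)",
    [(0, 1, 0.0, 0.0, 0.0), (0, 2, 1.0, 0.0, 0.0),
     (1, 1, 0.1, 0.0, 0.0), (1, 2, 1.2, 0.0, 0.0)],
)
# Example analytical query: per-frame squared distance between atoms 1 and 2.
rows = conn.execute("""
    SELECT a.frame,
           (a.x - b.x)*(a.x - b.x) + (a.y - b.y)*(a.y - b.y) + (a.z - b.z)*(a.z - b.z)
    FROM atoms a JOIN atoms b
      ON a.frame = b.frame AND a.atom_id = 1 AND b.atom_id = 2
""").fetchall()
print(rows)  # squared distances; take the square root in application code if needed
```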

15.
Proc IEEE Int Conf Big Data; 2014: 301-310, 2014 Oct.
Article in English | MEDLINE | ID: mdl-26566545

ABSTRACT

A push-based database management system (DBMS) is a new type of data processing software that streams large volumes of data to concurrent query operators. The high data rate of such systems requires large computing power from the query engine. In our previous work, we built a push-based DBMS named G-SDMS to harness the unrivaled computational capabilities of modern GPUs. A major design goal of G-SDMS is to support concurrent processing of heterogeneous query operations and to enable resource allocation among such operations. Understanding the performance of operations as a result of resource consumption is thus a prerequisite for the design of G-SDMS. With NVIDIA's CUDA framework as the system implementation platform, we present our recent work on performance modeling of CUDA kernels running concurrently under a runtime mechanism named CUDA streams. Specifically, we explore the connection between performance and resource occupancy of compute-bound kernels and develop a model that can predict the performance of such kernels. Furthermore, we provide an in-depth anatomy of the CUDA stream mechanism and summarize its main kernel scheduling disciplines. Our models and derived scheduling disciplines are verified by extensive experiments using synthetic and real-world CUDA kernels.
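
Resource occupancy, which the modeling work above builds on, can be estimated back-of-the-envelope as follows: the number of resident blocks per SM is bounded by thread slots, registers, and shared memory, and occupancy follows from the tightest bound. The per-SM limits below are assumptions that vary by GPU generation, and real allocation granularity is ignored; this is not the paper's model.

```python
# Rough theoretical-occupancy estimate from per-block resource usage.
def occupancy(threads_per_block, regs_per_thread, smem_per_block,
              max_threads=2048, max_regs=65536, max_smem=98304, max_blocks=32):
    by_threads = max_threads // threads_per_block
    by_regs = max_regs // (regs_per_thread * threads_per_block)
    by_smem = max_smem // smem_per_block if smem_per_block else max_blocks
    blocks = min(by_threads, by_regs, by_smem, max_blocks)   # tightest resource bound
    return blocks * threads_per_block / max_threads

print(occupancy(threads_per_block=256, regs_per_thread=40, smem_per_block=16384))
```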

16.
Article in English | MEDLINE | ID: mdl-26973983

ABSTRACT

In this paper, we present a compression framework for molecular dynamics (MD) simulation data that achieves substantial compression by combining the strengths of principal component analysis (PCA) and the discrete cosine transform (DCT). Although it is a lossy compression technique, its effect on analytics performed on the decompressed data is minimal. Compression ratios of up to 13 are achieved with acceptable errors in the results of analytical functions.
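
A conceptual sketch of a PCA-plus-DCT pipeline for trajectory frames follows; the transform ordering, truncation scheme, and toy data are assumptions for illustration, and the paper's quantization and error control are not reproduced. It requires numpy and scipy.

```python
# PCA over flattened frames, then a DCT along time with high-frequency truncation.
import numpy as np
from scipy.fft import dct, idct

def compress(frames, n_components, n_coeffs):
    """frames: (num_frames, num_coords) array of flattened atom coordinates."""
    mean = frames.mean(axis=0)
    centered = frames - mean
    _, _, vt = np.linalg.svd(centered, full_matrices=False)  # PCA via SVD
    basis = vt[:n_components]                 # (n_components, num_coords)
    scores = centered @ basis.T               # per-frame component scores
    coeffs = dct(scores, axis=0, norm="ortho")[:n_coeffs]    # keep low frequencies
    return mean, basis, coeffs, frames.shape[0]

def decompress(mean, basis, coeffs, num_frames):
    padded = np.zeros((num_frames, coeffs.shape[1]))
    padded[: coeffs.shape[0]] = coeffs
    scores = idct(padded, axis=0, norm="ortho")
    return scores @ basis + mean

rng = np.random.default_rng(0)
traj = rng.normal(size=(64, 30)).cumsum(axis=0)   # toy random-walk "trajectory"
packed = compress(traj, n_components=8, n_coeffs=32)
recon = decompress(*packed)
print(float(np.abs(traj - recon).mean()))         # mean absolute reconstruction error
```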

17.
ACM Trans Knowl Discov Data; 7(4): 17, 2013 Nov.
Article in English | MEDLINE | ID: mdl-24729776

ABSTRACT

With the increasing popularity of social websites, online users rely more on trustworthiness information to make decisions, extract and filter information, and tag and build connections with other users. However, such social network data often suffer from severe data sparsity and cannot provide users with enough information. Therefore, trust prediction has emerged as an important topic in social network research. Traditional approaches are primarily based on exploring the trust graph topology itself. However, research in sociology and our life experience suggest that people who are in the same social circle often exhibit similar behaviors and tastes. To take advantage of such ancillary information for trust prediction, the challenge then becomes what to transfer and how to transfer it. In this article, we address this problem by aggregating heterogeneous social networks and propose a novel joint social networks mining (JSNM) method. Our new joint learning model explores the user-group-level similarity between correlated graphs and simultaneously learns the individual graph structure; therefore, the shared structures and patterns from multiple social networks can be utilized to enhance the prediction tasks. As a result, we not only improve trust prediction in the target graph but also facilitate other information retrieval tasks in the auxiliary graphs. To optimize the proposed objective function, we use an alternating optimization technique to break the objective function down into several manageable subproblems. We further introduce an auxiliary function to solve the optimization problems with rigorously proven convergence. Extensive experiments have been conducted on both synthetic and real-world data. All empirical results demonstrate the effectiveness of our method.

18.
ACM BCB; 2012: 527-529, 2012 Oct.
Article in English | MEDLINE | ID: mdl-26566544

ABSTRACT

Analysis of large particle or molecular simulation data is an integral part of basic-science research. It often involves computing functions such as point-to-point interactions of particles. The spatial distance histogram (SDH) is one such vital computation in scientific discovery. SDH is frequently used to compute the radial distribution function (RDF), and it takes quadratic time to compute with a naive approach. Naive SDH computation is even more expensive when it is computed continuously over a certain period of time to analyze simulation systems. Tree-based SDH computation is a popular alternative. In this paper, we look at different tree-based SDH computation techniques and briefly discuss their performance. We present different strategies to improve the performance of these techniques. Specifically, we study density map (DM)-based SDH computation techniques. A DM is essentially a grid dividing the simulated space into cells (3D cubes) of equal size (volume), which can be easily implemented by augmenting a Quad-tree (or Oct-tree) index. DMs are used in various configurations to compute SDH continuously over snapshots of the simulation system. The performance improvements obtained with some of these configurations are presented in this paper. We also present the effect of utilizing the computational power of Graphics Processing Units (GPUs) in computing SDH.

19.
Article in English | MEDLINE | ID: mdl-24378961

ABSTRACT

The large volumes of data generated by scientific applications impose challenges on storage and efficient query processing. Many queries against scientific data are analytical in nature and require super-linear computation time using straightforward methods. The spatial distance histogram (SDH) is one of the basic queries used to analyze molecular simulation (MS) data, and it takes quadratic time to compute with a brute-force approach. Often, an SDH query is executed continuously to analyze the simulation system over a period of time, which adds to the total time required to compute SDH. In this paper, we propose an approximate algorithm to compute SDH efficiently over consecutive time periods. In our approach, data are organized into a Quad-tree-based data structure. The spatial locality of the particles (at a given time) in each node of the tree is acquired to determine the particle distribution. Similarly, the temporal locality of particles (between consecutive time periods) in each node is also acquired. The spatial distribution and temporal locality are utilized to compute the approximate SDH at every time instant. Performance is boosted by storing and updating the spatial distribution information over time. The efficiency and accuracy of the proposed algorithm are supported by mathematical analysis and the results of extensive experiments using biological data generated from real MS studies.

20.
Evol Comput; 16(3): 355-84, 2008.
Article in English | MEDLINE | ID: mdl-18811246

ABSTRACT

We present a new non-dominated sorting algorithm to generate the non-dominated fronts in multi-objective optimization with evolutionary algorithms, particularly NSGA-II. The non-dominated sorting algorithm used by NSGA-II has a time complexity of O(MN²) for generating non-dominated fronts in one generation (iteration) for a population of size N and M objective functions. Since generating non-dominated fronts takes the majority of the total computational time (excluding the cost of fitness evaluations) of NSGA-II, making this step faster will significantly improve the overall efficiency of NSGA-II and other genetic algorithms that use non-dominated sorting. The new non-dominated sorting algorithm proposed in this study reduces the number of redundant comparisons present in the algorithm of NSGA-II by recording the dominance information among solutions from their first comparisons. By utilizing a new data structure called the dominance tree and the divide-and-conquer mechanism, the new algorithm is faster than NSGA-II for different numbers of objective functions. Although the number of solution comparisons performed by the proposed algorithm is close to that of NSGA-II when the number of objectives becomes large, the total computational time shows that the proposed algorithm still has better efficiency because of the adoption of the dominance tree structure and the divide-and-conquer mechanism.
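
For reference, here is a sketch of Pareto dominance and the O(MN²) fast non-dominated sort that NSGA-II uses as the baseline discussed above; minimization of all objectives is assumed, and the paper's faster dominance-tree, divide-and-conquer algorithm is not reproduced.

```python
# NSGA-II-style fast non-dominated sorting via domination counts.
def dominates(a, b):
    """a dominates b if it is no worse in every objective and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def fast_non_dominated_sort(pop):
    n = len(pop)
    dominated_by = [[] for _ in range(n)]   # S_p: solutions dominated by p
    dom_count = [0] * n                     # n_p: how many solutions dominate p
    fronts = [[]]
    for p in range(n):
        for q in range(n):
            if p == q:
                continue
            if dominates(pop[p], pop[q]):
                dominated_by[p].append(q)
            elif dominates(pop[q], pop[p]):
                dom_count[p] += 1
        if dom_count[p] == 0:
            fronts[0].append(p)
    i = 0
    while fronts[i]:
        nxt = []
        for p in fronts[i]:
            for q in dominated_by[p]:
                dom_count[q] -= 1
                if dom_count[q] == 0:
                    nxt.append(q)
        fronts.append(nxt)
        i += 1
    return fronts[:-1]                      # drop the trailing empty front

pop = [(1, 5), (2, 2), (3, 1), (4, 4), (5, 3)]
print(fast_non_dominated_sort(pop))  # [[0, 1, 2], [3, 4]]
```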


Subjects
Algorithms, Theoretical Models