Results 1 - 20 of 27
1.
ICS ; 2023: 155-166, 2023 Jun.
Article in English | MEDLINE | ID: mdl-37584044

ABSTRACT

Advancements in Next-Generation Sequencing (NGS) have significantly reduced the cost of generating DNA sequence data and increased the speed of data production. However, such high-throughput data production has increased the need for efficient data analysis programs. One of the most computationally demanding steps in analyzing sequencing data is mapping short reads produced by NGS to a reference DNA sequence, such as a human genome. The mapping program BWA-MEM and its newer version, BWA-MEM2, both optimized for CPUs, are among the most popular choices for this task. In this study, we discuss the implementation of BWA-MEM on GPUs. This is a challenging task because many algorithms and data structures in BWA-MEM do not execute efficiently on the GPU architecture. This paper identifies major challenges in developing efficient GPU code for all major stages of the BWA-MEM program, including seeding, seed chaining, Smith-Waterman alignment, memory management, and I/O handling. We conduct comparison experiments against BWA-MEM and BWA-MEM2 running on a 64-thread CPU. The results show that our implementation achieved up to 3.2x speedup over BWA-MEM2 and up to 5.8x over BWA-MEM when using an NVIDIA A40. Using an NVIDIA A6000 and an NVIDIA A100, we achieved wall-time speedups of up to 3.4x/3.8x over BWA-MEM2 and up to 6.1x/6.8x over BWA-MEM, respectively. In a stage-wise comparison, the A40/A6000/A100 GPUs respectively achieved up to 3.7/3.8/4x, 2/2.3/2.5x, and 3.1/5/7.9x speedup on the three major stages of BWA-MEM: seeding and seed chaining, Smith-Waterman, and SAM output generation. To the best of our knowledge, this is the first study that attempts to implement the entire BWA-MEM program on GPUs.
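To make the Smith-Waterman stage concrete, here is a minimal CPU-side Python sketch of the local-alignment recurrence; the scoring parameters are illustrative assumptions, not BWA-MEM's actual settings, and the real GPU kernels are banded and heavily batched.

```python
def smith_waterman(a, b, match=2, mismatch=-2, gap=-1):
    """Minimal Smith-Waterman local alignment score (CPU sketch).

    Scoring parameters are illustrative, not BWA-MEM's defaults.
    Returns the best local alignment score between strings a and b.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # local alignment: scores are clamped at zero
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```

On a GPU, each thread or warp would typically score one read-reference pair, with the dynamic-programming matrix tiled through registers and shared memory.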

2.
ICS ; 2022, 2022 Jun.
Article in English | MEDLINE | ID: mdl-35943281

ABSTRACT

Due to the high level of parallelism, there are unique challenges in developing system software on massively parallel hardware such as GPUs. One such challenge is designing a dynamic memory allocator, whose task is to allocate memory chunks to requesting threads at runtime. State-of-the-art GPU memory allocators maintain a global data structure holding metadata to facilitate allocation and deallocation. However, this centralized data structure can easily become a bottleneck in a massively parallel system. In this paper, we present a novel approach to designing dynamic memory allocation without a centralized data structure. The core idea is to let threads follow a random search procedure to locate free pages. We then extend this idea to more advanced designs and algorithms that achieve an order of magnitude improvement over the basic approach. We present mathematical proofs to demonstrate that (1) the basic random search design achieves asymptotically lower latency than the traditional queue-based design and (2) the advanced designs achieve significant improvement over the basic idea. Extensive experiments show consistency with our mathematical models and demonstrate that our solutions can achieve up to two orders of magnitude improvement in latency over the best-known existing solutions.
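The core random-search idea can be sketched sequentially in Python; a real GPU allocator would run this per thread, claiming a page with an atomic compare-and-swap, and the function and parameter names below are our own illustrative choices.

```python
import random

def random_search_alloc(free_pages, rng=random.Random(0), max_tries=10_000):
    """Sketch of random-search allocation: probe random page slots until a
    free one is found, with no centralized free-list or queue.

    `free_pages` is a list of booleans (True = free); this sequential loop
    stands in for what each GPU thread would do in parallel.
    """
    n = len(free_pages)
    for _ in range(max_tries):
        slot = rng.randrange(n)
        if free_pages[slot]:
            free_pages[slot] = False  # claim (an atomic CAS on a real GPU)
            return slot
    raise MemoryError("no free page found")
```

Because there is no shared metadata beyond the page flags themselves, contention is spread across the whole page array instead of a single queue head.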

3.
Proc IEEE Int Conf Big Data ; 2022: 252-261, 2022 Dec.
Article in English | MEDLINE | ID: mdl-37637192

ABSTRACT

Sharing data and computation among concurrent queries has been an active research topic in database systems. While work in this area has developed algorithms and systems that are shown to be effective, a logical foundation for such query processing and optimization is lacking. In this paper, we present PsiDB, a system model for processing a large number of database queries in a batch. The key idea is to generate a single query expression that returns a global relation containing all the data needed for the individual queries. For that, we propose a type of relational operator called the ψ-operator for combining the individual queries into the global expression. We tackle the algebraic optimization problem in PsiDB by developing equivalence rules to transform concurrent queries with the purpose of revealing query optimization opportunities. Centering around the ψ-operator, our rules not only cover many optimization techniques adopted in existing batch processing systems, but also reveal new optimization opportunities. Experiments conducted on an early prototype of PsiDB show a performance improvement of up to 36X over a mainstream commercial DBMS.

4.
Genome Med ; 12(1): 103, 2020 12 02.
Article in English | MEDLINE | ID: mdl-33261662

ABSTRACT

Whole exome sequencing has been increasingly used in human disease studies. Prioritization based on appropriate functional annotations has become an indispensable step in selecting candidate variants. Here we present the latest updates to dbNSFP (version 4.1), a database designed to facilitate this step by providing deleteriousness predictions and functional annotations for all potential nonsynonymous and splice-site SNVs (a total of 84,013,093) in the human genome. The current version compiles 36 deleteriousness prediction scores, including 12 transcript-specific scores, along with other variant- and gene-level functional annotations. The database is available at http://database.liulab.science/dbNSFP with a downloadable version and a web service.


Subject(s)
Databases, Nucleic Acid; Genome, Human; Molecular Sequence Annotation; Computational Biology; Humans; Mutation; Software; Exome Sequencing
5.
Hum Mutat ; 41(6): 1123-1130, 2020 06.
Article in English | MEDLINE | ID: mdl-32227657

ABSTRACT

MicroRNAs (miRNAs) are short noncoding RNAs that can repress the expression of protein-coding messenger RNAs (mRNAs) by binding to the 3'-untranslated region (UTR) of the target. Genetic mutations such as single nucleotide variants (SNVs) in the 3'-UTR of an mRNA can disrupt miRNA regulation. In this study, we present dbMTS, a database of miRNA target site (MTS) SNVs and their functional annotations. This database can help studies easily identify putative SNVs that affect miRNA targeting and facilitate the prioritization of their functional importance. dbMTS is freely available for academic use at http://database.liulab.science/dbMTS as a web service or as a downloadable companion database of dbNSFP.


Subject(s)
Databases, Genetic; MicroRNAs; Polymorphism, Single Nucleotide; 3' Untranslated Regions; Computational Biology; Humans; Internet; MicroRNAs/genetics; Software
6.
Data Min Knowl Discov ; 34(4): 980-1021, 2020 Jul.
Article in English | MEDLINE | ID: mdl-38390222

ABSTRACT

In recent years, the popularity of graph databases has grown rapidly. This paper focuses on the single-graph setting as an effective model to represent information, and on its related graph mining techniques. Frequent pattern mining in a single-graph setting involves two main problems: the support measure and the search scheme. In this paper, we propose a novel framework for designing support measures that brings together existing minimum-image-based and overlap-graph-based support measures. Our framework is built on the concept of occurrence/instance hypergraphs. Based on it, we design a series of new support measures, the minimum instance (MI) measure and the minimum vertex cover (MVC) measure, that combine the advantages of existing measures. More importantly, we show that the existing minimum-image-based support measure is an upper bound of the MI measure, which is also linear-time computable and yields counts that are close to the number of instances of a pattern. We show not only that most major existing support measures and the new measures proposed in this paper can be mapped into the new framework, but also that they occupy different locations on the frequency spectrum. By taking advantage of the new framework, we discover that MVC can be approximated to a constant factor (in terms of the number of pattern nodes) in polynomial time. In contrast to common belief, we demonstrate that the state-of-the-art overlap-graph-based maximum independent set (MIS) measure also has constant-factor approximation algorithms. We further show that, using standard linear programming and semidefinite programming techniques, polynomial-time relaxations for both the MVC and MIS measures can be developed, with counts that stand between MVC and MIS. In addition, we point out that MVC, MIS, and their relaxations are bounded within a constant factor of each other. In summary, all major support measures are unified in the new hypergraph-based framework, which helps reveal their bounding relations and hardness properties.
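The constant-factor approximability of MVC can be illustrated with the textbook maximal-matching 2-approximation on an ordinary graph; the paper's measure operates on instance hypergraphs, so this is only a simplified analogue.

```python
def vertex_cover_2approx(edges):
    """Classic maximal-matching 2-approximation for minimum vertex cover.

    Greedily pick any uncovered edge and take both endpoints; the result
    is at most twice the optimum, since every chosen edge forces at least
    one of its endpoints into any cover.
    """
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.update((u, v))  # take both endpoints of an uncovered edge
    return cover
```

The analogous guarantee for the hypergraph-based MVC measure is what the constant-factor approximation result in the abstract refers to.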

7.
Proceedings VLDB Endowment ; 14(4): 708-720, 2020 Dec.
Article in English | MEDLINE | ID: mdl-38260211

ABSTRACT

Relational join processing is one of the core functionalities in database management systems. It has been demonstrated that GPUs, as a general-purpose parallel computing platform, are very promising for processing relational joins. However, join algorithms often need to handle very large input data, an issue that was not sufficiently addressed in existing work. Besides, as more and more desktop and workstation platforms support multi-GPU environments, the combined computing capability of multiple GPUs can easily match that of a computing cluster. It is therefore worth exploring how join processing can benefit from multiple GPUs. We identify the low rate and complex patterns of data transfer between the CPU and GPUs as the main challenges in designing efficient algorithms for large table joins. To overcome these challenges, we propose three distinctive designs of multi-GPU join algorithms, namely the nested loop, global sort-merge, and hybrid joins, for large table joins with different join conditions. Extensive experiments running on multiple databases and two different hardware configurations demonstrate the high scalability of our algorithms over data size and the significant performance boost brought by the use of multiple GPUs. Furthermore, our algorithms achieve much better performance than existing join algorithms, with speedups of up to 25X and 2.8X over the best-known code developed for multi-core CPUs and GPUs, respectively.
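As a point of reference for the global sort-merge design, here is a single-threaded Python sketch of a sort-merge equi-join; the multi-GPU versions partition the sorted runs across devices, which this sketch does not attempt.

```python
def sort_merge_join(left, right):
    """Single-threaded sort-merge equi-join on (key, payload) tuples.

    Sort both inputs by key, then advance two cursors in lockstep,
    emitting the cross product of each pair of equal-key runs.
    """
    left = sorted(left, key=lambda t: t[0])
    right = sorted(right, key=lambda t: t[0])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            # re-scan the matching run on the right for this left row
            key, jj = left[i][0], j
            while jj < len(right) and right[jj][0] == key:
                out.append((key, left[i][1], right[jj][1]))
                jj += 1
            i += 1
    return out
```

In a multi-GPU setting, the sort phase is where most of the cross-device data transfer occurs, which is why transfer patterns dominate the design.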

8.
PLoS One ; 14(4): e0214720, 2019.
Article in English | MEDLINE | ID: mdl-30990851

ABSTRACT

The unrivaled computing capabilities of modern GPUs meet the demand of processing massive amounts of data seen in many application domains. While traditional HPC systems support applications as standalone entities that occupy entire GPUs, there are GPU-based DBMSs in which multiple tasks are meant to run at the same time on the same device. To that end, system-level resource management mechanisms are needed to fully unleash the computing power of GPUs in large-scale data processing, and some research has focused on this problem. In our previous work, we explored modeling of single compute-bound kernels on GPUs under NVIDIA's CUDA framework and provided an in-depth anatomy of NVIDIA's concurrent kernel execution mechanism (CUDA streams). This paper focuses on resource allocation for multiple GPU applications towards optimization of system throughput. Compared to earlier studies of enabling concurrent task support on GPUs, such as MultiQx-GPU, we take a different approach: we control the launching parameters of multiple GPU kernels, as provided by compile-time performance modeling, as a kernel-level optimization, and add a more general pre-processing model with batch-level control to further enhance performance. Specifically, we construct a variation of the multi-dimensional knapsack model to maximize concurrency in a multi-kernel environment. We present an in-depth analysis of our model and develop an algorithm based on dynamic programming to solve it. We prove that the algorithm finds optimal solutions (in terms of thread concurrency) and bears pseudopolynomial complexity in both time and space. These results are verified by extensive experiments on a microbenchmark that consists of real-world GPU queries. Furthermore, solutions identified by our method also significantly reduce the total running time of the workload, as compared to sequential and MultiQx-GPU executions.
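A minimal version of such a dynamic program can be sketched as a 0/1 knapsack over two resource budgets; the item costs and values below are illustrative stand-ins for per-kernel resource demands and concurrency gains, not the paper's actual model.

```python
def knapsack_2d(items, cap1, cap2):
    """0/1 knapsack over two resource budgets, solved by dynamic programming.

    Items are (value, cost1, cost2) tuples; think of the costs as two
    per-kernel resource demands (e.g. registers and shared memory -- our
    assumption) and the value as a concurrency gain. The table has
    (cap1+1) * (cap2+1) cells, giving the pseudopolynomial complexity
    the abstract refers to.
    """
    dp = [[0] * (cap2 + 1) for _ in range(cap1 + 1)]
    for value, c1, c2 in items:
        # iterate budgets downward so each item is used at most once
        for a in range(cap1, c1 - 1, -1):
            for b in range(cap2, c2 - 1, -1):
                dp[a][b] = max(dp[a][b], dp[a - c1][b - c2] + value)
    return dp[cap1][cap2]
```

The same table-filling pattern extends to more resource dimensions at the cost of one extra nested loop per dimension.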


Subject(s)
Databases, Factual; Computer Graphics; Software
9.
Molecules ; 24(6)2019 Mar 19.
Article in English | MEDLINE | ID: mdl-30893816

ABSTRACT

Drug-drug interaction (DDI) is becoming a serious issue in clinical pharmacy as the use of multiple medications becomes more common. The PubMed database is one of the largest literature resources for DDI studies. It contains over 150,000 journal articles related to DDIs and is still expanding at a rapid pace. The extraction of DDI-related information, including compounds and proteins, from PubMed is an essential step for DDI research. In this paper, we introduce a tool, CuDDI (compute unified device architecture-based DDI searching), for the identification of DDI-related terms (including compounds and proteins) from PubMed. There are three modules in this application: automatic retrieval of substances from PubMed, identification of DDI-related terms, and display of the relationships among DDI-related terms. For DDI term identification, a speedup of 30-105 times was observed for the compute unified device architecture (CUDA)-based version compared with a CPU-based Python implementation. CuDDI can be used to discover DDI-related terms and the relationships among these terms, which has the potential to help clinicians and pharmacists better understand the mechanisms of DDIs. CuDDI is available at: https://github.com/chengusf/CuDDI.


Subject(s)
Drug Interactions; PubMed; Algorithms; Data Mining; Humans; Publications; Software
10.
Proc IEEE Int Conf Big Data ; 2019: 533-542, 2019 Dec.
Article in English | MEDLINE | ID: mdl-38323298

ABSTRACT

Frequent subgraph mining (FSM) from graphs is an active subject in computer science research. One major challenge in FSM is the development of support measures, which are functions that map a pattern to its frequency count in a database. The current state of the art features a hypergraph-based framework for modeling pattern occurrences that unifies the two main flavors of support measures: the overlap-graph-based maximum independent set (MIS) measure and the minimum image/instance-based (MNI) measures. To explore the middle ground between these two groups and guide the development of new support measures, we present general sufficient conditions for designing new support measures in the hypergraph framework, which can be applied to MNI and other support measures that are not covered by the overlap-graph framework. We utilize the sufficient conditions to generalize the MNI and minimum-instance (MI) measures for designing user-defined linear-time measures. Furthermore, we show that a maximum independent subedge set (MISS) measure developed from the sufficient conditions can fill the gap between MIS and MI in computational complexity and support count.

11.
Biochim Biophys Acta Biomembr ; 1859(12): 2297-2307, 2017 Dec.
Article in English | MEDLINE | ID: mdl-28882547

ABSTRACT

Dissimilarities in the bulk structure of bilayers composed of ether- vs. ester-linked lipids are well established; however, the atomistic interactions responsible for these differences are not well known. These differences are important for understanding why archaea have a different bilayer composition than the other domains of life and why humans have larger concentrations of plasmalogens in specialized membranes. In this paper, we simulate two lipid bilayers, the ester-linked dipalmitoylphosphatidylcholine (DPPC) and the ether-linked dihexadecylphosphatidylcholine (DHPC), to study these variations. The structural analysis of the bilayers reveals that DPPC is more compressible than DHPC. A closer examination of the dipole potential shows that DHPC, despite having a smaller dipole potential across the bilayer, has a higher potential barrier than DPPC at the surface. Analysis of water order and dynamics suggests DHPC has a more ordered, less mobile layer of water in the headgroup region. These results seem to resolve the issue of whether the decrease in permeability of DHPC is due to differences in the minimum area per lipid (A0) or the diffusion coefficient of water in the headgroup region (Dhead) (Guler et al., 2009), since we have shown significant changes in the order and mobility of water in that region.


Subject(s)
1,2-Dipalmitoylphosphatidylcholine/chemistry; Lipid Bilayers/chemistry; Molecular Dynamics Simulation; Phospholipid Ethers/chemistry; Water/chemistry; Kinetics; Permeability; Static Electricity; Temperature; Thermodynamics
12.
PLoS One ; 12(4): e0173548, 2017.
Article in English | MEDLINE | ID: mdl-28422961

ABSTRACT

Identifying drug-drug interactions (DDIs) is an important topic for the development of safe pharmaceutical drugs and for the optimization of multidrug regimens for complex diseases such as cancer and HIV. There have been about 150,000 publications on DDIs in PubMed, which is a great resource for DDI studies. In this paper, we introduce an automatic computational method for the systematic analysis of the mechanisms of DDIs using MeSH (Medical Subject Headings) terms from the PubMed literature. MeSH is a controlled vocabulary thesaurus developed by the National Library of Medicine for indexing and annotating articles. Our method can effectively identify DDI-relevant MeSH terms, such as drugs, proteins, and phenomena, with high accuracy. The connections among these MeSH terms were investigated using co-occurrence heatmaps and social network analysis. Our approach can be used to visualize relationships among DDI terms, which has the potential to help users better understand DDIs. As the volume of PubMed records increases, our method for automatic analysis of DDIs from the PubMed database will become more accurate.


Subject(s)
Drug Interactions; Medical Subject Headings; Prescription Drugs/pharmacology; PubMed/statistics & numerical data; Algorithms; Cytochrome P-450 Enzyme System/metabolism; Humans; ROC Curve
13.
Proc ACM SIGMOD Int Conf Manag Data ; 2017: 391-402, 2017 May.
Article in English | MEDLINE | ID: mdl-38425568

ABSTRACT

In recent years, the popularity of graph databases has grown rapidly. This paper focuses on the single-graph setting as an effective model to represent information, and on its related graph mining techniques. Frequent pattern mining in a single-graph setting involves two main problems: the support measure and the search scheme. In this paper, we propose a novel framework for constructing support measures that brings together existing minimum-image-based and overlap-graph-based support measures. Our framework is built on the concept of occurrence/instance hypergraphs. Based on it, we present two new support measures, the minimum instance (MI) measure and the minimum vertex cover (MVC) measure, that combine the advantages of existing measures. In particular, we show that the existing minimum-image-based support measure is an upper bound of the MI measure, which is also linear-time computable and yields counts that are close to the number of instances of a pattern. Although the MVC measure is NP-hard, it can be approximated to a constant factor in polynomial time. We also provide polynomial-time relaxations for both measures and bounding theorems for all presented support measures in the hypergraph setting. We further show that the hypergraph-based framework can unify all support measures studied in this paper. The framework is also flexible in that more variants of support measures can be defined and profiled in it.

14.
Article in English | MEDLINE | ID: mdl-38298773

ABSTRACT

Processing relational joins on modern GPUs has attracted much attention in the past few years. With the rapid development of the hardware and software environment in the GPU world, existing GPU join algorithms designed for earlier architectures cannot make the most of the latest GPU products. In this paper, we report a new design and implementation of join algorithms with high performance under today's GPGPU environment. This is a key component of our scientific database engine named G-SDMS. In particular, we overhaul the popular radix hash join and redesign the sort-merge join algorithm on GPUs by applying a series of novel techniques to utilize the hardware capacity of the latest NVIDIA GPU architecture and new features of the CUDA programming framework. Our algorithms take advantage of the revised hardware arrangement, larger register files and shared memory, native atomic operations, dynamic parallelism, and CUDA streams. Experiments show that our new hash join algorithm is 2.0 to 14.6 times as efficient as existing GPU implementations, while the new sort-merge join achieves a speedup of 4.0X to 4.9X. Compared to the best CPU sort-merge join and hash join known to date, our optimized code achieves up to 10.5X and 5.5X speedup, respectively. Moreover, we extend our design to scenarios where large data tables cannot fit in GPU memory.
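The first pass of a radix hash join can be sketched on the CPU as a histogram-then-scatter partitioning step; on a GPU the two passes map to thread blocks using atomics or prefix sums, and the details here are generic rather than G-SDMS's actual code.

```python
def radix_partition(keys, bits=4):
    """Radix-partition a key column by its low-order bits.

    Pass 1 builds a histogram of bucket sizes, an exclusive prefix sum
    turns it into bucket offsets, and pass 2 scatters each key to its
    bucket. Returns the partitioned keys and the bucket offsets.
    """
    buckets = 1 << bits
    mask = buckets - 1
    hist = [0] * buckets
    for k in keys:                 # pass 1: histogram
        hist[k & mask] += 1
    offsets, total = [], 0
    for h in hist:                 # exclusive prefix sum over bucket sizes
        offsets.append(total)
        total += h
    out = [None] * len(keys)
    cursor = list(offsets)
    for k in keys:                 # pass 2: scatter into bucket slots
        b = k & mask
        out[cursor[b]] = k
        cursor[b] += 1
    return out, offsets
```

After partitioning, each bucket pair from the two tables is small enough to be joined independently, which is what makes the scheme parallel-friendly.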

15.
Sci Rep ; 5: 17357, 2015 Nov 27.
Article in English | MEDLINE | ID: mdl-26612138

ABSTRACT

Drug-drug interaction (DDI) is becoming a serious clinical safety issue as the use of multiple medications becomes more common. Searching the MEDLINE database for journal articles related to DDIs produces over 330,000 results. It is impossible to read and summarize these references manually. As the volume of biomedical references in the MEDLINE database continues to expand at a rapid pace, automatic identification of DDIs from the literature is becoming increasingly important. In this article, we present a random-sampling-based statistical algorithm to identify possible DDIs and the underlying mechanisms from the substances field of MEDLINE records. The substances terms are essentially carriers of compound (including protein) information in a MEDLINE record. Four case studies on warfarin, ibuprofen, furosemide, and sertraline showed that our method was able to rank possible DDIs with high accuracy (90.0% for warfarin, 83.3% for ibuprofen, 70.0% for furosemide, and 100% for sertraline in the top 10% of a list of compounds ranked by p-value). A social network analysis of substance terms was also performed to construct networks between proteins and drug pairs to elucidate how two drugs could interact.
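A random-sampling significance test of this flavor can be sketched as a Monte Carlo estimate; the paper's actual test statistic and sampling scheme are not reproduced here, so the function and parameter names below are illustrative assumptions.

```python
import random

def sampling_pvalue(observed, population, sample_size, trials=5000, seed=1):
    """Monte Carlo p-value: how often does a random sample of records
    mention a term at least as frequently as observed?

    `population` is a list of 0/1 flags (record mentions the term or not).
    A small p-value suggests the observed co-occurrence is unlikely to
    arise from random sampling alone.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if sum(rng.sample(population, sample_size)) >= observed:
            hits += 1
    return hits / trials
```

Ranking candidate compounds by such p-values is what produces the top-10% lists evaluated in the case studies.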


Subject(s)
Algorithms; Furosemide/therapeutic use; Ibuprofen/therapeutic use; Sertraline/therapeutic use; Warfarin/therapeutic use; Contraindications; Drug Interactions; Enzymes/metabolism; Humans; Hydrolysis; MEDLINE/statistics & numerical data
16.
J Big Data ; 2(1): 9, 2015.
Article in English | MEDLINE | ID: mdl-26069879

ABSTRACT

Molecular Simulation (MS) is a powerful tool for studying the physical and chemical features of large systems and has seen applications in many scientific and engineering domains. During the simulation process, experiments generate data on a very large number of atoms, whose spatial and temporal relationships are to be observed for scientific analysis. The sheer data volumes and their intensive interactions impose significant challenges for data access, management, and analysis. To date, existing MS software systems fall short on storage and handling of MS data, mainly because of the lack of a platform to support applications that involve intensive data access and analytical processing. In this paper, we present the database-centric molecular simulation (DCMS) system our team has developed in the past few years. The main idea behind DCMS is to store MS data in a relational database management system (DBMS) to take advantage of the declarative query interface (i.e., SQL), data access methods, query processing, and optimization mechanisms of modern DBMSs. A unique challenge is to handle the analytical queries that are often compute-intensive. For that, we developed novel indexing and query processing strategies (including algorithms running on modern co-processors) as integrated components of the DBMS. As a result, researchers can upload and analyze their data using efficient functions implemented inside the DBMS. Index structures are generated to store analysis results that may be of interest to other users, so that the results are readily available without duplicating the analysis. We have developed a prototype of DCMS based on the PostgreSQL system, and experiments using real MS data and workloads show that DCMS significantly outperforms existing MS software systems. We have also used it as a platform to test other data management issues such as security and compression.

17.
Proc Int Conf Web Inf Syst Eng ; 9419: 458-472, 2015 Nov.
Article in English | MEDLINE | ID: mdl-26997936

ABSTRACT

Privacy and usage restriction issues are important when valuable data are exchanged or acquired by different organizations. Standard access control mechanisms either restrict or completely grant access to valuable data. On the other hand, data obfuscation limits overall usability and may result in a loss of total value. There are no standard policy enforcement mechanisms for data acquired through mutual and copyright agreements. In practice, many different types of policies can be enforced to protect data privacy. Hence there is a need for a unified framework that encapsulates multiple suites of policies to protect the data. We present our vision of an architecture named the security automata model (SAM) to enforce privacy-preserving policies and usage restrictions. SAM analyzes input queries and their outputs to enforce various policies, liberating data owners from the burden of monitoring data access. SAM allows administrators to specify various policies and enforces them to monitor queries and control data access. Our goal is to address the problems of data usage control and protection through privacy policies that can be defined, enforced, and integrated with existing access control mechanisms using SAM. In this paper, we lay out the theoretical foundation of SAM, which is based on an automaton called the Mandatory Result Automaton. We also discuss the major challenges of implementing SAM in a real-world database environment, as well as ideas for meeting such challenges.

18.
IEEE Trans Knowl Data Eng ; 26(10): 2410-2424, 2014 Oct.
Article in English | MEDLINE | ID: mdl-25264418

ABSTRACT

This paper focuses on an important query in scientific simulation data analysis: the Spatial Distance Histogram (SDH). The computation time of an SDH query using the brute-force method is quadratic. Often, such queries are executed continuously over certain time periods, increasing the computation time. We propose a highly efficient approximate algorithm to compute SDHs over consecutive time periods with provable error bounds. The key idea of our algorithm is to derive the statistical distribution of distances from the spatial and temporal characteristics of particles. Upon organizing the data into a Quad-tree-based structure, the spatiotemporal characteristics of particles in each node of the tree are acquired to determine the particles' spatial distribution as well as their temporal locality in consecutive time periods. We also report our efforts in implementing and optimizing the above algorithm on Graphics Processing Units (GPUs) as a means to further improve efficiency. The accuracy and efficiency of the proposed algorithm are backed by mathematical analysis and the results of extensive experiments using data generated from real simulation studies.
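The quadratic baseline the paper improves on is easy to state; this 2-D Python sketch counts every pairwise distance into fixed-width buckets, which is exactly the O(n²) loop the Quad-tree approximation avoids.

```python
import math
from itertools import combinations

def sdh_brute_force(points, bucket_width, num_buckets):
    """Quadratic brute-force Spatial Distance Histogram.

    For every unordered pair of 2-D points, compute the Euclidean
    distance and count it into a fixed-width bucket (the last bucket
    absorbs any overflow).
    """
    hist = [0] * num_buckets
    for (x1, y1), (x2, y2) in combinations(points, 2):
        d = math.hypot(x2 - x1, y2 - y1)
        b = min(int(d // bucket_width), num_buckets - 1)
        hist[b] += 1
    return hist
```

With n particles there are n(n-1)/2 pairs per frame, so even modest simulations make this baseline impractical over many consecutive frames.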

19.
Proc IEEE Int Conf Big Data ; 2014: 301-310, 2014 Oct.
Article in English | MEDLINE | ID: mdl-26566545

ABSTRACT

A push-based database management system (DBMS) is a new type of data processing software that streams large volumes of data to concurrent query operators. The high data rates of such systems require large computing power from the query engine. In our previous work, we built a push-based DBMS named G-SDMS to harness the unrivaled computational capabilities of modern GPUs. A major design goal of G-SDMS is to support concurrent processing of heterogeneous query operations and enable resource allocation among such operations. Understanding the performance of operations as a result of resource consumption is thus a premise in the design of G-SDMS. With NVIDIA's CUDA framework as the system implementation platform, we present our recent work on performance modeling of CUDA kernels running concurrently under a runtime mechanism named CUDA streams. Specifically, we explore the connection between performance and resource occupancy of compute-bound kernels and develop a model that can predict the performance of such kernels. Furthermore, we provide an in-depth anatomy of the CUDA stream mechanism and summarize its main kernel scheduling disciplines. Our models and derived scheduling disciplines are verified by extensive experiments using synthetic and real-world CUDA kernels.

20.
Article in English | MEDLINE | ID: mdl-26973983

ABSTRACT

In this paper, we present a compression framework for molecular dynamics (MD) simulation data that yields significant compression performance by combining the strengths of principal component analysis (PCA) and the discrete cosine transform (DCT). Though it is a lossy compression technique, the effect on analytics performed on the decompressed data is minimal. Compression ratios of up to 13 are achieved with acceptable errors in the results of analytical functions.
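The DCT half of such a framework can be sketched in pure Python: transform, keep only the largest coefficients, and invert. The PCA stage and the paper's rule for how many coefficients to retain are omitted, so the `keep` parameter is an illustrative assumption.

```python
import math

def dct(x):
    """Orthonormal DCT-II of a 1-D signal (pure-Python sketch)."""
    N = len(x)
    out = []
    for k in range(N):
        s = sum(x[n] * math.cos(math.pi / N * (n + 0.5) * k) for n in range(N))
        scale = math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
        out.append(scale * s)
    return out

def idct(X):
    """Inverse (DCT-III) of the orthonormal DCT-II above."""
    N = len(X)
    return [
        sum((math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N))
            * X[k] * math.cos(math.pi / N * (n + 0.5) * k) for k in range(N))
        for n in range(N)
    ]

def compress(x, keep):
    """Lossy step: zero all but the `keep` largest-magnitude coefficients."""
    X = dct(x)
    ranked = sorted(range(len(X)), key=lambda k: abs(X[k]), reverse=True)
    kept = set(ranked[:keep])
    return [X[k] if k in kept else 0.0 for k in range(len(X))]
```

Because smooth trajectories concentrate their energy in a few low-frequency coefficients, discarding the rest loses little, which is the intuition behind the small analytics error reported.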
