Search | Biblioteca Virtual em Saúde

Humphries, Benjamin; Zhang, Hansen; Sheng, Jiayi; Landaverde, Raphael; Herbordt, Martin C.

Proc IEEE Int Symp Field Program Cust Comput Mach ; 2014: 68-71, 2014 May.

Article in English | MEDLINE | ID: mdl-26594666

ABSTRACT

The 3D FFT is critical in many physical simulations and image processing applications. On FPGAs, however, the 3D FFT was thought to be inefficient relative to other methods such as convolution-based implementations of multi-grid. We find the opposite: a simple design, operating at a conservative frequency, takes 4µs for 16³, 21µs for 32³, and 215µs for 64³ single precision data points. The first two of these compare favorably with the 25µs and 29µs obtained running on a current Nvidia GPU. More broadly, this is a critical piece in implementing a large scale FPGA-based MD engine: even a single FPGA is capable of keeping the FFT off the critical path for a large fraction of possible MD simulations.
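The timings quoted above can be turned into point counts and ratios with a few lines of arithmetic. The following is a minimal sketch, not part of the paper: it assumes that the 25µs and 29µs GPU figures pair, in order, with the 16³ and 32³ sizes, and the names `fpga_us`, `gpu_us`, `points`, and `speedup` are introduced here for illustration only.

```python
# Timings quoted in the abstract, in microseconds, keyed by grid edge length.
fpga_us = {16: 4.0, 32: 21.0, 64: 215.0}
# Assumption: the two GPU figures correspond, in order, to 16^3 and 32^3.
gpu_us = {16: 25.0, 32: 29.0}


def points(n):
    """Number of data points in an n x n x n grid."""
    return n ** 3


def speedup(n):
    """Ratio of the quoted GPU time to the quoted FPGA time for size n."""
    return gpu_us[n] / fpga_us[n]


for n in sorted(fpga_us):
    rate = points(n) / fpga_us[n]  # points processed per microsecond
    line = f"{n}^3: {points(n)} points, {rate:.0f} points/us on the FPGA"
    if n in gpu_us:
        line += f", GPU/FPGA time ratio {speedup(n):.2f}"
    print(line)
```

Under that pairing, the FPGA-to-GPU advantage is largest at the smallest size and shrinks at 32³. No GPU figure is quoted for 64³, so no ratio is computed for it.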

GPU Optimizations for a Production Molecular Docking Code.

Landaverde, Raphael; Herbordt, Martin C.

IEEE Conf High Perform Extreme Comput ; 2014, 2014 Sep.

Article in English | MEDLINE | ID: mdl-26594667

ABSTRACT

Modeling molecular docking is critical to both understanding life processes and designing new drugs. In previous work we created the first published GPU-accelerated docking code (PIPER) which achieved a roughly 5× speed-up over a contemporaneous 4-core CPU. Advances in GPU architecture and in the CPU code, however, have since reduced this relative performance by a factor of 10. In this paper we describe the upgrade of GPU PIPER. This required a complete rewrite, including algorithm changes and moving most remaining non-accelerated CPU code onto the GPU. The result is a 7× improvement in GPU performance and a 3.3× speedup over the CPU-only code. We find that this difference in time is almost entirely due to the difference in run times of the 3D FFT library functions on CPU (MKL) and GPU (cuFFT), respectively. The GPU code has been integrated into the ClusPro docking server which has over 4000 active users.
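The speedup figures in this abstract can be cross-checked against each other. The sketch below is one possible reading, not the authors' own calculation: it assumes the "factor of 10" applies directly to the original GPU-to-CPU ratio and that the 7× improvement is measured against that eroded baseline. All variable names are hypothetical.

```python
# Figures quoted in the abstract.
old_gpu_vs_old_cpu = 5.0  # "roughly 5x" speed-up in the earlier work
erosion = 10.0            # relative performance "reduced ... by a factor of 10"
rewrite_gain = 7.0        # "7x improvement in GPU performance"
reported = 3.3            # reported speedup over the CPU-only code

# Assumption: the erosion factor divides the original ratio directly,
# leaving the old GPU code at about half the speed of the current CPU code.
old_gpu_vs_new_cpu = old_gpu_vs_old_cpu / erosion

# Assumption: the rewrite gain multiplies that eroded ratio.
implied = old_gpu_vs_new_cpu * rewrite_gain

print(f"old GPU code vs current CPU code: {old_gpu_vs_new_cpu:.1f}x")
print(f"implied speedup after rewrite: {implied:.1f}x (reported: {reported}x)")
```

Under these assumptions the implied figure is close to the reported 3.3×, which is consistent with the starting point being only "roughly" 5×.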

An Investigation of Unified Memory Access Performance in CUDA.

Landaverde, Raphael; Zhang, Tiansheng; Coskun, Ayse K; Herbordt, Martin.

IEEE Conf High Perform Extreme Comput ; 2014, 2014 Sep.

Article in English | MEDLINE | ID: mdl-26594668

ABSTRACT

Managing memory between the CPU and GPU is a major challenge in GPU computing. A programming model, Unified Memory Access (UMA), has been recently introduced by Nvidia to simplify the complexities of memory management while claiming good overall performance. In this paper, we investigate this programming model and evaluate its performance and programming model simplifications based on our experimental results. We find that beyond on-demand data transfers to the CPU, the GPU is also able to request subsets of data it requires on demand. This feature allows UMA to outperform full data transfer methods for certain parallel applications and small data sizes. We also find, however, that for the majority of applications and memory access patterns, the performance overheads associated with UMA are significant, while the simplifications to the programming model restrict flexibility for adding future optimizations.
