ABSTRACT
Technological advances in the past decade, hardware and software alike, have made access to high-performance computing (HPC) easier than ever. We review these advances from a statistical computing perspective. Cloud computing makes access to supercomputers affordable. Deep learning software libraries make programming statistical algorithms easy and enable users to write code once and run it anywhere, from a laptop to a workstation with multiple graphics processing units (GPUs) or a supercomputer in a cloud. Highlighting how these developments benefit statisticians, we review recent optimization algorithms that are useful for high-dimensional models and can harness the power of HPC. Code snippets are provided to demonstrate the ease of programming. We also provide an easy-to-use distributed matrix data structure suitable for HPC. Employing this data structure, we illustrate various statistical applications including large-scale positron emission tomography and ℓ1-regularized Cox regression. Our examples easily scale up to an 8-GPU workstation and a 720-CPU-core cluster in a cloud. As a case in point, we analyze the onset of type-2 diabetes from the UK Biobank with 200,000 subjects and about 500,000 single nucleotide polymorphisms using the HPC ℓ1-regularized Cox regression. Fitting this half-million-variate model takes less than 45 minutes and reconfirms known associations. To our knowledge, this is the first demonstration of the feasibility of penalized regression of survival outcomes at this scale.
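The building block of any ℓ1-regularized fit is a proximal-gradient update with soft-thresholding. Below is a minimal numpy sketch of that update applied to the simpler lasso (least-squares) objective rather than the Cox partial likelihood used in the paper; the function names and toy data are purely illustrative, and the same dense array operations are what map naturally onto GPUs.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista_lasso(X, y, lam, step=None, n_iter=500):
    """Proximal-gradient (ISTA) solver for 0.5*||y - X b||^2 + lam*||b||_1."""
    n, p = X.shape
    if step is None:
        # 1/L, with L the spectral norm of X^T X (Lipschitz constant)
        step = 1.0 / np.linalg.norm(X, 2) ** 2
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y)          # gradient of the smooth part
        b = soft_threshold(b - step * grad, step * lam)
    return b

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
beta_true = np.zeros(50); beta_true[:5] = 2.0
y = X @ beta_true + 0.1 * rng.standard_normal(200)
print(ista_lasso(X, y, lam=5.0)[:8])      # first 5 coefficients recovered
```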
ABSTRACT
In recent years, filmmakers and other digital content creators have been eagerly turning to Three-Dimensional (3D) imaging technology. The creators of movies, games, and augmented reality applications are aware of this technology's advantages, possibilities, and new means of expression. The development of electronic and IT technologies enables ever-higher quality of the recorded 3D image and many possibilities for its correction and modification in post-production. However, preparing a correct 3D image that does not cause perception problems for the viewer is still a complex and demanding task. Therefore, planning and then ensuring the correct parameters and quality of the recorded 3D video is essential. Despite better post-production techniques, fixing errors in a captured image can be difficult, time-consuming, and sometimes impossible. The detection of errors typical for stereo vision related to the depth of the image (e.g., depth budget violation, stereoscopic window violation) during the recording allows for their correction already on the film set, e.g., by different scene layouts and/or different camera configurations. The paper presents a prototype of an independent, non-invasive diagnostic system that supports the film crew in the process of calibrating stereoscopic cameras, as well as analysing the 3D depth while working on a film set. The system acquires full HD video streams from professional cameras using Serial Digital Interface (SDI), synchronises them, and estimates and analyses the disparity map. Objective depth analysis using computer tools while recording scenes allows stereographers to immediately spot errors in the 3D image, primarily those related to the violation of the viewing comfort zone. The paper also describes an efficient method of analysing a 3D video using a Graphics Processing Unit (GPU). The main steps of the proposed solution are uncalibrated rectification and disparity map estimation. The algorithms selected and implemented for the needs of this system do not require knowledge of intrinsic and extrinsic camera parameters. Thus, they can be used in non-cooperative environments, such as a film set, where the camera configuration often changes. Both are implemented on a GPU to improve data processing efficiency. The paper presents the evaluation results of the algorithms' accuracy, as well as a comparison of the performance of two implementations, with and without GPU acceleration. The application of the described GPU-based method makes the system efficient and easy to use. The system can process a full HD video stream at a rate of several frames per second.
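The two main steps named above can be prototyped on a CPU with OpenCV. The sketch below assumes placeholder image files and standard OpenCV calls; it stands in for, rather than reproduces, the paper's own GPU implementations of uncalibrated rectification and disparity estimation.

```python
import cv2
import numpy as np

# Load a stereo pair (grayscale); these file names are placeholders.
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Uncalibrated rectification: match features, estimate the fundamental
# matrix, and derive rectifying homographies -- no camera parameters needed.
orb = cv2.ORB_create(2000)
k1, d1 = orb.detectAndCompute(left, None)
k2, d2 = orb.detectAndCompute(right, None)
matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)
pts1 = np.float32([k1[m.queryIdx].pt for m in matches])
pts2 = np.float32([k2[m.trainIdx].pt for m in matches])
F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC)
h, w = left.shape
_, H1, H2 = cv2.stereoRectifyUncalibrated(pts1[mask.ravel() == 1],
                                          pts2[mask.ravel() == 1], F, (w, h))
left_r = cv2.warpPerspective(left, H1, (w, h))
right_r = cv2.warpPerspective(right, H2, (w, h))

# Disparity estimation on the rectified pair (CPU semi-global matching here;
# the paper's pipeline runs its own GPU implementations of both stages).
sgbm = cv2.StereoSGBM_create(minDisparity=0, numDisparities=64, blockSize=5)
disparity = sgbm.compute(left_r, right_r).astype(np.float32) / 16.0
```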
Subjects
Algorithms, Three-Dimensional Imaging, Photography
ABSTRACT
This paper proposes a framework for wireless sensor data acquisition using a team of Unmanned Aerial Vehicles (UAVs). Scattered over a terrain, the sensors detect information about their surroundings and can transmit this information wirelessly over a short range. With no access to a terrestrial or satellite communication network to relay the information to, UAVs are used to visit the sensors and collect the data. The proposed framework uses an iterative k-means algorithm to group the sensors into clusters and to identify Download Points (DPs) where the UAVs hover to download the data. A Single-Source Shortest-Path (SSSP) algorithm is used to compute optimal paths between every pair of DPs, with a constraint to reduce the number of turns. A genetic algorithm supplemented with a 2-opt local search heuristic is used to solve the multi-travelling-salesperson problem and to find optimized tours for each UAV. Finally, a collision avoidance strategy is implemented to guarantee collision-free trajectories. To reduce the overall runtime of the framework, the SSSP algorithm is implemented in parallel on a graphics processing unit. The proposed framework is tested in simulation using three UAVs and realistic 3D maps with up to 100 sensors and runs in just 20.7 s, a 33.3× speed-up compared to a sequential execution on CPU. The results show that the proposed method is efficient at calculating optimized trajectories for the UAVs for data acquisition from wireless sensors. The results also show the significant advantage of the parallel implementation on GPU.
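The clustering stage is straightforward to illustrate. Here is a sketch assuming plain Lloyd's k-means on 2D sensor coordinates (the paper's iterative variant adds constraints such as communication range), with each centroid serving as a candidate download point.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's k-means; returns cluster labels and centroids."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(n_iter):
        # assign each sensor to its nearest download point
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each download point to the mean of its assigned sensors
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

sensors = np.random.default_rng(1).uniform(0, 1000, size=(100, 2))  # 100 sensors
labels, dps = kmeans(sensors, k=3)   # one cluster per UAV; centroids ~ hover points
print(dps)
```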
ABSTRACT
BACKGROUND: The identification of all matches of a large set of position weight matrices (PWMs) in long DNA sequences requires significant computational resources, for which a number of efficient yet complex algorithms have been proposed. RESULTS: We propose BLAMM, a simple and efficient tool inspired by high-performance computing techniques. The workload is expressed in terms of matrix-matrix products that are evaluated with high efficiency using optimized BLAS library implementations. The algorithm is easy to parallelize and implement on CPUs and GPUs and has a runtime that is independent of the selected p-value. In terms of single-core performance, it is competitive with state-of-the-art software for PWM matching while being much more efficient when using multithreading. Additionally, BLAMM requires negligible memory. For example, both strands of the entire human genome can be scanned for 1404 PWMs in the JASPAR database in 13 min with a p-value of 10⁻⁴ using a 36-core machine. On a dual GPU system, the same task can be performed in under 5 min. CONCLUSIONS: BLAMM is an efficient tool for identifying PWM matches in large DNA sequences. Its C++ source code is available under the GNU General Public License Version 3 at https://github.com/biointec/blamm.
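To make the matrix-product formulation concrete, one way (an illustrative simplification, not BLAMM's exact blocking or p-value handling) is to one-hot encode sequence windows into one matrix and flattened PWMs into another, so a single BLAS-backed product yields all match scores.

```python
import numpy as np

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq):
    """Encode a DNA string as an (L, 4) one-hot matrix."""
    M = np.zeros((len(seq), 4))
    for i, c in enumerate(seq):
        M[i, BASES[c]] = 1.0
    return M

def pwm_scores(seq, pwms):
    """Score every window of the sequence against every PWM of length m.

    Each window is flattened to a 4m-vector; stacking windows gives a
    (num_windows, 4m) matrix W, and stacking flattened PWMs gives a
    (4m, num_pwms) matrix P, so all scores come from one product W @ P.
    """
    m = pwms[0].shape[0]
    S = one_hot(seq)
    W = np.stack([S[i:i + m].ravel() for i in range(len(seq) - m + 1)])
    P = np.stack([p.ravel() for p in pwms], axis=1)
    return W @ P   # (num_windows, num_pwms), evaluated by BLAS

pwm = np.log(np.array([[0.7, 0.1, 0.1, 0.1],   # toy 3-position weight matrix
                       [0.1, 0.7, 0.1, 0.1],
                       [0.1, 0.1, 0.7, 0.1]]) / 0.25)
print(pwm_scores("ACGTACG", [pwm]))
```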
Subjects
Algorithms, User-Computer Interface, Computational Methodologies, Humans, Position-Specific Scoring Matrices
ABSTRACT
BACKGROUND: The data deluge can leverage sophisticated ML techniques for functionally annotating the regulatory non-coding genome. The challenge lies in selecting the appropriate classifier for the specific functional annotation problem, within the bounds of the hardware constraints and the model's complexity. In our system AIKYATAN, we annotate distal epigenomic regulatory sites, e.g., enhancers. Specifically, we develop a binary classifier that classifies genome sequences as distal regulatory regions or not, given their histone modifications' combinatorial signatures. This problem is challenging because the regulatory regions are distal to the genes, with diverse signatures across classes (e.g., enhancers and insulators) and even within each class (e.g., different enhancer sub-classes). RESULTS: We develop a suite of ML models, under the banner AIKYATAN, including SVM models, random forest variants, and deep learning architectures, for distal regulatory element (DRE) detection. We demonstrate, with strong empirical evidence, that deep learning approaches have a computational advantage. Moreover, convolutional neural networks (CNN) provide the best-in-class accuracy, superior to the vanilla variant. With the human embryonic cell line H1, CNN achieves an accuracy of 97.9% and an order of magnitude lower runtime than the kernel SVM. Running on a GPU, the training time is sped up 21× and 30× (over CPU) for DNN and CNN, respectively. Finally, our CNN model enjoys superior prediction performance vis-à-vis the competition. Specifically, AIKYATAN-CNN achieved a 40% higher validation rate versus CSIANN and the same accuracy as RFECS. CONCLUSIONS: Our exhaustive experiments using an array of ML tools validate the need for a model that is not only expressive but can scale with increasing data volumes and diversity. In addition, a subset of these datasets have image-like properties and benefit from spatial pooling of features. Our AIKYATAN suite leverages diverse epigenomic datasets that can then be modeled using CNNs with optimized activation and pooling functions. The goal is to capture the salient features of the integrated epigenomic datasets for deciphering the distal (non-coding) regulatory elements, which have been found to be associated with functional variants. Our source code will be made publicly available at: https://bitbucket.org/cellsandmachines/aikyatan.
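As a rough sketch of this kind of architecture (not AIKYATAN's actual network; the number of histone marks, bins, and layer sizes below are assumed for illustration), a 1D CNN with spatial max-pooling over binned histone-modification signals could look like the following PyTorch snippet.

```python
import torch
import torch.nn as nn

# Assumed input: 5 histone marks binned into 20 positions around a site.
N_MARKS, N_BINS = 5, 20

model = nn.Sequential(
    nn.Conv1d(N_MARKS, 16, kernel_size=5, padding=2),  # scan marks jointly
    nn.ReLU(),
    nn.AdaptiveMaxPool1d(1),                           # spatial pooling
    nn.Flatten(),
    nn.Linear(16, 1),                                  # logit: DRE or not
)

x = torch.randn(8, N_MARKS, N_BINS)       # a batch of 8 candidate regions
y = torch.randint(0, 2, (8, 1)).float()   # 1 = distal regulatory element
loss = nn.BCEWithLogitsLoss()(model(x), y)
loss.backward()                           # train with any optimizer; move the
                                          # model and tensors to .cuda() for GPU
```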
Subjects
Chromosome Mapping/methods, Deep Learning, Epigenomics/methods, Nucleic Acid Regulatory Sequences, Software, Cell Line, Humans
ABSTRACT
Several studies in Bioinformatics, Computational Biology and Systems Biology rely on the definition of physico-chemical or mathematical models of biological systems at different scales and levels of complexity, ranging from the interaction of atoms in single molecules up to genome-wide interaction networks. Traditional computational methods and software tools developed in these research fields share a common trait: they can be computationally demanding on Central Processing Units (CPUs), therefore limiting their applicability in many circumstances. To overcome this issue, general-purpose Graphics Processing Units (GPUs) are attracting increasing attention from the scientific community, as they can considerably reduce the running time required by standard CPU-based software and allow more intensive investigations of biological systems. In this review, we present a collection of GPU tools recently developed to perform computational analyses in life science disciplines, emphasizing the advantages and the drawbacks in the use of these parallel architectures. The complete list of GPU-powered tools here reviewed is available at http://bit.ly/gputools.
Subjects
Systems Biology, Algorithms, Computer Graphics, Software
ABSTRACT
Road transportation is the backbone of modern economies, yet it annually costs 1.25 million deaths and trillions of dollars to the global economy, and damages public health and the environment. Deep learning is among the leading-edge methods used for transportation-related predictions; however, the existing works are in their infancy and fall short in multiple respects, including the use of datasets with limited sizes and scopes and insufficient depth of the deep learning studies. This paper provides a novel and comprehensive approach toward large-scale, faster, and real-time traffic prediction by bringing four complementary cutting-edge technologies together: big data, deep learning, in-memory computing, and Graphics Processing Units (GPUs). We trained deep networks using over 11 years of data provided by the California Department of Transportation (Caltrans), the largest dataset that has been used in deep learning studies. Several combinations of the input attributes of the data, along with various network configurations of the deep learning models, were investigated for training and prediction purposes. The use of the pre-trained model for real-time prediction was explored. The paper contributes novel deep learning models, algorithms, implementation, analytics methodology, and software tool for smart cities, big data, high performance computing, and their convergence.
ABSTRACT
The most widely used quantum-chemical models for excited states are single-excitation theories, a category that includes configuration interaction with single substitutions, time-dependent density functional theory, and a recently developed ab initio exciton model. When a large number of excited states are desired, these calculations incur a significant bottleneck in the "digestion" step, in which two-electron integrals are contracted with density or density-like matrices. We present an implementation that moves this step onto graphics processing units (GPUs) and introduce a double-buffer scheme that minimizes latency by computing integrals on the central processing units (CPUs) concurrently with their digestion on the GPUs. An automatic code generation scheme simplifies the implementation of high-performance GPU kernels. For the exciton model, which requires separate excited-state calculations on each electronically coupled chromophore, the heterogeneous implementation described here results in speedups of 2-6× versus a CPU-only implementation. For traditional time-dependent density functional theory calculations, we obtain speedups of up to 5× when a large number of excited states is computed.
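The double-buffer scheme is independent of the underlying chemistry: while one batch of integrals is being digested, the next batch is already being produced. Here is a toy Python sketch of that producer-consumer overlap, with stand-in workloads in place of real integral and GPU code.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def compute_integrals(batch_id):
    """CPU producer: stand-in for generating one batch of two-electron integrals."""
    rng = np.random.default_rng(batch_id)
    return rng.standard_normal((256, 256))

def digest(block, density):
    """GPU consumer: stand-in for contracting integrals with density matrices."""
    return float(np.sum(block * density))

density = np.ones((256, 256))
total = 0.0
with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(compute_integrals, 0)         # fill the first buffer
    for k in range(1, 8):
        block = future.result()                        # buffer ready to digest
        future = pool.submit(compute_integrals, k)     # produce the next batch...
        total += digest(block, density)                # ...while digesting this one
    total += digest(future.result(), density)          # drain the last buffer
print(total)
```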
ABSTRACT
The use of hyperspectral imaging (HSI) in the medical field is an emerging approach to assist physicians in diagnostic or surgical guidance tasks. However, HSI data processing involves very high computational requirements due to the huge amount of information captured by the sensors. One of the stages with the highest computational load is the K-Nearest Neighbors (KNN) filtering algorithm. The main goal of this study is to optimize and parallelize the KNN algorithm by exploiting GPU technology to obtain real-time processing during brain cancer surgical procedures. This parallel version of the KNN performs the neighbor filtering of a classification map (obtained from a supervised classifier), evaluating the different classes simultaneously. The undertaken optimizations and the computational capabilities of the GPU device yield a speedup of up to 66.18× when compared to a sequential implementation.
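A brute-force version of the filtering step conveys the idea: each pixel's class probabilities are averaged over its K nearest neighbors in a joint spatial-intensity feature space. The feature construction below is an assumption for illustration; the real pipeline derives it from the HSI cube and parallelizes the per-pixel search on the GPU.

```python
import numpy as np

def knn_filter(prob_map, guide, k=8, lam=1.0):
    """Smooth per-pixel class probabilities by averaging over the K nearest
    neighbours in a joint (row, column, intensity) feature space.

    prob_map: (H, W, C) class probabilities from a supervised classifier
    guide:    (H, W) guide image (e.g., a one-band projection of the HSI cube)
    """
    H, W, C = prob_map.shape
    ys, xs = np.mgrid[0:H, 0:W]
    feats = np.stack([ys.ravel(), xs.ravel(), lam * guide.ravel()], axis=1)
    probs = prob_map.reshape(-1, C)
    out = np.empty_like(probs)
    for i in range(len(feats)):                      # brute force; the GPU version
        d = np.sum((feats - feats[i]) ** 2, axis=1)  # parallelises this per pixel
        nn = np.argpartition(d, k)[:k]
        out[i] = probs[nn].mean(axis=0)
    return out.reshape(H, W, C)

rng = np.random.default_rng(0)
smoothed = knn_filter(rng.random((16, 16, 4)), rng.random((16, 16)))
print(smoothed.argmax(axis=2))                       # filtered classification map
```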
Subjects
Algorithms, Brain Neoplasms/classification, Brain Neoplasms/diagnostic imaging, Computer Systems, Brain, Cluster Analysis, Humans
ABSTRACT
The capabilities of polarizable force fields for alchemical free energy calculations have been limited by the high computational cost and complexity of the underlying potential energy functions. In this work, we present a GPU-based general alchemical free energy simulation platform for the polarizable AMOEBA potential. Tinker-OpenMM, the OpenMM implementation of the AMOEBA simulation engine, has been modified to enable both absolute and relative alchemical simulations on GPUs, which leads to a ~200-fold improvement in simulation speed over a single CPU core. We show that free energy values calculated using this platform agree with the results of Tinker simulations for the hydration of organic compounds and the binding of host-guest systems within the statistical errors. In addition to absolute binding, we designed a relative alchemical approach for computing relative binding affinities of ligands to the same host, where a special path was applied to avoid numerical instability due to polarization between the different ligands that bind to the same site. This scheme is general and does not require ligands to have similar scaffolds. We show that relative hydration and binding free energies calculated using this approach match those computed from the absolute free energy approach.
Subjects
Computer Graphics, Chemical Models, Molecular Dynamics Simulation, Thermodynamics, Ligands
ABSTRACT
The kernel RX (KRX) detector proposed by Kwon and Nasrabadi exploits a kernel function to obtain better detection performance. However, it still has two limitations that can be addressed. On the one hand, reasonable integration of spatial-spectral information can be used to further improve its detection accuracy. On the other hand, parallel computing can be used to reduce the processing time of available KRX detectors. Accordingly, this paper presents a novel weighted spatial-spectral kernel RX (WSSKRX) detector and its parallel implementation on graphics processing units (GPUs). The WSSKRX utilizes spatial neighborhood resources to reconstruct the testing pixels by introducing a spectral factor and a spatial window, thereby effectively reducing the interference of background noise. Then, the kernel function is redesigned as a mapping trick in a KRX detector to implement the anomaly detection. In addition, a powerful architecture based on the GPU technique is designed to accelerate WSSKRX. To substantiate the performance of the proposed algorithm, experiments are conducted on both synthetic and real data.
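For orientation, the classical RX detector that KRX kernelizes scores each pixel by its Mahalanobis distance from the background statistics. Below is a compact numpy sketch of that baseline (ours, without the kernel mapping, spectral factor, or spatial window that WSSKRX adds).

```python
import numpy as np

def rx_scores(cube):
    """Classical (global) RX anomaly detector: Mahalanobis distance of each
    pixel spectrum from the background mean. KRX replaces these inner
    products with kernel evaluations to capture nonlinear structure.

    cube: (H, W, B) hyperspectral image with B spectral bands
    """
    H, W, B = cube.shape
    X = cube.reshape(-1, B)
    mu = X.mean(axis=0)
    Xc = X - mu
    cov_inv = np.linalg.pinv(Xc.T @ Xc / len(X))    # background covariance
    scores = np.einsum("ij,jk,ik->i", Xc, cov_inv, Xc)
    return scores.reshape(H, W)

rng = np.random.default_rng(0)
cube = rng.normal(size=(32, 32, 10))
cube[5, 5] += 6.0                                   # implant one anomalous pixel
print(np.unravel_index(rx_scores(cube).argmax(), (32, 32)))  # -> (5, 5)
```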
ABSTRACT
BACKGROUND: During library construction, polymerase chain reaction is used to enrich the DNA before sequencing. Typically, this process generates duplicate read sequences. Removal of these artifacts is mandatory, as they can affect the correct interpretation of data in several analyses. Ideally, duplicate reads should be characterized by identical nucleotide sequences. However, due to sequencing errors, duplicates may also be nearly identical. Removing nearly identical duplicates can require notable computational effort. To deal with this challenge, we recently proposed a GPU method aimed at removing identical and nearly identical duplicates generated with an Illumina platform. The method implements an approach based on prefix-suffix comparison. Read sequences with an identical prefix are considered potential duplicates. Then, their suffixes are compared to identify and remove those that are actually duplicated. Although the method can be efficiently used to remove duplicates, there are some limitations that need to be overcome. In particular, it cannot detect potential duplicates whose prefixes are longer than 27 bases, and it does not provide support for paired-end read libraries. Moreover, large clusters of potential duplicates are split into smaller ones to guarantee a reasonable computing time. This heuristic may affect the accuracy of the analysis. RESULTS: In this work we propose GPU-DupRemoval, a new implementation of our method able to (i) cluster reads without constraints on the maximum length of the prefixes, (ii) support both single- and paired-end read libraries, and (iii) analyze large clusters of potential duplicates. CONCLUSIONS: Due to the massive parallelization obtained by exploiting graphics cards, GPU-DupRemoval removes duplicate reads faster than other cutting-edge solutions, while outperforming most of them in terms of the number of duplicate reads removed.
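The prefix-suffix idea reduces to a few lines in pure Python. This is a sequential sketch with assumed parameters, omitting the GPU parallelization, hashing details, and paired-end support of GPU-DupRemoval.

```python
from collections import defaultdict

def remove_near_duplicates(reads, prefix_len=16, max_mismatches=2):
    """Cluster reads by an exact prefix, then keep one representative per
    group of reads whose suffixes differ by at most `max_mismatches` bases."""
    clusters = defaultdict(list)
    for r in reads:
        clusters[r[:prefix_len]].append(r)   # candidate duplicates share a prefix
    kept = []
    for group in clusters.values():
        reps = []
        for r in group:
            suffix = r[prefix_len:]
            if not any(sum(a != b for a, b in zip(suffix, s)) <= max_mismatches
                       for s in reps):
                reps.append(suffix)
        kept.extend(group[0][:prefix_len] + s for s in reps)
    return kept

reads = ["ACGTACGTACGTACGTTTTT",
         "ACGTACGTACGTACGTTTTA",   # near-duplicate of the first (1 mismatch)
         "ACGTACGTACGTACGTCCCC"]
print(remove_near_duplicates(reads))   # two reads survive
```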
Subjects
Computational Biology/methods, DNA/genetics, DNA Sequence Analysis/methods, Algorithms, Polymerase Chain Reaction
ABSTRACT
We apply multireference electronic structure calculations to demonstrate the presence of conical intersections between the ground and the first excited electronic states of three silicon nanocrystals containing defects characteristic of the oxidized silicon surface. These intersections are accessible upon excitation at visible wavelengths and are predicted to facilitate nonradiative recombination with a rate that increases with decreasing particle size. This work illustrates a new framework for identifying defects responsible for nonradiative recombination.
ABSTRACT
A custom code for molecular dynamics simulations has been designed to run on CUDA-enabled NVIDIA graphics processing units (GPUs). The double-precision code simulates multicomponent fluids, with intramolecular and intermolecular forces, coarse-grained and atomistic models, holonomic constraints, Nosé-Hoover thermostats, and the generation of distribution functions. Algorithms to compute Lennard-Jones and Gay-Berne interactions, and the electrostatic force using Ewald summations, are discussed. A neighbor list is introduced to improve scaling with respect to system size. Three test systems are examined: SPC/E water; an n-hexane/2-propanol mixture; and a liquid crystal mesogen, 2-(4-butyloxyphenyl)-5-octyloxypyrimidine. Code performance is analyzed for each system. With one GPU, a 33-119-fold increase in performance is achieved compared with the serial code, while the use of two GPUs leads to a 69-287-fold improvement and three GPUs yield a 101-377-fold speedup.
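Of the interactions listed, the Lennard-Jones force is the simplest to sketch. Here is a vectorized numpy version with the minimum-image convention (toy parameters; a production code adds cutoffs, neighbor lists, and the other force terms).

```python
import numpy as np

def lj_forces(pos, box, eps=1.0, sigma=1.0):
    """Pairwise Lennard-Jones forces with the minimum-image convention.
    pos: (N, 3) particle coordinates, box: cubic box edge length."""
    disp = pos[:, None, :] - pos[None, :, :]          # r_i - r_j for all pairs
    disp -= box * np.round(disp / box)                # minimum image
    r2 = np.sum(disp ** 2, axis=2)
    np.fill_diagonal(r2, np.inf)                      # no self-interaction
    inv_r6 = (sigma ** 2 / r2) ** 3
    # F_ij = 24*eps*(2*(sigma/r)^12 - (sigma/r)^6) * r_vec / r^2
    fmag = 24.0 * eps * (2.0 * inv_r6 ** 2 - inv_r6) / r2
    return np.sum(fmag[:, :, None] * disp, axis=1)    # (N, 3) net force on each i

rng = np.random.default_rng(0)
print(lj_forces(rng.uniform(0, 5, (8, 3)), box=5.0))
```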
ABSTRACT
Infants born preterm or small for gestational age have elevated rates of morbidity and mortality. Using birth certificate records in Texas from 2002 to 2004 and Environmental Protection Agency air pollution estimates, we relate the quantile functions of birth weight and gestational age to ozone exposure and multiple predictors, including parental age, race, and education level. We introduce a semi-parametric Bayesian quantile approach that models the full quantile function rather than just a few quantile levels. Our multilevel quantile function model establishes relationships between birth weight and the predictors separately for each week of gestational age and between gestational age and the predictors separately across Texas Public Health Regions. We permit these relationships to vary nonlinearly across gestational age, spatial domain and quantile level and we unite them in a hierarchical model via a basis expansion on the regression coefficients that preserves interpretability. Very low birth weight is a primary concern, so we leverage extreme value theory to supplement our model in the tail of the distribution. Gestational ages are recorded in completed weeks of gestation (integer-valued), so we present methodology for modeling quantile functions of discrete response data. In a simulation study we show that pooling information across gestational age and quantile level substantially reduces MSE of predictor effects. We find that ozone is negatively associated with the lower tail of gestational age in south Texas and across the distribution of birth weight for high gestational ages. Our methods are available in the R package BSquare.
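The semi-parametric Bayesian model above is much richer than anything shown here, but the building block of all quantile methods is the pinball (check) loss. Below is a frequentist numpy sketch, fitting three quantile levels of a heteroscedastic toy outcome by subgradient descent.

```python
import numpy as np

def quantile_regression(X, y, tau, lr=0.05, n_iter=2000):
    """Linear quantile regression for level tau by (sub)gradient descent on
    the pinball loss  rho_tau(u) = u * (tau - 1{u < 0})."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        u = y - X @ beta
        grad = -X.T @ (tau - (u < 0)) / n   # subgradient of the pinball loss
        beta -= lr * grad
    return beta

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.uniform(0, 1, 500)])
y = X[:, 1] + rng.standard_normal(500) * (0.2 + 0.5 * X[:, 1])  # heteroscedastic
for tau in (0.1, 0.5, 0.9):
    print(tau, quantile_regression(X, y, tau).round(2))  # slopes fan out with tau
```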
Subjects
Infant Mortality, Premature Infant, Small-for-Gestational-Age Infant, Air Pollutants/adverse effects, Bayes Theorem, Biometry, Birth Weight, Computer Simulation, Female, Gestational Age, Humans, Infant, Newborn Infant, Male, Statistical Models, Ozone/adverse effects, Pregnancy, Regression Analysis, Texas/epidemiology
ABSTRACT
PURPOSE: To develop a fast patient-specific analytical estimator of first-order Compton and Rayleigh scatter in cone-beam computed tomography, implemented using graphics processing units. METHODS: The authors developed an analytical estimator for first-order Compton and Rayleigh scatter in a cone-beam computed tomography geometry. The estimator was coded using NVIDIA's CUDA environment for execution on an NVIDIA graphics processing unit. Performance of the analytical estimator was validated by comparison with high-count Monte Carlo simulations for two different numerical phantoms. Monoenergetic analytical simulations were compared with monoenergetic and polyenergetic Monte Carlo simulations. Analytical and Monte Carlo scatter estimates were compared both qualitatively, from visual inspection of images and profiles, and quantitatively, using a scaled root-mean-square difference metric. Reconstruction of simulated cone-beam projection data of an anthropomorphic breast phantom illustrated the potential of this method as a component of a scatter correction algorithm. RESULTS: The monoenergetic analytical and Monte Carlo scatter estimates showed very good agreement. The monoenergetic analytical estimates showed good agreement for Compton single scatter and reasonable agreement for Rayleigh single scatter when compared with polyenergetic Monte Carlo estimates. For a voxelized phantom with dimensions 128 × 128 × 128 voxels and a detector with 256 × 256 pixels, the analytical estimator required 669 seconds for a single projection, using a single NVIDIA 9800 GX2 video card. Accounting for first-order scatter in cone-beam image reconstruction improves the contrast-to-noise ratio of the reconstructed images. CONCLUSION: The analytical scatter estimator, implemented using graphics processing units, provides rapid and accurate estimates of single scatter and, with further acceleration and a method to account for multiple scatter, may be useful for practical scatter correction schemes.
Subjects
Cone-Beam Computed Tomography/methods, Computer-Assisted Image Processing/methods, Female, Humans, Mammography, Biological Models, Monte Carlo Method, Imaging Phantoms, Reproducibility of Results, Radiation Scattering
ABSTRACT
Using a grid-based method to search for the critical points in the electron density, we show how to accelerate such a method with graphics processing units (GPUs). When the GPU implementation is contrasted with its CPU counterpart, we find a large difference in elapsed time, with the GPU implementation being the faster of the two. We tested two GPUs, one marketed for video games and the other designed for high-performance computing (HPC). On the CPU side, two processors were tested, one found in common personal computers and the other used for HPC, both of the latest generation. Although our parallel algorithm scales quite well on CPUs, the same implementation on GPUs runs around 10× faster than on 16 CPUs, for any of the tested GPU/CPU combinations. We found that a GPU marketed for video games can be used for our application without any problem, delivering remarkable performance; in fact, this GPU competes with the HPC GPU, particularly when single precision is used.
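On a grid, the search itself is simple and embarrassingly parallel: every voxel is tested independently for a vanishing density gradient. Here is a numpy sketch with an assumed tolerance and a toy Gaussian density.

```python
import numpy as np

def grid_critical_points(rho, spacing=1.0, tol=1e-2, floor=0.05):
    """Flag grid voxels where |grad rho| < tol (and rho is non-negligible):
    candidate critical points of a density sampled on a regular 3D grid.
    The floor prunes the low-density tail, where the gradient also vanishes.
    Each voxel's test is independent, which makes the search easy to parallelize."""
    gx, gy, gz = np.gradient(rho, spacing)
    gmag = np.sqrt(gx**2 + gy**2 + gz**2)
    return np.argwhere((gmag < tol) & (rho > floor))

# Toy "density": one Gaussian; its maximum at the grid center is a critical point.
ax = np.linspace(-3, 3, 41)
X, Y, Z = np.meshgrid(ax, ax, ax, indexing="ij")
rho = np.exp(-(X**2 + Y**2 + Z**2))
print(grid_critical_points(rho, spacing=ax[1] - ax[0]))   # -> [[20 20 20]]
```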
ABSTRACT
Training Deep Neural Networks (DNNs) places immense compute requirements on the underlying hardware platforms, expending large amounts of time and energy. An important factor contributing to the long training times is the increasing dataset complexity required to reach state-of-the-art performance in real-world applications. To address this challenge, we explore the use of input mixing, where multiple inputs are combined into a single composite input with an associated composite label for training. The goal is for training on the mixed input to achieve a similar effect as training separately on each of the constituent inputs that it represents. This results in a lower number of inputs (or mini-batches) to be processed in each epoch, proportionally reducing training time. We find that naive input mixing leads to a considerable drop in learning performance and model accuracy due to interference between the forward/backward propagation of the mixed inputs. We propose two strategies to address this challenge and realize training speedups from input mixing with minimal impact on accuracy. First, we reduce the impact of inter-input interference by exploiting the spatial separation between the features of the constituent inputs in the network's intermediate representations. We also adaptively vary the mixing ratio of constituent inputs based on their loss in previous epochs. Second, we propose heuristics to automatically identify the subset of the training dataset that is subject to mixing in each epoch. Across ResNets of varying depth, MobileNetV2 and two Vision Transformer networks, we obtain up to 1.6× and 1.8× speedups in training for the ImageNet and Cifar10 datasets, respectively, on an Nvidia RTX 2080Ti GPU, with negligible loss in classification accuracy.
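A sketch of the naive baseline (the paper's contribution is precisely the interference-reduction and subset-selection strategies layered on top of it): pairs of examples and their one-hot labels are blended, so a batch of B examples costs only B/2 forward/backward passes. PyTorch, with assumed shapes.

```python
import torch
import torch.nn.functional as F

def mix_batch(x, y, num_classes, alpha=0.2):
    """Naive input mixing: fuse example 2i with example 2i+1 into one composite
    input and composite (soft) label, so B examples cost B/2 training passes."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    xa, xb = x[0::2], x[1::2]
    ya = F.one_hot(y[0::2], num_classes).float()
    yb = F.one_hot(y[1::2], num_classes).float()
    return lam * xa + (1 - lam) * xb, lam * ya + (1 - lam) * yb

x = torch.randn(8, 3, 32, 32)               # a CIFAR-sized batch of 8
y = torch.randint(0, 10, (8,))
x_mix, y_mix = mix_batch(x, y, num_classes=10)
print(x_mix.shape, y_mix.shape)             # 4 composite inputs represent 8 examples
```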
ABSTRACT
Nowadays, molecular dynamics (MD) simulations of proteins with hundreds of thousands of snapshots are commonly produced using modern GPUs. However, due to the abundance of data, analyzing the transport tunnels present in the internal voids of these molecules across all generated snapshots has become challenging. Here, we propose combining CAVER3, the most popular tool for tunnel calculation, with the TransportTools Python3 library in a divide-and-conquer approach to speed up tunnel calculation and reduce the hardware resources required to analyze long MD simulations in detail. By slicing an MD trajectory into smaller pieces and performing a tunnel analysis on these pieces with CAVER3, the runtime and resources are considerably reduced. Next, the TransportTools library merges the smaller pieces and gives an overall view of the tunnel network for the complete trajectory without quality loss.
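The orchestration pattern is plain divide-and-conquer. In the sketch below, analyze_chunk is a hypothetical stand-in for invoking CAVER3 on one trajectory slice (it is not a real CAVER3 or TransportTools call), and the final merge step plays the role that TransportTools performs on real data.

```python
from multiprocessing import Pool

def analyze_chunk(chunk_range):
    """Hypothetical stand-in for running CAVER3 on one trajectory slice;
    here it just reports which snapshots the slice covers."""
    start, stop = chunk_range
    return {"snapshots": (start, stop), "tunnels": []}

def sliced_analysis(n_snapshots, chunk_size=1000, workers=4):
    """Divide-and-conquer: analyze fixed-size slices independently (and in
    parallel), then merge the per-slice results into one overall view."""
    chunks = [(s, min(s + chunk_size, n_snapshots))
              for s in range(0, n_snapshots, chunk_size)]
    with Pool(workers) as pool:
        results = pool.map(analyze_chunk, chunks)
    merged = [t for r in results for t in r["tunnels"]]   # merge step
    return merged, results

if __name__ == "__main__":
    _, parts = sliced_analysis(5000)
    print([p["snapshots"] for p in parts])   # five slices of 1000 snapshots
```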
ABSTRACT
Recently, high-resolution gamma cameras have been developed with detectors containing more than 10⁵-10⁶ elements. Single-photon emission computed tomography (SPECT) imagers based on these detectors usually also have a large number of voxel bins and therefore face memory storage issues for the system matrix when performing fast tomographic reconstructions using iterative algorithms. To address these issues, we have developed a method that parameterizes the detector response to a point source and generates the system matrix on the fly during MLEM or OSEM on graphics hardware. The calibration method, interpolation of coefficient data, and reconstruction results are presented in the context of a recently commissioned small-animal SPECT imager, called FastSPECT III.
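The reconstruction itself is the standard MLEM fixed-point update; the paper's contribution is generating the system matrix A on the fly from a parameterized detector response instead of storing it. Here is a numpy sketch with a stored toy matrix.

```python
import numpy as np

def mlem(A, y, n_iter=50):
    """Maximum-likelihood EM for emission tomography:
       x <- x * A^T(y / (A x)) / (A^T 1)
    The paper avoids storing A by regenerating its entries from a
    parameterized detector response on the GPU at every iteration."""
    x = np.ones(A.shape[1])
    sens = A.sum(axis=0)                       # A^T 1, the sensitivity image
    for _ in range(n_iter):
        x *= A.T @ (y / (A @ x + 1e-12)) / sens
    return x

rng = np.random.default_rng(0)
A = rng.random((200, 64))                      # toy system matrix (stored here,
x_true = rng.random(64)                        # generated on the fly in the paper)
y = rng.poisson(A @ x_true)                    # Poisson projection data
print(np.corrcoef(mlem(A, y), x_true)[0, 1])   # recovery correlation
```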