Search | BVS - MINISTÉRIO DA SAÚDE

Hardware-aware training for large-scale and diverse deep learning inference workloads using in-memory computing-based accelerators.

Rasch, Malte J; Mackin, Charles; Le Gallo, Manuel; Chen, An; Fasoli, Andrea; Odermatt, Frédéric; Li, Ning; Nandakumar, S R; Narayanan, Pritish; Tsai, Hsinyu; Burr, Geoffrey W; Sebastian, Abu; Narayanan, Vijay.

Nat Commun ; 14(1): 5282, 2023 Aug 30.

Article in English | MEDLINE | ID: mdl-37648721

ABSTRACT

Analog in-memory computing, a promising approach for energy-efficient acceleration of deep learning workloads, computes matrix-vector multiplications only approximately, due to nonidealities that are often non-deterministic or nonlinear. This can adversely impact the achievable inference accuracy. Here, we develop a hardware-aware retraining approach to systematically examine the accuracy of analog in-memory computing across multiple network topologies, and investigate sensitivity and robustness to a broad set of nonidealities. By introducing a realistic crossbar model, we improve significantly on earlier retraining approaches. We show that many larger-scale deep neural networks, including convnets, recurrent networks, and transformers, can in fact be successfully retrained to show iso-accuracy with the floating point implementation. Our results further suggest that nonidealities that add noise to the inputs or outputs, not the weights, have the largest impact on accuracy, and that recurrent networks are particularly robust to all nonidealities.
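As a rough illustration of what "approximate" matrix-vector multiplication means here, the sketch below adds Gaussian noise either to the stored weights or to the accumulated outputs. It is not the crossbar model from the paper: the function `noisy_mvm`, its parameters, and the additive Gaussian noise model are all assumptions made for this example.

```python
import random

def noisy_mvm(weights, x, weight_noise=0.0, output_noise=0.0, seed=0):
    """Matrix-vector product with additive Gaussian noise (illustrative only)."""
    rng = random.Random(seed)  # fixed seed so repeated calls are reproducible
    y = []
    for row in weights:
        # Perturb each stored weight, as an imperfectly programmed device might.
        acc = sum((w + rng.gauss(0.0, weight_noise)) * xi
                  for w, xi in zip(row, x))
        # Perturb the accumulated result, as a noisy readout might.
        y.append(acc + rng.gauss(0.0, output_noise))
    return y

W = [[0.5, -0.25], [0.1, 0.3]]
x = [1.0, 2.0]
exact = noisy_mvm(W, x)  # both noise levels are zero: the exact product
approx = noisy_mvm(W, x, weight_noise=0.02, output_noise=0.05)
```

Comparing `exact` and `approx` for different noise settings gives a feel for how each noise source perturbs the result; hardware-aware retraining, as described in the abstract, exposes the network to such perturbations during training.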

Optimised weight programming for analogue memory-based deep neural networks.

Mackin, Charles; Rasch, Malte J; Chen, An; Timcheck, Jonathan; Bruce, Robert L; Li, Ning; Narayanan, Pritish; Ambrogio, Stefano; Le Gallo, Manuel; Nandakumar, S R; Fasoli, Andrea; Luquin, Jose; Friz, Alexander; Sebastian, Abu; Tsai, Hsinyu; Burr, Geoffrey W.

Nat Commun ; 13(1): 3765, 2022 Jun 30.

Article in English | MEDLINE | ID: mdl-35773285

ABSTRACT

Analogue memory-based deep neural networks provide energy-efficiency and per-area throughput gains relative to state-of-the-art digital counterparts such as graphics processing units. Recent advances focus largely on hardware-aware algorithmic training and improvements to circuits, architectures, and memory devices. Optimal translation of software-trained weights into analogue hardware weights, given the plethora of complex memory non-idealities, represents an equally important task. We report a generalised computational framework that automates the crafting of complex weight programming strategies to minimise accuracy degradations during inference, particularly over time. The framework is agnostic to network structure and generalises well across recurrent, convolutional, and transformer neural networks. As a highly flexible numerical heuristic, the approach accommodates arbitrary device-level complexity, making it potentially relevant for a variety of analogue memories. By quantifying the limit of achievable inference accuracy, it also enables analogue memory-based deep neural network accelerators to reach their full inference potential.

Subjects

Neural Networks, Computer; Software; Computers

Toward Software-Equivalent Accuracy on Transformer-Based Deep Neural Networks With Analog Memory Devices.

Spoon, Katie; Tsai, Hsinyu; Chen, An; Rasch, Malte J; Ambrogio, Stefano; Mackin, Charles; Fasoli, Andrea; Friz, Alexander M; Narayanan, Pritish; Stanisavljevic, Milos; Burr, Geoffrey W.

Front Comput Neurosci ; 15: 675741, 2021.

Article in English | MEDLINE | ID: mdl-34290595

ABSTRACT

Recent advances in deep learning have been driven by ever-increasing model sizes, with networks growing to millions or even billions of parameters. Such enormous models call for fast and energy-efficient hardware accelerators. We study the potential of Analog AI accelerators based on Non-Volatile Memory, in particular Phase Change Memory (PCM), for software-equivalent accurate inference of natural language processing applications. We demonstrate a path to software-equivalent accuracy for the GLUE benchmark on BERT (Bidirectional Encoder Representations from Transformers), by combining noise-aware training to combat inherent PCM drift and noise sources, together with reduced-precision digital attention-block computation down to INT6.

Training fully connected networks with resistive memories: impact of device failures.

Romero, Louis P; Ambrogio, Stefano; Giordano, Massimo; Cristiano, Giorgio; Bodini, Martina; Narayanan, Pritish; Tsai, Hsinyu; Shelby, Robert M; Burr, Geoffrey W.

Faraday Discuss ; 213(0): 371-391, 2019 Feb 18.

Article in English | MEDLINE | ID: mdl-30357183

ABSTRACT

Hardware accelerators based on two-terminal non-volatile memories (NVMs) can potentially provide competitive speed and accuracy for the training of fully connected deep neural networks (FC-DNNs), with respect to GPUs and other digital accelerators. We recently proposed [S. Ambrogio et al., Nature, 2018] novel neuromorphic crossbar arrays, consisting of a pair of phase-change memory (PCM) devices combined with a pair of 3-Transistor 1-Capacitor (3T1C) circuit elements, so that each weight was implemented using multiple conductances of varying significance, and then showed that this weight element can train FC-DNNs to software-equivalent accuracies. However, real arrays of emerging NVMs such as PCM typically include some failed devices (e.g., <100% yield), either due to fabrication issues or early endurance failures, which can degrade DNN training accuracy. This paper explores the impact of device failures (NVM conductances that may contribute read current but cannot be programmed) on DNN training and test accuracy. Results show that "stuck-on" and "dead" devices, exhibiting high and low read conductances, respectively, do in fact degrade accuracy to some degree. We find that the presence of the CMOS-based and thus highly reliable 3T1C devices greatly increases system robustness. After studying the inherent mechanisms, we study the dependence of DNN accuracy on the number of functional weights, the number of neurons in the hidden layer, and the number and type of damaged devices. Finally, we describe conditions under which making the network larger or adjusting the network hyperparameters can still improve the network accuracy, even in the presence of failed devices.
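The effect of "stuck-on" and "dead" devices on a single weight can be pictured with a small sketch. It assumes, purely for illustration, that a weight is read as the difference of two conductances in a normalized range; the abstract does not give the encoding, and the function `effective_weight`, the fault labels, and the conductance bounds are hypothetical.

```python
def effective_weight(g_plus, g_minus, fault_plus=None, fault_minus=None,
                     g_min=0.0, g_max=1.0):
    """Weight read from a differential conductance pair with optional faults.

    Assumption for this sketch: weight = G_plus - G_minus, with conductances
    normalized to the range [g_min, g_max].
    """
    def apply_fault(g, fault):
        if fault == "stuck_on":
            return g_max  # reads as high conductance and cannot be programmed
        if fault == "dead":
            return g_min  # reads as low conductance and cannot be programmed
        return g          # healthy device keeps its programmed value

    return apply_fault(g_plus, fault_plus) - apply_fault(g_minus, fault_minus)

healthy = effective_weight(0.6, 0.2)                        # intended weight
stuck = effective_weight(0.6, 0.2, fault_plus="stuck_on")   # positive device stuck high
dead = effective_weight(0.6, 0.2, fault_minus="dead")       # negative device dead
```

In this toy model a failed device shifts the weight away from its intended value by an amount that training cannot correct on that device, which is one way to see why additional reliable conductances per weight could help.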

Equivalent-accuracy accelerated neural-network training using analogue memory.

Ambrogio, Stefano; Narayanan, Pritish; Tsai, Hsinyu; Shelby, Robert M; Boybat, Irem; di Nolfo, Carmelo; Sidler, Severin; Giordano, Massimo; Bodini, Martina; Farinha, Nathan C P; Killeen, Benjamin; Cheng, Christina; Jaoudi, Yassine; Burr, Geoffrey W.

Nature ; 558(7708): 60-67, 2018 Jun.

Article in English | MEDLINE | ID: mdl-29875487

ABSTRACT

Neural-network training can be slow and energy intensive, owing to the need to transfer the weight data for the network between conventional digital memory chips and processor chips. Analogue non-volatile memory can accelerate the neural-network training algorithm known as backpropagation by performing parallelized multiply-accumulate operations in the analogue domain at the location of the weight data. However, the classification accuracies of such in situ training using non-volatile-memory hardware have generally been less than those of software-based training, owing to insufficient dynamic range and excessive weight-update asymmetry. Here we demonstrate mixed hardware-software neural-network implementations that involve up to 204,900 synapses and that combine long-term storage in phase-change memory, near-linear updates of volatile capacitors and weight-data transfer with 'polarity inversion' to cancel out inherent device-to-device variations. We achieve generalization accuracies (on previously unseen data) equivalent to those of software-based training on various commonly used machine-learning test datasets (MNIST, MNIST-backrand, CIFAR-10 and CIFAR-100). The computational energy efficiency of 28,065 billion operations per second per watt and throughput per area of 3.6 trillion operations per second per square millimetre that we calculate for our implementation exceed those of today's graphical processing units by two orders of magnitude. This work provides a path towards hardware accelerators that are both fast and energy efficient, particularly on fully connected neural-network layers.
