We’re approaching the computational limits of deep learning. That’s according to researchers at the Massachusetts Institute of Technology, Underwood International College, and the University of Brasilia, who found in a recent study that progress in deep learning has been “strongly reliant” on increases in compute. It’s their assertion that continued progress will require “dramatically” more computationally efficient deep learning methods, either through changes to existing techniques or via new as-yet-undiscovered methods.
“We show deep learning is not computationally expensive by accident, but by design. The same flexibility that makes it excellent at modeling diverse phenomena and outperforming expert models also makes it dramatically more computationally expensive,” the coauthors wrote. “Despite this, we find that the actual computational burden of deep learning models is scaling more rapidly than (known) lower bounds from theory, suggesting that substantial improvements might be possible.”
Deep learning is the subfield of machine learning concerned with algorithms inspired by the structure and function of the brain. These algorithms — called artificial neural networks — consist of functions (neurons) arranged in layers that transmit signals to other neurons. The signals, which are the product of input data fed into the network, travel from layer to layer and slowly “tune” the network, in effect adjusting the synaptic strength (weights) of each connection. The network eventually learns to make predictions by extracting features from the data set and identifying cross-sample trends.
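To make the "tuning" idea concrete, here is a minimal sketch of a single artificial neuron whose weight and bias are adjusted step by step to fit a toy data set. It is an illustration only, not anything from the study: the data, the learning rate, the step count, and the `train_neuron` helper are all assumptions chosen for readability.

```python
# Toy illustration only: one linear neuron trained by gradient descent.
# The samples, learning rate, and step count are made-up assumptions.

def train_neuron(samples, learning_rate=0.05, steps=500):
    """Fit y ~ weight * x + bias by repeatedly nudging the two parameters."""
    weight, bias = 0.0, 0.0
    n = len(samples)
    for _ in range(steps):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in samples:
            # Forward pass: compute the prediction and its error.
            error = (weight * x + bias) - y
            # Accumulate the gradient of the mean squared error.
            grad_w += 2 * error * x / n
            grad_b += 2 * error / n
        # Weight adjustment: move each parameter against its gradient.
        weight -= learning_rate * grad_w
        bias -= learning_rate * grad_b
    return weight, bias


# Hypothetical data generated from y = 2x + 1.
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
weight, bias = train_neuron(samples)
print(round(weight, 2), round(bias, 2))
```

Real deep learning models repeat this kind of adjustment across millions or billions of weights, which is where the computational cost discussed in the study comes from.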
The researchers analyzed 1,058 papers from the preprint server arXiv.org as well as other benchmark sources to understand the connection between deep learning performance and computation, paying particular attention to domains including image classification, object detection, question answering, named entity recognition, and machine translation. They performed two separate analyses of computational requirements reflecting the two types of information available:
- Computation per network pass, or the number of floating-point operations required for a single pass (i.e. weight adjustment) in a given deep learning model.
- Hardware burden, or the computational capability of the hardware used to train the model, calculated as the number of processors multiplied by the computation rate and time. (The researchers concede that while it’s an imprecise measure of computation, it was more widely reported in the papers they analyzed than other benchmarks.)
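As a rough illustration of the two measures, the sketch below counts floating-point operations for a forward pass through a small fully connected network and computes hardware burden as processors multiplied by computation rate and time. The layer sizes, processor count, and throughput figure are hypothetical, and the FLOP count is a simplification (one multiply and one add per weight, forward pass only); the paper's own accounting may differ.

```python
def dense_forward_flops(layer_sizes):
    """Approximate FLOPs for one forward pass through fully connected layers.

    Counts one multiply and one add per weight; ignores biases and
    activation functions (a simplifying assumption).
    """
    pairs = zip(layer_sizes, layer_sizes[1:])
    return sum(2 * n_in * n_out for n_in, n_out in pairs)


def hardware_burden(num_processors, flops_per_second, seconds):
    """Processors x computation rate x time, as described in the article."""
    return num_processors * flops_per_second * seconds


# Hypothetical network: 784 inputs, 128 hidden units, 10 outputs.
flops = dense_forward_flops([784, 128, 10])

# Hypothetical training run: 8 processors at 1e13 FLOP/s for one day.
burden = hardware_burden(8, 1e13, 24 * 3600)

print(f"FLOPs per forward pass: {flops}")
print(f"Hardware burden: {burden:.3e} FLOPs")
```

Hardware burden is an upper bound on the work actually done, since processors are rarely fully utilized, which is consistent with the researchers calling it an imprecise measure.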
The coauthors report “highly statistically significant” slopes and “strong explanatory power” for all benchmarks except machine translation from English to German, where there was little variation in the computing power used. Object detection, named entity recognition, and machine translation in particular showed large increases in hardware burden with relatively small improvements in outcomes. On the popular open source ImageNet benchmark, computational power explained 43% of the variance in image classification accuracy.
The researchers estimate that three years of algorithmic improvement is equivalent to a 10 times increase in computing power. “Collectively, our results make it clear that, across many areas of deep learning, progress in training models has depended on large increases in the amount of computing power being used,” they wrote. “Another possibility is that getting algorithmic improvement may itself require complementary increases in computing power.”
The researchers also extrapolated their projections to understand the computational power needed to hit various theoretical benchmarks, along with the associated economic and environmental costs. According to even the most optimistic calculation, reducing the image classification error rate on ImageNet would require 10^5 times more computing.
To their point, a Synced report estimated that the University of Washington’s Grover fake news detection model cost $25,000 to train in about two weeks. OpenAI reportedly racked up a whopping $12 million to train its GPT-3 language model, and Google spent an estimated $6,912 training BERT, a bidirectional transformer model that redefined the state of the art for 11 natural language processing tasks.
In a separate report last June, researchers at the University of Massachusetts at Amherst concluded that training and searching a certain model produces roughly 626,000 pounds of carbon dioxide emissions. That’s equivalent to nearly five times the lifetime emissions of the average U.S. car.
“We do not anticipate that the computational requirements implied by the targets … The hardware, environmental, and monetary costs would be prohibitive,” the researchers wrote. “Hitting this in an economical way will require more efficient hardware, more efficient algorithms, or other improvements such that the net impact is this large a gain.”
The researchers note there’s historical precedent for deep learning improvements at the algorithmic level. They point to the emergence of hardware accelerators like Google’s tensor processing units, field-programmable gate arrays (FPGAs), and application-specific integrated circuits (ASICs), as well as attempts to reduce computational complexity through network compression and acceleration techniques. They also cite neural architecture search and meta learning, which use optimization to find architectures that retain good performance on a class of problems, as avenues toward computationally efficient methods of improvement.
Indeed, an OpenAI study suggests that the amount of compute needed to train an AI model to the same performance on classifying images in ImageNet has been decreasing by a factor of 2 every 16 months since 2012. Google’s Transformer architecture surpassed a previous state-of-the-art model, seq2seq (also developed by Google), with 61 times less compute three years after seq2seq’s introduction. And DeepMind’s AlphaZero, a system that taught itself from scratch how to master the games of chess, shogi, and Go, took eight times less compute to match an improved version of the system’s predecessor, AlphaGo Zero, one year later.
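To compare these efficiency gains on a common scale, one can convert each into an implied "halving time," the number of months it takes for the required compute to fall by half. The sketch below assumes a steady exponential rate of improvement, which is a simplification, and the `halving_time_months` helper is a hypothetical name introduced here for illustration.

```python
import math


def halving_time_months(reduction_factor, elapsed_months):
    """Months needed to halve compute, assuming a steady exponential decline."""
    return elapsed_months / math.log2(reduction_factor)


# Figures as quoted in the article.
examples = {
    "ImageNet (OpenAI study)": (2, 16),        # factor of 2 every 16 months
    "Transformer vs. seq2seq": (61, 36),       # 61x less compute in three years
    "AlphaZero vs. predecessor": (8, 12),      # 8x less compute in one year
}

for name, (factor, months) in examples.items():
    print(f"{name}: halving every {halving_time_months(factor, months):.1f} months")
```

Under that assumption, the Transformer and AlphaZero examples imply halving times of roughly six and four months respectively, considerably faster than the 16-month trend reported for ImageNet classification.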
“The explosion in computing power used for deep learning models has ended the ‘AI winter’ and set new benchmarks for computer performance on a wide range of tasks. However, deep learning’s prodigious appetite for computing power imposes a limit on how far it can improve performance in its current form, particularly in an era when improvements in hardware performance are slowing,” the researchers wrote. “The likely impact of these computational limits is forcing … machine learning towards techniques that are more computationally-efficient than deep learning.”