Google claims that its AI supercomputer is faster and greener than systems built on the Nvidia A100 chip.
Google said its supercomputers make it easy to reconfigure connections between chips on the fly, helping avoid problems and tweak for performance gains. (Image Source: Google)
Alphabet Inc’s Google revealed new details about the supercomputers it uses to train its artificial intelligence models on Tuesday, claiming the systems are both faster and more power-efficient than comparable Nvidia Corp systems.
Google designed its own custom chip, known as the Tensor Processing Unit, or TPU, and uses those chips for more than 90% of the company’s artificial intelligence training, the process of feeding data through models to make them useful for tasks like responding to queries with human-like accuracy.
The TPU is now in its fourth generation. On Tuesday, Google published a scientific paper detailing how it strung together more than 4,000 of the chips into a supercomputer, using its own custom-developed optical switches to link the individual machines.
Because the so-called large language models that power technologies like Google’s Bard or OpenAI’s ChatGPT have exploded in size and are far too large to store on a single chip, improving these connections has become a key point of competition among companies that build AI supercomputers.
Instead, the models must be distributed across thousands of chips, which must then collaborate for weeks or months to train the model. Google’s PaLM model, the company’s largest publicly disclosed language model to date, was trained over 50 days by splitting it across two of the 4,000-chip supercomputers.
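To see why the connections between chips matter, it helps to picture how a model is split up. The toy sketch below shows the general idea of sharding a weight matrix row-wise across several "chips", each of which computes its slice of the output before the slices are combined. It is purely illustrative and assumes nothing about Google's actual TPU software; the names `matvec` and `sharded_matvec` are hypothetical.

```python
# Illustrative sketch of splitting a model across chips: a weight matrix
# too large for one "chip" is divided row-wise, each chip computes its
# slice of the output, and the slices are concatenated. This is NOT
# Google's implementation, just the general idea of model sharding.

def matvec(rows, x):
    # Plain matrix-vector product over a list-of-lists matrix.
    return [sum(w * xi for w, xi in zip(row, x)) for row in rows]

def sharded_matvec(weight, x, num_chips):
    """Split `weight` row-wise across `num_chips` and combine the partial results."""
    shard_size = (len(weight) + num_chips - 1) // num_chips
    output = []
    for chip in range(num_chips):
        shard = weight[chip * shard_size:(chip + 1) * shard_size]
        output.extend(matvec(shard, x))  # each chip works only on its own shard
    return output

weight = [[1, 2], [3, 4], [5, 6], [7, 8]]  # a tiny stand-in for model weights
x = [1, 1]
# Sharding across two "chips" gives the same answer as one big multiply.
assert sharded_matvec(weight, x, num_chips=2) == matvec(weight, x)
```

In a real system each shard would live in a different chip's memory and the combination step would happen over the interconnect, which is exactly why the links between chips become the bottleneck.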
Google claims that its supercomputers make it simple to reconfigure connections between chips on the fly, thereby avoiding problems and optimising performance.
In a blog post about the system, Google Fellow Norm Jouppi and Google Distinguished Engineer David Patterson wrote, “Circuit switching makes it easy to route around failed components. We can even change the topology of the supercomputer interconnect to accelerate the performance of an ML (machine learning) model because of this flexibility.”
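The idea of routing around a failed component can be pictured with a toy graph model: treat chips as nodes, links as edges, and recompute a path that avoids any failed node. The sketch below is an assumption-laden illustration in plain Python, not how Google's optical circuit switches actually work.

```python
# Toy sketch of "routing around failed components": chips are nodes in a
# small network; when one fails, a breadth-first search finds a shortest
# path that avoids it. Purely illustrative; the real system reconfigures
# optical circuit switches rather than computing software routes.
from collections import deque

def shortest_path(edges, src, dst, failed=frozenset()):
    """BFS shortest path from src to dst, skipping failed nodes."""
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in edges[path[-1]]:
            if nxt not in seen and nxt not in failed:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # dst unreachable with the current failures

# Four chips in a ring: 0 - 1 - 2 - 3 - 0
edges = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
print(shortest_path(edges, 0, 2))              # direct route via chip 1
print(shortest_path(edges, 0, 2, failed={1}))  # reroutes via chip 3
```

The same graph view also hints at the second claim in the quote: changing which edges exist (the topology) changes which communication patterns are cheap.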
While Google is only now disclosing information about its supercomputer, it has been operational since 2020 in a data centre in Mayes County, Oklahoma. Google stated that the system was used by the startup Midjourney to train its model, which generates new images after being fed a few words of text.
According to the paper, Google’s chips are up to 1.7 times faster and 1.9 times more power-efficient than a system based on Nvidia’s A100 chip, which was on the market at the same time as the fourth-generation TPU.
A representative for Nvidia declined to comment.
Google stated that it did not compare its fourth-generation chip to Nvidia’s current flagship H100 chip because the H100 was introduced after Google’s chip and uses newer technology.
Google hinted at a new TPU to compete with the Nvidia H100 but provided no details, with Jouppi telling Reuters that Google has “a healthy pipeline of future chips.”