At Hot Chips 2024, OpenAI is giving a one-hour keynote on building scalable AI infrastructure. That makes a lot of sense, as OpenAI already consumes an enormous amount of compute and will likely use even more in the coming years.
We're live at Hot Chips 2024 this week, so please excuse any typos.
OpenAI Keynote on Building Scalable AI Infrastructure
I'm sure most of you are familiar with ChatGPT, OpenAI, and how LLMs work, so I'll move quickly through the next few slides since you're likely already up to speed.
OpenAI Hot Chips 2024_Page_03
OpenAI Hot Chips 2024_Page_04
OpenAI Hot Chips 2024_Page_05
From a scale perspective, GPT-1 was cool in 2018. GPT-2 was more consistent. GPT-3 had in-context learning. GPT-4 is actually useful. Future models are expected to be even more useful with new behaviors.
OpenAI Hot Chips 2024_Page_06
The big observation is that scaling up produces better, more useful AI.
OpenAI Hot Chips 2024_Page_07
The question was: how would OpenAI know whether training a larger model would produce a better one? OpenAI observed that every time it doubled its compute, it got better results. The graph below spans a four-order-of-magnitude increase in compute, and the scaling still held.
OpenAI Hot Chips 2024_Page_08
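As a rough illustration of that observation, scaling laws are usually written as a power law in training compute, so each doubling of compute buys a predictable improvement. A minimal sketch of that form, with constants invented for illustration rather than anything OpenAI showed:

```python
import numpy as np

# Hypothetical power-law scaling: L(C) = a * C**(-b), where C is training
# compute and L is loss. The constants a and b below are assumptions for
# illustration, not OpenAI's fitted values.
a, b = 2.0, 0.05

def loss(compute: float) -> float:
    return a * compute ** (-b)

# Four orders of magnitude of compute, matching the span of the slide's graph.
for c in [1e0, 1e1, 1e2, 1e3, 1e4]:
    print(f"compute {c:>8.0e} -> predicted loss {loss(c):.3f}")
```

The key property is that the improvement per doubling stays steady across the whole range, which is what makes the "train a bigger model" bet predictable.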
OpenAI looked at tasks like coding and found that a similar pattern applies. Because the metric is a mean log pass rate rather than a simple pass/fail average, it is not overly weighted towards solving easy coding problems.
OpenAI Hot Chips 2024_Page_09
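To make that weighting point concrete, here is a toy comparison of a plain mean pass rate against a mean log pass rate; the per-problem pass rates are invented for illustration:

```python
import numpy as np

# Invented pass rates for five problems, ordered easy -> hard.
pass_rates = np.array([0.99, 0.95, 0.50, 0.05, 0.01])

plain_mean = pass_rates.mean()          # dominated by the easy wins
mean_log = np.log(pass_rates).mean()    # hard problems weigh heavily

print(f"mean pass rate:     {plain_mean:.3f}")
print(f"mean log pass rate: {mean_log:.3f}")

# Moving the hardest problem from 1% to 2% shifts the mean log score far
# more than moving an easy one from 95% to 96% does.
```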
This is the MMLU benchmark, an attempt at a gold standard for machine learning benchmarks, yet with this kind of progress GPT-4 was already scoring around 90% on the test.
OpenAI Hot Chips 2024_Page_10
Here is a graph of the industry compute used to train various frontier models, which has grown roughly 4x annually since 2018.
OpenAI Hot Chips 2024_Page_13
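As a sanity check on that growth rate, compounding 4x per year from 2018 to 2024 works out to roughly a 4,000x increase:

```python
# Rough compounding, assuming ~4x growth per year since 2018 (per the slide).
growth_per_year = 4
years = 2024 - 2018
print(f"~{growth_per_year ** years:,}x more training compute than 2018")  # ~4,096x
```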
GPT-1 trained on a single box for a few weeks; training has since scaled up to huge GPU clusters.
OpenAI Hot Chips 2024_Page_14
Around 2018, the growth rate of training compute slowed from 6-7x annually to 4x. Much of the low-hanging fruit had likely been picked by 2018. Going forward, issues such as cost and power will become much bigger challenges.
OpenAI Hot Chips 2024_Page_15
On the inference side, demand is driven by intelligence. Most of the inference computation is used for top-end models. Smaller models tend to require significantly less computation. Demand for inference GPUs is growing significantly.
OpenAI Hot Chips 2024_Page_16
Here are three key claims about AI computing:
OpenAI Hot Chips 2024_Page_17
OpenAI's thesis is that the world needs more AI infrastructure than is currently planned.
OpenAI Hot Chips 2024_Page_18
The black line shows actual solar demand, while the other lines show experts' forecasts. Actual demand keeps climbing, yet the experts repeatedly predicted it would level off.
OpenAI Hot Chips 2024_Page_19
For nearly 50 years, Moore's Law kept delivering, far longer than many people thought possible.
OpenAI Hot Chips 2024_Page_20
As a result, OpenAI believes that AI warrants massive investment, with gains in computing power already spanning more than eight orders of magnitude.
OpenAI says AI hardware needs to be designed for massive deployment. Reliability, availability, and serviceability (RAS) is one example. Clusters get so big that they experience both hard and soft failures. Silent data corruption can occur that is not reproducible even if you can isolate the suspect GPUs. Cluster failures have a wide impact.
OpenAI Hot Chips 2024_Page_22
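OpenAI did not detail how it catches silent data corruption, but one common defense is redundant recomputation: rerun a sample of work on a second device and compare checksums. A hedged sketch of the idea:

```python
import numpy as np

# Illustrative SDC check: recompute a matmul and compare cheap checksums.
# This only shows the technique; it is not OpenAI's actual mechanism.

def checksum(x: np.ndarray) -> float:
    # A cheap reduction; real systems may use stronger hashes.
    return float(np.sum(x, dtype=np.float64))

def matmul_with_recheck(a, b, rtol=1e-6):
    primary = a @ b   # would run on the suspect GPU
    shadow = a @ b    # would run on a known-good device
    if not np.isclose(checksum(primary), checksum(shadow), rtol=rtol):
        raise RuntimeError("possible silent data corruption detected")
    return primary

a = np.random.rand(256, 256).astype(np.float32)
b = np.random.rand(256, 256).astype(np.float32)
out = matmul_with_recheck(a, b)
```

The catch, as the keynote notes, is that real SDC is often not reproducible, so sampling and statistical checks matter more than any single recomputation.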
OpenAI says repair costs need to come down, and the blast radius needs to shrink so that the failure of one component takes down fewer others.
OpenAI Hot Chips 2024_Page_23
One idea is to use graceful degradation, very similar to what we do in our hosting clusters at STH, and without the need for technician time. Validation is also important in large environments.
OpenAI Hot Chips 2024_Page_24
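As a sketch of what graceful degradation can look like in practice, the control loop below cordons unhealthy nodes and keeps the job running on the remainder; the Node class and health probe are hypothetical stand-ins for real orchestration APIs:

```python
import random
import time

class Node:
    """Hypothetical cluster node with a stand-in health probe."""
    def __init__(self, name: str):
        self.name = name

    def is_healthy(self) -> bool:
        return random.random() > 0.05  # stand-in for a real health check

nodes = [Node(f"gpu-node-{i}") for i in range(8)]

for _ in range(3):  # a few control-loop iterations; real loops run forever
    healthy = [n for n in nodes if n.is_healthy()]
    for n in set(nodes) - set(healthy):
        # Cordon instead of aborting the whole job.
        print(f"cordoning {n.name}; job continues without it")
    nodes = healthy
    time.sleep(0.1)

print(f"{len(nodes)} nodes still serving the job")
```

The point is that no technician needs to intervene for the job to make progress; repairs can be batched later.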
Power becomes a big challenge because there is only so much power available, and because training is synchronous, GPUs across a cluster ramp up and down at the same time, creating large swings in data center load.
OpenAI Hot Chips 2024_Page_25
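A back-of-the-envelope model shows why synchronized loads are hard on a facility; the per-GPU power figures here are assumptions, not OpenAI's numbers:

```python
# Toy model: when every accelerator enters the compute phase at once, the
# facility sees the full fleet's power delta as one step change.
gpus = 100_000
idle_w, busy_w = 100, 700  # assumed per-GPU draw in watts

swing_mw = gpus * (busy_w - idle_w) / 1e6
print(f"synchronized step change: ~{swing_mw:.0f} MW")  # ~60 MW swing
```

Swings of that scale are why operators look at staggering ramps or smoothing idle periods rather than letting the whole fleet step together.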
Just as we have learned important lessons running infrastructure over the years, so has OpenAI, as the next slide shows.
OpenAI Hot Chips 2024_Page_26
It's interesting that everyone is so focused on performance, yet performance is only one of the four points.
Final Words
The scaling challenges and cluster-level challenges are huge. If you look at the Top500, a large AI cluster today is roughly the size of the top three or four systems on that list combined. It was interesting to see what major customers are saying about their AI hardware needs.