Many companies are hopeful that AI will revolutionize their business, but the sheer cost of training advanced AI systems can quickly dash those hopes. Elon Musk has noted that engineering issues often cause progress to stall, especially when it comes to optimizing hardware like GPUs to efficiently handle the massive computational requirements of training and fine-tuning large language models.
While large technology companies can afford to spend millions, and sometimes billions, of dollars on training and optimization, smaller companies and startups are often left behind due to lack of funding in the short term. In this article, we discuss some strategies that may allow developers with limited resources to train AI models without incurring large costs.
In for a penny, in for a pound
As we know, the creation and release of AI products, whether foundation models/large language models (LLMs) or fine-tuned downstream applications, relies heavily on specialized AI chips, specifically GPUs. These GPUs are expensive and hard to obtain, which is why SemiAnalysis coined the terms “GPU rich” and “GPU poor” within the machine learning (ML) community. Training LLMs is costly primarily because of expenses associated with the hardware, including both acquisition and maintenance, rather than the ML algorithms or expertise.
Training these models requires extensive computation on powerful clusters, and the larger the model, the longer it takes. For example, training LLaMA 2 70B means exposing 70 billion parameters to 2 trillion tokens, which requires at least 10^24 floating point operations. If you're GPU-poor, should you give up? No.
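As a quick sanity check, the widely cited rule of thumb of roughly 6 FLOPs per parameter per training token (a heuristic, not an exact accounting) lands in the same ballpark for the LLaMA 2 70B numbers above:

```python
# Back-of-the-envelope training compute, using the common
# ~6 FLOPs per parameter per token heuristic
params = 70e9   # LLaMA 2 70B parameters
tokens = 2e12   # 2 trillion training tokens
flops = 6 * params * tokens
print(f"{flops:.1e} FLOPs")  # ~8.4e+23, i.e., on the order of 10^24
```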
Alternative strategies
There are several strategies that technology companies are currently leveraging to find alternative solutions, reduce reliance on expensive hardware, and ultimately save costs.
One approach is to tune and streamline training hardware. This avenue is still largely experimental and investment-intensive, but it holds promise for future optimization of LLM training. Examples of such hardware-related solutions include custom AI chips from Microsoft and Meta, new semiconductor initiatives from Nvidia and OpenAI, single compute clusters from Baidu, rental GPUs from Vast, and Sohu chips from Etched.
While this is an important step forward, this methodology is best suited for larger companies that can afford to invest heavily now to reduce future expenses, not for new entrants with limited funds who want to develop an AI product now.
What to do: Innovative software
With a low budget in mind, there is another way to optimize your LLM training and reduce costs through innovative software. This approach is more affordable and accessible to most ML engineers, whether they are seasoned professionals, AI enthusiasts, or software developers looking to enter the field. Let's take a closer look at some of these code-based optimization tools.
Mixed Precision Training
Summary: Imagine your company has 20 employees but rents office space for 200. Clearly, this is a waste of resources. A similar inefficiency occurs in practice during model training, where ML frameworks often allocate more memory than is actually needed. Mixed precision training eliminates that waste by matching numerical precision to what each operation actually requires, improving both speed and memory usage.
How it works: Lower-precision bfloat16 and float16 arithmetic is combined with standard float32 arithmetic, so each operation moves fewer bytes and more computation fits on the chip at once. This may sound like technical arcana to non-engineers, but essentially it means that AI models can process data faster and require less memory without meaningfully compromising accuracy.
Improvement metrics: This technique can improve execution time by up to 6x on GPUs and 2-3x on TPUs (Google's Tensor Processing Units). Open-source frameworks such as Nvidia's APEX and Meta's PyTorch support mixed precision training and are easy to integrate into existing pipelines. By implementing this technique, companies can significantly reduce GPU costs while maintaining acceptable model performance.
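To make this concrete, here is a minimal sketch of a training loop using PyTorch's built-in automatic mixed precision (AMP) utilities; the model, optimizer, and data here are toy placeholders, not a prescription:

```python
import torch
from torch.cuda.amp import autocast, GradScaler

# Placeholder model, optimizer, and loss -- substitute your own
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = torch.nn.MSELoss()
scaler = GradScaler()  # rescales gradients to avoid float16 underflow

for step in range(100):
    inputs = torch.randn(32, 1024, device="cuda")
    targets = torch.randn(32, 1024, device="cuda")

    optimizer.zero_grad(set_to_none=True)
    # Ops inside autocast run in float16/bfloat16 where it is safe;
    # numerically sensitive ops stay in float32
    with autocast():
        outputs = model(inputs)
        loss = loss_fn(outputs, targets)

    scaler.scale(loss).backward()  # backward pass on the scaled loss
    scaler.step(optimizer)         # unscales gradients, then steps
    scaler.update()                # adjusts the scale factor over time
```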
Activation Checkpointing
Summary: If you have limited memory but can afford some extra compute time, checkpointing is the right technique. In short, it significantly reduces memory consumption by recomputing values instead of storing them, making LLM training possible without hardware upgrades.
How it works: The main idea of activation checkpointing is to store a subset of important values while training your model, and recalculate the rest only when necessary. This means that instead of keeping all intermediate data in memory, the system only keeps what's important, freeing up memory space in the process. This is similar to the “cross that bridge when the time comes” principle, meaning that you don't bother with less urgent issues until they require your attention.
Improvement metrics: In most cases, activation checkpointing reduces memory usage by up to 70%, but also extends the training phase by about 15-25%. This fair tradeoff allows companies to train large AI models on existing hardware without investing additional capital in infrastructure. The aforementioned PyTorch library supports checkpointing, making it easier to implement.
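Here is a minimal sketch of the idea using PyTorch's torch.utils.checkpoint; the two-block MLP is a hypothetical stand-in for real transformer layers:

```python
import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedMLP(torch.nn.Module):
    """Toy stand-in for transformer blocks; activations inside each
    checkpointed segment are recomputed during the backward pass."""
    def __init__(self, dim=1024):
        super().__init__()
        self.block1 = torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim), torch.nn.GELU(),
            torch.nn.Linear(4 * dim, dim))
        self.block2 = torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim), torch.nn.GELU(),
            torch.nn.Linear(4 * dim, dim))

    def forward(self, x):
        # Instead of storing each block's intermediate activations,
        # only its inputs are kept; the rest is recomputed on backward
        x = checkpoint(self.block1, x, use_reentrant=False)
        x = checkpoint(self.block2, x, use_reentrant=False)
        return x

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CheckpointedMLP().to(device)
x = torch.randn(32, 1024, requires_grad=True, device=device)
loss = model(x).sum()
loss.backward()  # activations are recomputed here, saving memory
```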
Multi-GPU Training
Summary: Imagine a small bakery that needs to produce a large number of baguettes quickly. With one baker working alone, it will probably take a long time. With a second baker, the process speeds up. Adding a third baker speeds it up even more. Multi-GPU training works in much the same way.
How it works: Instead of using one GPU, we use multiple GPUs simultaneously, distributing the training of the AI model across them so they run in parallel. Logically, this is the opposite of checkpointing, the previous method, which reduced hardware acquisition costs at the expense of longer execution times. Here, we use more hardware, but utilize it to the fullest, reducing execution times and lowering operational costs in return.
Improvement metrics: Below are three robust tools for training LLMs on a multi-GPU setup, listed in ascending order of efficiency based on experimental results; a minimal FSDP sketch follows the list.
- DeepSpeed: A library designed specifically for training AI models across multiple GPUs, which can achieve speeds up to 10x faster than traditional training approaches.
- FSDP: One of the most popular frameworks for PyTorch, which addresses some of DeepSpeed's inherent limitations and provides an additional 15-20% increase in computational efficiency.
- YaFSDP: A recently released enhanced version of FSDP for model training, which provides speedups of 10-25% over the original FSDP methodology.
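For illustration, a minimal sketch of wrapping a model in PyTorch's FSDP is shown below. It assumes a single multi-GPU machine launched with torchrun, and the model and training data are placeholders:

```python
# Launch with: torchrun --nproc_per_node=<num_gpus> train_fsdp.py
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    # torchrun sets RANK / LOCAL_RANK / WORLD_SIZE for each process
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model -- in practice, a transformer
    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096), torch.nn.GELU(),
        torch.nn.Linear(4096, 1024)).cuda()

    # FSDP shards parameters, gradients, and optimizer state
    # across all participating GPUs
    model = FSDP(model)
    # Create the optimizer after wrapping, over the sharded parameters
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):
        inputs = torch.randn(32, 1024, device="cuda")
        loss = model(inputs).pow(2).mean()
        optimizer.zero_grad(set_to_none=True)
        loss.backward()   # gradients are reduced and re-sharded
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```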
Conclusion
By using techniques such as mixed precision training, activation checkpointing, and multi-GPU training, even small and medium-sized businesses can make significant progress in AI, both in fine-tuning and in building their own models. These tools increase computational efficiency, shortening execution times and reducing overall costs. Additionally, they allow larger models to be trained on existing hardware, reducing the need for expensive upgrades. By democratizing access to advanced AI capabilities, these approaches enable a wider range of technology companies to innovate and compete in this rapidly evolving field.
There is a saying that “AI will never replace you, but someone using AI will replace you.” The time to embrace AI is now, and with the strategies above, you can do so even on a budget.
Ksenia Se is the founder of Turing Post.