Editor's note: This post is part of our “AI Decoded” series, which demystifies AI by making the technology more accessible and introducing new hardware, software, tools, and acceleration for RTX PC users.
Large language models (LLMs), with their ability to rapidly understand, summarize, and generate text-based content, are driving some of the most exciting developments in AI.
These capabilities power a wide range of use cases, including productivity tools, digital assistants, and non-playable characters in video games. But LLMs are not a one-size-fits-all solution, and developers often need to customize a model to fit the needs of their application.
The NVIDIA RTX AI Toolkit makes it easy to fine-tune and deploy AI models on RTX AI PCs and workstations through a technique called Low-Rank Adaptation (LoRA). A new update released today enables the simultaneous use of multiple LoRA adapters within the NVIDIA TensorRT-LLM AI acceleration library, improving performance of fine-tuned models by up to 6x.
Optimize performance
To achieve higher performance and meet increasing user demands, LLMs need to be carefully customized.
These foundational models are trained on vast amounts of data, but often lack the context needed for a developer's specific use case: a typical LLM can generate video game dialogue, for example, but it may lack the nuance and subtlety needed to write in the style of a forest elf with a dark past and a barely hidden disdain for authority.
To achieve a more customized output, developers can fine-tune the model with information relevant to their app's use case.
Take the example of developing an app that uses an LLM to generate in-game dialogue. The fine-tuning process starts with the weights of a pretrained model, such as information about what characters might say in the game. To get dialogue in the right style, the developer can then tune the model on a smaller example dataset, such as dialogue written in a spookier or more villainous tone.
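As a rough illustration, here is how such a LoRA fine-tune might be set up with the Hugging Face PEFT library. The model name, rank, and target modules below are illustrative choices, not RTX AI Toolkit defaults, and the training loop itself is omitted.

```python
# Minimal sketch of LoRA fine-tuning with Hugging Face PEFT, assuming a small
# dataset of stylized in-game dialogue. Model ID, rank, and target modules are
# illustrative choices only.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

lora_config = LoraConfig(
    r=64,                                  # adapter rank
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)  # base weights stay frozen

# ...train on the small, stylized dialogue dataset with a standard
# causal-LM objective, then save just the lightweight adapter weights:
model.save_pretrained("lora-villain-dialogue")
```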
Sometimes developers want to use several of these customizations at the same time. For example, they might need to generate marketing copy written in different voices for different content channels, while also summarizing a document and suggesting style edits, or drafting scene descriptions for a video game alongside image prompts for a text-to-image generator.
Running multiple models simultaneously is not practical because they cannot all fit into the GPU memory at the same time, and even if they did, the inference time would be dominated by memory bandwidth – how quickly data can be loaded from memory to the GPU.
Enter low-rank adaptation
A common way to address these issues is a fine-tuning technique called low-rank adaptation (LoRA). A LoRA adapter can be thought of simply as a patch file that contains the customizations made through the fine-tuning process.
Once trained, customized LoRA adapters seamlessly integrate with the underlying model during inference, adding minimal overhead. Developers can plug adapters into a single model to serve multiple use cases, keeping the memory footprint low while providing the additional detail required for each specific use case.
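A back-of-the-envelope sketch in plain NumPy shows why these adapters stay so small: the base weight matrix is left untouched, and the adapter stores only two low-rank factors that are applied on top of it at inference. The shapes and scaling below follow the standard LoRA formulation, with arbitrary illustrative numbers.

```python
import numpy as np

# Illustrative sketch of a LoRA update: the frozen base weight W is combined
# with two small, trained low-rank factors A and B at inference time.
d_out, d_in, rank, alpha = 4096, 4096, 64, 16

W = np.random.randn(d_out, d_in)   # frozen base weight (unchanged by fine-tuning)
A = np.random.randn(rank, d_in)    # low-rank factor learned during fine-tuning
B = np.random.randn(d_out, rank)   # low-rank factor learned during fine-tuning

x = np.random.randn(d_in)          # one activation vector
y = W @ x + (alpha / rank) * (B @ (A @ x))

# The adapter stores only A and B: 2 * 64 * 4096 values versus the
# 4096 * 4096 values of a fully fine-tuned copy of W -- roughly 3% of the size.
```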
An architectural overview of how multi-LoRA capabilities can be used to support multiple clients and use cases in a single base model
In practice, this means that an app can use multiple LoRA adapters in parallel with numerous customizations while keeping only one copy of the base model in memory.
This process is called multi-LoRA serving. When multiple calls are made to a model, the GPU can process all the calls in parallel, maximizing Tensor Core usage and minimizing memory and bandwidth demands, allowing developers to efficiently use AI models in their workflow. Models fine-tuned using multi-LoRA adapters can run up to 6x faster.
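Conceptually, multi-LoRA serving looks something like the sketch below: a single copy of the base weights is shared, and each request in a batch picks its own small adapter. TensorRT-LLM implements this with fused, batched GPU kernels; this NumPy version only illustrates the idea, and the adapter names are made up.

```python
import numpy as np

d, rank = 1024, 8
W = np.random.randn(d, d)                       # one shared copy of the base weights

adapters = {                                    # hypothetical per-use-case adapters
    "villain_dialogue": (np.random.randn(d, rank), np.random.randn(rank, d)),
    "sd_prompts":       (np.random.randn(d, rank), np.random.randn(rank, d)),
}

# Each request carries an adapter ID alongside its activation vector.
batch = [
    ("villain_dialogue", np.random.randn(d)),
    ("sd_prompts",       np.random.randn(d)),
]

outputs = []
for adapter_id, x in batch:
    B, A = adapters[adapter_id]
    # The expensive base projection W @ x is identical for every request;
    # only the cheap low-rank term differs per adapter.
    outputs.append(W @ x + B @ (A @ x))
```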
LLM inference performance on a GeForce RTX 4090 desktop GPU, measured on Llama 3 8B int4 with LoRA adapters applied at runtime. Input sequence length is 1,000 tokens and output sequence length is 100 tokens. The maximum rank of the LoRA adapters is 64.
In the in-game dialogue application example mentioned above, multi-LoRA serving can be used to extend the scope of the app, generating both story elements and illustrations in a single prompt.
Users input a basic story idea, and the LLM fleshes out the concept into a detailed foundation. The application then refines the story and generates corresponding images using the same model powered by two different LoRA adapters. One adapter generates Stable Diffusion prompts for creating visuals with a locally deployed Stable Diffusion XL model, while the other, fine-tuned for story writing, produces well-structured and compelling narratives.
In this case, the same model is used for both inference passes, so the process requires little additional memory. The second pass, which involves generating both text and images, is performed using batched inference, making the process exceptionally fast and efficient on NVIDIA GPUs. This lets users quickly iterate on different versions of a story and easily refine both the narrative and the illustrations.
This process is explained in more detail in a recent tech blog.
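As a rough sketch of that two-pass flow, the pseudocode below routes two prompts through the same base model with different adapters. The generate() helper and adapter names are placeholders standing in for whatever multi-LoRA-capable runtime the app uses, not a real API.

```python
def generate(prompt: str, lora_adapter: str) -> str:
    """Placeholder for a multi-LoRA-capable inference call. A real app would
    route this through the shared base model with the named adapter applied."""
    return f"[{lora_adapter}] output for: {prompt}"

story_idea = "A forest elf with a dark past reluctantly guards a cursed grove."

# Pass 1: the story-writing adapter expands the idea into a structured narrative.
story = generate(story_idea, lora_adapter="story-writer")

# Pass 2: the prompt-writing adapter turns that narrative into Stable Diffusion
# prompts; both passes hit the same base model, and requests can be batched.
sd_prompts = generate(f"Write image prompts for: {story}",
                      lora_adapter="sd-prompt-writer")

# The prompts would then be sent to a locally deployed Stable Diffusion XL
# model to render the illustrations.
```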
LLMs are becoming one of the most important components of modern AI. As adoption and integration continue to grow, so will the demand for powerful, fast LLMs with application-specific customizations. With the multi-LoRA support added to the RTX AI Toolkit today, developers have a powerful new way to accelerate these capabilities.