The Transformer architecture powers today's most popular public and proprietary AI models. So what comes next? Is this the architecture that will deliver better inference, or is something else on the horizon? Building intelligence into models today requires vast amounts of data, GPU compute and rare talent, which makes models expensive to build and maintain.
The adoption of AI began with making simple chatbots more intelligent. Now, startups and large enterprises have figured out how to package intelligence in the form of copilots that augment human knowledge and skills. The natural next step is to package multi-step workflows, memory and personalization in the form of agents that can solve use cases across multiple functions, including sales and engineering. The hope is that from a simple user prompt, an agent can classify the intent, break the goal into steps and complete the task, whether that involves internet searches, authenticating to multiple tools or learning from past repetitive behavior.
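To make that agent pattern concrete, here is a minimal sketch of the loop: classify intent, plan steps, then execute each step with a matching tool. The keyword classifier and the tool functions are hypothetical stand-ins (a real agent would call an LLM and real APIs), so treat this as shape, not implementation.

```python
# Toy agent loop: classify intent -> plan steps -> execute tools.
# The classifier and tools below are hypothetical stand-ins.

def classify_intent(prompt: str) -> str:
    # A real agent would ask an LLM; keyword matching keeps the sketch runnable.
    text = prompt.lower()
    if "trip" in text or "book" in text:
        return "travel_booking"
    if "order" in text:
        return "food_order"
    return "general"

def plan_steps(intent: str) -> list:
    plans = {
        "travel_booking": ["search_flights", "search_hotels", "confirm_with_user"],
        "food_order": ["find_restaurant", "place_order"],
        "general": ["web_search"],
    }
    return plans[intent]

TOOLS = {  # each "tool" is a stub returning a canned result
    "search_flights": lambda: "found 3 flight options",
    "search_hotels": lambda: "found 5 hotels",
    "confirm_with_user": lambda: "awaiting user confirmation",
    "find_restaurant": lambda: "found your favorite restaurant",
    "place_order": lambda: "order placed",
    "web_search": lambda: "search results",
}

def run_agent(prompt: str) -> None:
    intent = classify_intent(prompt)
    for step in plan_steps(intent):
        print(f"[{intent}] {step}: {TOOLS[step]()}")

run_agent("Book a trip to Hawaii")
```

The hard parts in production are, of course, exactly the pieces stubbed out here: reliable intent classification, secure authentication to each tool and recovery when a step fails.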
When we apply these agents to consumer use cases, we see a future where everyone has a personal agent like Jarvis on their phone that understands them. Want to book a trip to Hawaii, order food at your favorite restaurant, or manage your personal finances? A future where you can use a personalized agent to securely manage these tasks is possible, but from a technology perspective, we are still a long way from that future.
Is the Transformer architecture the final frontier?
The self-attention mechanism in the Transformer architecture allows the model to weigh the importance of each input token against every other token in the input sequence simultaneously. This improves the model's understanding in language and computer vision by capturing long-range dependencies and complex token relationships. However, long sequences (DNA, for example) drive up computational complexity, resulting in slow performance and high memory consumption. Solutions and research approaches to the long-sequence problem include:
Improving the Transformer on hardware: A promising technique here is FlashAttention. The FlashAttention paper shows that Transformer performance can be improved by carefully managing reads and writes across the different levels of fast and slow memory on the GPU. It makes the attention algorithm IO-aware, reducing the number of reads and writes between the GPU's high-bandwidth memory (HBM) and its on-chip static random access memory (SRAM).

Approximate attention: The self-attention mechanism has O(n^2) complexity, where n is the length of the input sequence. Can this quadratic computational cost be reduced to linear or near-linear, allowing the Transformer to handle longer sequences? Optimizations here include techniques such as Reformer, Performer and Skyformer. A sketch of both ideas follows below.
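To make the two ideas above concrete, here is a minimal NumPy sketch (not the actual FlashAttention kernel, which is a fused GPU kernel): naive attention materializes the full n-by-n score matrix, while the tiled version processes keys and values block by block with running softmax statistics, so peak memory no longer grows quadratically with sequence length.

```python
import numpy as np

def naive_attention(Q, K, V):
    # Materializes the full (n, n) score matrix: O(n^2) time and memory.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def tiled_attention(Q, K, V, block=64):
    # FlashAttention-style idea, greatly simplified: visit K/V in blocks and
    # keep running softmax statistics, so no (n, n) matrix is ever stored.
    n, d = Q.shape
    out = np.zeros_like(Q)
    row_max = np.full(n, -np.inf)   # running max of scores per query row
    row_sum = np.zeros(n)           # running softmax denominator per row
    for start in range(0, n, block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = Q @ Kb.T / np.sqrt(d)                      # only (n, block) at a time
        new_max = np.maximum(row_max, s.max(axis=-1))
        rescale = np.exp(row_max - new_max)            # correct earlier blocks
        p = np.exp(s - new_max[:, None])
        out = out * rescale[:, None] + p @ Vb
        row_sum = row_sum * rescale + p.sum(axis=-1)
        row_max = new_max
    return out / row_sum[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 32)) for _ in range(3))
assert np.allclose(naive_attention(Q, K, V), tiled_attention(Q, K, V), atol=1e-6)
```

The real FlashAttention applies this blocking inside a single GPU kernel so the working set stays in SRAM; approximate-attention methods like Performer go further and change the math itself to avoid the quadratic term.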
In addition to these optimizations to reduce the Transformer's complexity, several alternative models are challenging the Transformer's dominance (although most are still in their infancy).
State-space models: These are a class of models, related to recurrent (RNN) and convolutional (CNN) neural networks, that compute with linear or near-linear complexity in sequence length. State-space models (SSMs) like Mamba can handle long-range relationships well but still lag behind Transformers in overall quality.
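The core of an SSM is a fixed-size state updated once per token, which is where the linear scaling comes from. Below is a minimal, illustrative recurrence; real SSMs such as Mamba use carefully structured (and, in Mamba's case, input-dependent) parameters, so the random matrices here exist purely to show the shape.

```python
import numpy as np

# Linear state-space recurrence at the heart of SSMs:
#   h_t = A @ h_{t-1} + B @ x_t
#   y_t = C @ h_t
# One pass over the sequence is O(n) in sequence length, versus the O(n^2)
# pairwise token comparisons of self-attention.

def ssm_scan(x, A, B, C):
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                 # one update per token
        h = A @ h + B @ x_t       # fold the new token into a fixed-size state
        ys.append(C @ h)          # read the output from the state
    return np.stack(ys)

rng = np.random.default_rng(0)
seq_len, in_dim, state_dim, out_dim = 1024, 8, 16, 8
A = 0.9 * np.eye(state_dim)                        # stable toy dynamics
B = 0.1 * rng.standard_normal((state_dim, in_dim))
C = 0.1 * rng.standard_normal((out_dim, state_dim))
y = ssm_scan(rng.standard_normal((seq_len, in_dim)), A, B, C)
print(y.shape)  # (1024, 8)
```

The flip side of compressing the whole history into one fixed-size state is exactly the quality gap noted above: there is no direct look-back at individual tokens the way attention provides.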
These research approaches have now left university labs, and new models built on them are publicly available for anyone to try. The latest model releases also signal the state of the underlying technology and the viable paths toward alternatives to the Transformer.
Featured model releases
The latest and greatest models continue to arrive from the regular contributors: OpenAI, Cohere, Anthropic, Mistral and others. Meta's foundational models for compiler optimization are notable for their effectiveness at code and compiler optimization tasks.
Beyond the mainstream Transformer architecture, production-grade state-space models (SSMs), hybrid SSM-Transformer models, mixture-of-experts (MoE) models and composition-of-experts (CoE) models are now emerging, and they appear to outperform state-of-the-art open-source models on multiple benchmarks.
Databricks' open-source DBRX model: This MoE model has 132B parameters, with 16 experts of which only four are active at any one time during inference or training. It supports a 32K context window, and the model was trained on 12T tokens. Other interesting details: pre-training, post-training, evaluation, red-teaming and model refinement took three months, $10M and 3,072 Nvidia GPUs connected over 3.2Tbps InfiniBand. (A toy sketch of this kind of expert routing follows after this list.)

SambaNova Systems' Samba CoE v0.2: This CoE model consists of five 7B-parameter experts, only one of which is active during inference. The experts are all open-source models, and alongside them the model has a router that determines which expert is best suited to a given query and routes the request to that model. It is very fast, generating 330 tokens per second.

AI21 Labs' Jamba: A hybrid Transformer-Mamba MoE model, and the first production-grade Mamba-based model with elements of the traditional Transformer architecture. "The Transformer model has two shortcomings. First, its high memory and computing requirements prevent it from handling long contexts, making the key-value (KV) cache size a limiting factor. Second, since there is no single summary state, each generated token performs calculations across the entire context, slowing down inference and reducing throughput." SSMs such as Mamba handle long-range relationships better but lag behind Transformers in quality; Jamba compensates for the inherent limitations of a pure SSM, providing a 256K context window and fitting 140K of context on a single GPU.
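Here is a toy sketch of the MoE routing pattern described for DBRX: a router scores all 16 experts for each token, and only the top four run, so per-token compute is a fraction of the total parameter count. The linear router and tiny matrix "experts" are illustrative assumptions, not DBRX's actual implementation; a CoE router like Samba's works similarly but picks a single whole model per query.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, top_k, d_model = 16, 4, 64   # DBRX-style: 4 of 16 experts active

router_w = 0.02 * rng.standard_normal((d_model, n_experts))
experts = [0.02 * rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

def moe_layer(token):
    logits = token @ router_w                    # one router score per expert
    chosen = np.argsort(logits)[-top_k:]         # indices of the top-4 experts
    gates = np.exp(logits[chosen] - logits[chosen].max())
    gates /= gates.sum()                         # softmax over the chosen experts
    # Only the selected experts execute; the other 12 cost nothing for this token.
    return sum(g * (token @ experts[i]) for g, i in zip(gates, chosen))

out = moe_layer(rng.standard_normal(d_model))
print(out.shape)  # (64,)
```

This is why a 132B-parameter MoE can serve tokens at roughly the cost of a much smaller dense model: the parameters are plentiful, but only a routed subset is touched per token.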
Challenges for corporate adoption
While there is great excitement about the latest research and the models vying to be the next frontier beyond the Transformer, we must also consider the technical challenges preventing companies from realizing their benefits.
Lack of enterprise features: Imagine selling to a CXO without basic capabilities like role-based access control (RBAC), single sign-on (SSO) or access to logs (both prompts and outputs). Today's models may not be enterprise-ready, but enterprises are creating budgets to make sure they don't miss out on the next big thing.

Breaking what used to work: AI copilots and agents complicate the job of securing data and applications. Consider a simple case: the video conferencing app you use every day introduces an AI summarization feature. As a user, you might love getting a transcript after each meeting, but in a highly regulated industry this feature can suddenly become a nightmare for your CISO. In effect, something that previously worked fine is now broken and must undergo additional security review. Enterprises need guardrails that ensure data privacy and compliance when such features land in SaaS apps.

The constant RAG vs. fine-tuning battle: It is possible to deploy both together, or neither, without sacrificing much. Retrieval-augmented generation (RAG) can be thought of as the way to ensure facts are presented correctly and information stays current, while fine-tuning delivers the best model quality. Fine-tuning is hard, which leads some model vendors to discourage it, and it brings challenges such as overfitting, which hurts model quality. Fine-tuning is under pressure from multiple sides: as model context windows expand and token costs fall, RAG may become the better deployment option for enterprises. In the RAG context, the recently released Command R+ model from Cohere is the first open-weights model to beat GPT-4 in the Chatbot Arena. Command R+ is a state-of-the-art RAG-optimized model designed to power enterprise-grade workflows. (A minimal sketch of the RAG shape follows after this list.)
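For readers newer to RAG, here is a minimal sketch of its shape: embed documents, retrieve the ones closest to a query, and prepend them to the prompt so the model answers from current facts. The bag-of-words "embedding" and the toy documents are stand-ins for a real embedding model and a real corpus.

```python
import numpy as np

DOCS = [  # hypothetical corpus standing in for enterprise documents
    "Our refund policy allows returns within 30 days.",
    "Support hours are 9am to 5pm Eastern, Monday through Friday.",
    "Enterprise plans include SSO and role-based access control.",
]

def embed(text, vocab):
    # Bag-of-words vector, unit-normalized; a stand-in for a real embedding model.
    v = np.zeros(len(vocab))
    for w in text.lower().split():
        if w in vocab:
            v[vocab[w]] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

vocab = {w: i for i, w in enumerate(sorted({w for d in DOCS for w in d.lower().split()}))}
doc_vecs = np.array([embed(d, vocab) for d in DOCS])

def build_prompt(query, k=2):
    sims = doc_vecs @ embed(query, vocab)            # cosine similarity (unit vectors)
    context = "\n".join(DOCS[i] for i in np.argsort(sims)[-k:][::-1])
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("Do enterprise plans include SSO and access control?"))
```

The retrieved context, not the model's weights, carries the facts, which is exactly why RAG sidesteps the staleness and overfitting concerns that make fine-tuning hard.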
I recently spoke with an AI leader at a major financial institution who claimed that the future doesn't belong to software engineers, but to creative English/Arts students who can create effective prompts. There may be some truth in this comment; with a quick sketch and a multimodal model, even non-technical people can create simple applications without too much effort. Knowing how to use such tools is a superpower and can be useful to anyone looking to succeed in their career.
The same is true for researchers, practitioners and founders. Today there are multiple architectures to choose from in the effort to make the underlying models cheaper, faster and more accurate, and there are many ways to adapt a model to a specific use case, including fine-tuning techniques and newer breakthroughs such as Direct Preference Optimization (DPO), an algorithm that can be considered an alternative to Reinforcement Learning from Human Feedback (RLHF).
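For the curious, the DPO objective itself fits in a few lines. The sketch below computes the standard DPO loss from sequence log-probabilities under the policy being trained and a frozen reference model; the numeric values are made up purely to exercise the formula.

```python
import numpy as np

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    # DPO: push the policy to prefer the chosen completion over the rejected
    # one by a larger margin than the reference model does.
    # loss = -log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r)))
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -np.log(1.0 / (1.0 + np.exp(-beta * margin)))

# Hypothetical log-probs: the policy favors the chosen answer more than the
# reference does, so the loss is small.
print(dpo_loss(policy_chosen=-12.0, policy_rejected=-20.0,
               ref_chosen=-14.0, ref_rejected=-15.0))   # ~0.40
```

Unlike RLHF, there is no separate reward model or reinforcement-learning loop: preference pairs are optimized directly with this supervised loss.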
There's a lot of rapid change happening in the field of generative AI, and it can feel overwhelming for founders and buyers alike to prioritize. I'm excited to see what those building something new come up with next.
Ashish Kakran is a Principal at Thomvest Ventures focused on investing in early stage Cloud, Data/ML and Cybersecurity startups.