IBM may be a bit of a latecomer to AI acceleration, but it has a loyal base of System z mainframe and Power Systems customers. Many of these customers, who spend millions of dollars on application and database servers, want a more native approach to AI inference and training. That gives IBM an opportunity to make a little money on AI hardware within its own installed base, despite the overwhelming dominance of Nvidia GPUs and Nvidia's AI enterprise software stack across the broader market.
At the Hot Chips 2024 conference, Chris Berry, the distinguished engineer in charge of microprocessor design at IBM, discussed the AI acceleration that will be built into the next generation of mainframe processors. The processor is codenamed “Telum II” and will likely be sold as the z17; it is the successor to the “Telum” processor that was announced at Hot Chips 2021 and shipped in the z16. Berry walked through the details of the Telum II chip, including its on-die DPU, and confirmed that IBM will indeed commercialize the AI Acceleration Unit (AIU) that IBM Research had been developing for years and discussed in more detail in October 2022.
At the time, we advised Big Blue that the AIU should not remain a science project. And apparently it won't, because it is being commercialized as an AI acceleration card for the System z17 mainframe shipping next year. This second-generation AIU, now known as the Spyre accelerator, provides much more oomph for larger AI models than the on-chip AI unit in the Telum and Telum II processors can support. It also keeps AI acceleration within the security perimeter of the mainframe's PCI-Express bus, which matters a great deal to mainframe shops.
The fanaticism about security is understandable, given the workloads mainframes have handled for decades: Big Blue estimates that 70% of all financial transactions on the planet pass through mainframes, including at banks, trading companies, insurance companies, and healthcare companies.
There are reasons for this, and cheapness is not one of them; mainframes are decidedly not cheap. But they are nearly indestructible, extremely stable, high-I/O systems that can run at 98 to 99 percent utilization continuously and achieve 99.999999 percent uptime, which equates to one hour of unplanned downtime every 11,400 years or so. In addition, these customers have home-grown applications, historically written in COBOL but increasingly in Java these days, that they want to modernize on the mainframe using code assistants like those IBM created for its watsonx AI platform, and they want to keep the weights of their AI models and the resulting applications on the mainframe. Again, this is for security reasons. They do not want to offload their core banking software, or the AI models that augment it, to the cloud.
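For those who want to check the math on that eight-nines claim, here is the back-of-the-envelope arithmetic, sketched in Python:

```python
# Sanity check on the eight-nines uptime figure cited above
availability = 0.99999999
downtime_fraction = 1 - availability          # 1e-8 of all wall-clock time
hours_per_year = 365.25 * 24                  # ~8,766 hours

# Hours of operation needed to accumulate one hour of unplanned downtime
hours_for_one_downtime_hour = 1 / downtime_fraction
years = hours_for_one_downtime_hour / hours_per_year
print(f"{years:,.0f} years per hour of downtime")   # ~11,408 years
```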
Telum II z17 processor
The Telum II processor has 43 billion transistors and, like the Telum chip, eight very fat cores doing the transaction-crunching work common in the back-office systems that represented the first wave of computing in data centers. The Telum z16 processor had 22.5 billion transistors in a 530 mm2 die and was implemented in Samsung's 7-nanometer process. With Telum II, IBM has nearly doubled the transistor count to 43 billion and expanded the die to 600 mm2 while shrinking to Samsung's 5-nanometer process (5HPP, to be exact). Many of these extra transistors go to a larger L2 cache, but some are clearly allocated to the on-chip DPU and to a beefed-up on-chip AI accelerator, which is derived from the AIU circuitry.
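For what it is worth, those die sizes and transistor counts imply a healthy jump in transistor density with the move to 5HPP. This is a rough comparison, since the two chips have different mixes of logic, cache, and I/O, but the arithmetic is straightforward:

```python
# Transistor density implied by the counts and die areas cited above
telum_density  = 22.5e9 / 530   # Telum (Samsung 7 nm), transistors per mm2
telum2_density = 43e9 / 600     # Telum II (Samsung 5HPP), transistors per mm2

print(f"Telum:    {telum_density / 1e6:.1f} million transistors per mm2")
print(f"Telum II: {telum2_density / 1e6:.1f} million transistors per mm2")
print(f"Density gain: {telum2_density / telum_density:.2f}x")   # ~1.7x
```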
The DPU sits at the center left of the die, and there is some dead area around it because, in Berry's words, the DPU design came in “a little smaller” than IBM expected. To our eye, the DPU takes up the space of about 1.6 cores rather than the two that IBM had planned for. It appears that IBM realized that by sacrificing two z17 cores and replacing them with an integrated DPU that accelerates and streamlines I/O, it could increase the effective performance of the entire compute complex, even though the core count stayed the same and the clock speed was only pushed up by about 10 percent over the z16 cores, to 5.5 GHz. IPC and other improvements result in a 20 percent performance increase per socket compared to the z16.
Like the z16, the z17 does away with the physical L3 and L4 caches common in previous System z processors, and instead presents portions of the L2 cache as shared L3 or L4 cache, depending on what the software requires. It appears that some of the increase in transistor count in the Telum II processor goes to cache. The Telum II has ten L2 caches, each 36 MB in size; the original Telum chip from three years ago had eight 32 MB caches.
The Telum II has a 360 MB virtual L3 cache, a roughly 40 percent increase over the Telum chip's 256 MB virtual L3 cache, and its 2.88 GB virtual L4 cache is about 40 percent larger than the virtual L4 cache of the first-generation Telum chip. The virtual L4 cache appears to be carved out of the DDR5 main memory in the system, which tops out at 16 TB across a four-socket drawer with two Telum II chips per socket. (That is a 60 percent increase in main memory capacity compared to the z16 drawer.) This main memory hangs off the OpenCAPI memory interface, just as it does on the z16 and Power10 processors, and it will likely use the DDR5 memory just announced as an upgrade for the Power10 machines rather than the DDR4 memory originally used in the z16 and Power10 systems.
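The virtual L3 arithmetic ties out with the L2 slice counts and sizes given above:

```python
# Virtual L3 capacity implied by the physical L2 slices on each chip
telum_l3_mb  = 8 * 32    # Telum:    eight 32 MB L2 slices -> 256 MB virtual L3
telum2_l3_mb = 10 * 36   # Telum II: ten 36 MB L2 slices   -> 360 MB virtual L3

growth = telum2_l3_mb / telum_l3_mb - 1      # ~40 percent
print(f"Virtual L3: {telum_l3_mb} MB -> {telum2_l3_mb} MB (+{growth * 100:.0f}%)")
```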
The z16 and z17 systems scale from one to four processor drawers, topping out at 16 sockets, 32 chips, 256 cores, and 64 TB of main memory. The systems also have up to 192 PCI-Express 5.0 slots housed in a dozen I/O expansion drawers with 16 slots each, which hold disk drives, flash drives, crypto processors, or Spyre accelerator cards for AI.
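Those peak system numbers fall straight out of the drawer math, using the figures cited above:

```python
# Peak configuration of a four-drawer system
drawers, sockets_per_drawer, chips_per_socket, cores_per_chip = 4, 4, 2, 8
mem_per_drawer_tb = 16
io_drawers, slots_per_io_drawer = 12, 16

sockets = drawers * sockets_per_drawer          # 16 sockets
chips   = sockets * chips_per_socket            # 32 Telum II chips
cores   = chips * cores_per_chip                # 256 cores

print(f"{sockets} sockets, {chips} chips, {cores} cores, "
      f"{drawers * mem_per_drawer_tb} TB memory, "
      f"{io_drawers * slots_per_io_drawer} PCIe 5.0 slots")
```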
The data processing unit goes full circle
“Mainframes process massive amounts of data,” Berry explained during his Hot Chips presentation. “A single fully configured IBM z16 can process 25 billion cryptographic transactions every day – more cryptographic transactions in a single day than Google searches, Facebook posts, and tweets combined. This kind of scale requires I/O capabilities far beyond what general-purpose computing systems can provide. It requires custom I/O protocols that minimize latency, support the virtualization of thousands of operating system instances, and can handle tens of thousands of outstanding I/O requests at any given time.”
“So we decided to leverage the DPU to implement these custom I/O protocols. Considering all the communication between the processor and the I/O subsystem, we decided to place the DPU directly on the processor chip, rather than connecting it behind the PCI bus. We gave the DPU its own L2 cache by coherently connecting it to the processor SMP fabric and placing it on the processor side of the PCI interface, enabling coherent communication between the DPU and the main processors that run the key enterprise workloads. We minimized the communication latency, improved performance and power efficiency, and reduced the I/O management power of the entire system by more than 70%.”
The Telum II DPU has four clusters of four cores, each core with 32 KB of L1 data cache and 32 KB of L1 instruction cache. IBM has not disclosed details of these cores, but they could be homegrown Power cores (presumably lightweight ones) or Arm cores; we suspect the former rather than the latter. The DPU connects to one of the 36 MB L2 cache segments, which means there is a spare 36 MB L2 cache that is not tied to any specific core or to the DPU. The DPU also has a pair of PCI-Express 5.0 x16 interfaces that feed the pair of PCI-Express 5.0 controllers on the Telum II die, which in turn link out to the I/O expansion drawers.
The on-chip AI accelerator is located at the bottom left of the chip, and it takes up roughly the same area as one of the z17 cores, albeit flattened out. We presume it is an improved version of the AI accelerator built into the same location on the first-generation Telum chip. According to Berry, IBM added INT8 data types to the FP16 support of the first-generation AI accelerator, and also made the on-chip AI accelerator shareable across all of the Telum II chips in a z17 system via the XBus and Abus NUMA interconnects implemented on the die to create a shared memory system.
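To give a sense of why adding INT8 alongside FP16 matters for inference, here is a generic quantization sketch in NumPy. It is purely illustrative and has nothing to do with IBM's compilers, runtime, or the accelerator's actual programming interface:

```python
import numpy as np

# Generic illustration: INT8 weights take half the footprint of FP16 weights,
# and hardware with native INT8 units can usually push more operations per cycle.
rng = np.random.default_rng(0)
weights_fp16 = rng.standard_normal(1024).astype(np.float16)

# Simple symmetric quantization to INT8
scale = float(np.abs(weights_fp16).max()) / 127.0
weights_int8 = np.clip(np.round(weights_fp16 / scale), -127, 127).astype(np.int8)

# Dequantize to estimate the precision lost in the round trip
dequantized = weights_int8.astype(np.float16) * scale
print("bytes:", weights_fp16.nbytes, "->", weights_int8.nbytes)   # 2048 -> 1024
print("max abs error:", float(np.abs(weights_fp16 - dequantized).max()))
```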
Telum II's on-chip AI accelerator is rated at 24 teraops per second (TOPS). IBM never gave a comparable rating for the Telum AI accelerator from three years ago. With the z17, customers can expect 192 TOPS per drawer and 768 TOPS across a complete system.
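Those drawer and system figures are just the per-chip rating multiplied up:

```python
# On-chip AI throughput scaling from chip to drawer to full system
tops_per_chip    = 24
chips_per_drawer = 8      # four sockets per drawer, two Telum II chips per socket
drawers          = 4

print(tops_per_chip * chips_per_drawer)             # 192 TOPS per drawer
print(tops_per_chip * chips_per_drawer * drawers)   # 768 TOPS per four-drawer system
```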
But that's not all: with Spyre being commercialized, customers will be able to put Spyre accelerators in their I/O drawers to unleash even more AI power.
Spreading Spyre
The Spyre chip is implemented on a low-power version of Samsung's 5-nanometer process (5LPE) and packs 26 billion transistors into an area of 330 mm2.
The original AIU, which IBM ran as a research project and which made headlines two years ago, had 34 cores very similar to the AI accelerator on the Telum processor, with 32 of them exposed as usable for yield purposes, all implemented in a Samsung 5-nanometer process with 23 billion transistors. Spyre appears to be an improved version of this AIU chip, with 32 usable cores and 2 MB of scratchpad memory per core.
Below is a block diagram of the Spyre core.
The Spyre chip has a 32-byte bidirectional ring connecting the 32 cores (34 cores to be exact, but only 32 are active), and another 128-byte ring connecting the scratchpad memory that comes with the cores. The cores support INT4, INT8, FP8, and FP16 data types.
The Spyre accelerator card looks like this:
The Spyre card has 128 GB of LPDDR5 memory arranged in eight banks, much more than the 48 GB on the original AIU card, and it delivers over 300 TOPS (presumably at FP16 resolution) within a 75 watt power envelope. The card plugs into a PCI-Express 5.0 x16 slot, and the LPDDR5 memory hangs off the scratchpad memory ring, providing 200 GB/s of memory bandwidth to that ring.
Clustering eight Spyre cards together in an I/O drawer, the maximum IBM recommends, creates a virtual Spyre card with 1 TB of memory and 1.6 TB/s of memory bandwidth to run AI models, for a total of over 3 petaops of performance (likely at FP16 resolution). Ten such drawers would give you 10 TB of memory and 16 TB/s of aggregate bandwidth for 30 petaops of AI oomph.
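The memory and bandwidth figures for a cluster of Spyre cards multiply up as follows:

```python
# Aggregate memory and bandwidth of clustered Spyre cards
cards_per_drawer = 8
mem_per_card_gb  = 128            # LPDDR5 per card
bw_per_card_gbs  = 200            # bandwidth from each card's memory to its ring

per_drawer_mem_tb = cards_per_drawer * mem_per_card_gb / 1024    # 1 TB per drawer
per_drawer_bw_tbs = cards_per_drawer * bw_per_card_gbs / 1000    # 1.6 TB/s per drawer

drawers = 10
print(f"{drawers} drawers: {drawers * per_drawer_mem_tb:.0f} TB of memory, "
      f"{drawers * per_drawer_bw_tbs:.0f} TB/s of aggregate bandwidth")
```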
That much capacity is enough for IBM mainframe shops to run serious AI within their applications and databases, and to do it inside the security perimeter of the z17 complex.
The Spyre cards are expected to ship next year, likely around the same time as the z17 mainframes ship; Berry didn't give an exact date but said they're currently in technology preview, meaning select customers can get them now.
One final thing: there's nothing about the accelerators in the Telum II processor or the Spyre accelerator that specifically ties them to the IBM System z mainframe. In fact, the same approach that enables programming tools and compilers to see both on-chip AI accelerators and external Spyre accelerators as if they were native instructions on the z17 processor can also be implemented, for example, on IBM's future Power11 processor, which is also due to ship next year.
Come to think of it, why wasn't IBM's Power11 processor announced at Hot Chips 2024 this week? Given that it's due to be released next year, we were hoping to see some new information about the Power11 at the conference, as has been customary in the past. Perhaps we'll hear more at the ISSCC 2025 conference early next year. The Power11 chip is expected to be released in the middle of next year.