OpenAI to serve ChatGPT on Cerebras' AI dinner plates

"By integrating Cerebras' wafer-scale compute architecture into its inference pipeline, OpenAI can take advantage of the chip's massive SRAM capacity to speed up inference. Each of the chip startup's WSE-3 accelerators measures in at 46,225 mm2 and is equipped with 44 GB of SRAM. Compared to the HBM found on modern GPUs, SRAM is several orders of magnitude faster. While a single Nvidia Rubin GPU can deliver around 22 TB/s of memory bandwidth, Cerebras' chips achieve nearly 1,000x that at 21 Petabytes a second."
"All that bandwidth translates into extremely fast inference performance. Running models like OpenAI's gpt-oss 120B, Cerebras' chips can purportedly achieve single user performance of 3,098 tokens a second as compared to 885 tok/s for competitor Together AI, which uses Nvidia GPUs. In the age of reasoning models and AI agents, faster inference means models can "think" for longer without compromising on interactivity."
""Integrating Cerebras into our mix of compute solutions is all about making our AI respond much faster. When you ask a hard question, generate code, create an image, or run an AI agent, there is a loop happening behind the scenes: you send a request, the model thinks, and it sends something back," OpenAI explained in a recent blog post. "When AI responds in real time, users do more with it, stay longer, and run higher-value workloads.""
OpenAI will deploy 750 megawatts of Cerebras wafer-scale accelerators through 2028 to bolster its inference services. The deal is valued at more than $10 billion and leaves Cerebras carrying the risk of building and leasing the datacenters that will serve OpenAI. Cerebras' WSE-3 accelerators each measure 46,225 mm² and carry 44 GB of on-chip SRAM, delivering roughly 21 PB/s of memory bandwidth versus about 22 TB/s for a single Nvidia Rubin GPU. That bandwidth translates into much faster single-user inference: 3,098 tok/s on gpt-oss 120B versus 885 tok/s for Together AI's Nvidia-based service. The trade-off is that SRAM is far less space-efficient than HBM, so per-chip memory capacity remains modest, which creates constraints for very large models.
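The gap between the two platforms is easy to sanity-check from the figures quoted above. Below is a minimal Python sketch using only the numbers reported in the article; the two-second interactivity budget at the end is an illustrative assumption, not a figure from either company.

```python
# Back-of-envelope check of the figures quoted in the article.

CEREBRAS_BW_PBPS = 21    # WSE-3 memory bandwidth, petabytes/second (per article)
RUBIN_BW_TBPS = 22       # single Nvidia Rubin GPU, terabytes/second (per article)

CEREBRAS_TOKS = 3098     # gpt-oss 120B, single-user tokens/second on Cerebras
TOGETHER_TOKS = 885      # Together AI's Nvidia-based figure

bandwidth_ratio = (CEREBRAS_BW_PBPS * 1000) / RUBIN_BW_TBPS
throughput_ratio = CEREBRAS_TOKS / TOGETHER_TOKS

print(f"Bandwidth gap:  {bandwidth_ratio:,.0f}x")   # ~955x, i.e. "nearly 1,000x"
print(f"Throughput gap: {throughput_ratio:.1f}x")   # ~3.5x single-user speedup

# Illustrative only: how many "thinking" tokens fit inside an assumed
# two-second interactive window at each throughput.
BUDGET_S = 2.0
print(f"Cerebras: {CEREBRAS_TOKS * BUDGET_S:,.0f} tokens in {BUDGET_S}s")
print(f"GPU:      {TOGETHER_TOKS * BUDGET_S:,.0f} tokens in {BUDGET_S}s")
```

The point of the arithmetic is the one the article makes: a roughly 3.5x single-user speedup lets a reasoning model spend several times as many tokens "thinking" before the user notices any added latency.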
Read at The Register