Tenstorrent has unveiled its next-generation Wormhole processor for AI workloads that promises to offer decent performance at a low price. The company currently offers two add-on PCIe cards carrying one or two Wormhole processors as well as TT-LoudBox, and TT-QuietBox workstations aimed at software developers. The whole of today's release is aimed at developers rather than those who will deploy the Wormhole boards for their commercial workloads.
“It is always rewarding to get more of our products into developer hands. Releasing development systems with our Wormhole™ card helps developers scale up and work on multi-chip AI software.” said Jim Keller, CEO of Tenstorrent. “In addition to this launch, we are excited that the tape-out and power-on for our second generation, Blackhole, is going very well.”
Each Wormhole processor packs 72 Tensix cores (featuring five RISC-V cores supporting various data formats) with 108 MB of SRAM to deliver 262 FP8 TFLOPS at 1 GHz at 160W thermal design power. A single-chip Wormhole n150 card carries 12 GB of GDDR6 memory featuring a 288 GB/s bandwidth.
Wormhole processors offer flexible scalability to meet the varying needs of workloads. In a standard workstation setup with four Wormhole n300 cards, the processors can merge to function as a single unit, appearing as a unified, extensive network of Tensix cores to the software. This configuration allows the accelerators to either work on the same workload, be divided among four developers or run up to eight distinct AI models simultaneously. A crucial feature of this scalability is that it operates natively without the need for virtualization. In data center environments, Wormhole processors will scale both inside one machine using PCIe or outside of a single machine using Ethernet.
From performance standpoint, Tenstorrent's single-chip Wormhole n150 card (72 Tensix cores at 1 GHz, 108 MB SRAM, 12 GB GDDR6 at 288 GB/s) is capable of 262 FP8 TFLOPS at 160W, whereas the dual-chip Wormhole n300 board (128 Tensix cores at 1 GHz, 192 MB SRAM, aggregated 24 GB GDDR6 at 576 GB/s) can offer up to 466 FP8 TFLOPS at 300W (according to Tom's Hardware).
To put that 466 FP8 TFLOPS at 300W number into context, let's compare it to what AI market leader Nvidia has to offer at this thermal design power. Nvidia's A100 does not support FP8, but it does support INT8 and its peak performance is 624 TOPS (1,248 TOPS with sparsity). By contrast, Nvidia's H100 supports FP8 and its peak performance is massive 1,670 TFLOPS (3,341 TFLOPS with sparsity) at 300W, which is a big difference from Tenstorrent's Wormhole n300.
There is a big catch though. Tenstorrent's Wormhole n150 is offered for $999, whereas n300 is available for $1,399. By contrast, one Nvidia H100 card can retail for $30,000, depending on quantities. Of course, we do not know whether four or eight Wormhole processors can indeed deliver the performance of a single H300, though they will do so at 600W or 1200W TDP, respectively.
In addition to cards, Tenstorrent offers developers pre-built workstations with four n300 cards inside the less expensive Xeon-based TT-LoudBox with active cooling and a premium EPYC-powered TT-QuietBox with liquid cooling.
Sources: Tenstorrent, Tom's Hardware
AIG.Skill on Tuesday introduced its ultra-low-latency DDR5-6400 memory modules that feature a CAS latency of 30 clocks, which appears to be the industry's most aggressive timings yet for DDR5-6400 sticks. The modules will be available for both AMD and Intel CPU-based systems.
With every new generation of DDR memory comes an increase in data transfer rates and an extension of relative latencies. While for the vast majority of applications, the increased bandwidth offsets the performance impact of higher timings, there are applications that favor low latencies. However, shrinking latencies is sometimes harder than increasing data transfer rates, which is why low-latency modules are rare.
Nonetheless, G.Skill has apparently managed to cherry-pick enough DDR5 memory chips and build appropriate printed circuit boards to produce DDR5-6400 modules with CL30 timings, which are substantially lower than the CL46 timings recommended by JEDEC for this speed bin. This means that while JEDEC-standard modules have an absolute latency of 14.375 ns, G.Skill's modules can boast a latency of just 9.375 ns – an approximately 35% decrease.
G.Skill's DDR5-6400 CL30 39-39-102 modules have a capacity of 16 GB and will be available in 32 GB dual-channel kits, though the company does not disclose voltages, which are likely considerably higher than those standardized by JEDEC.
The company plans to make its DDR5-6400 modules available both for AMD systems with EXPO profiles (Trident Z5 Neo RGB and Trident Z5 Royal Neo) and for Intel-powered PCs with XMP 3.0 profiles (Trident Z5 RGB and Trident Z5 Royal). For AMD AM5 systems that have a practical limitation of 6000 MT/s – 6400 MT/s for DDR5 memory (as this is roughly as fast as AMD's Infinity Fabric can operate at with a 1:1 ratio), the new modules will be particularly beneficial for AMD's Ryzen 7000 and Ryzen 9000-series processors.
G.Skill notes that since its modules are non-standard, they will not work with all systems but will operate on high-end motherboards with properly cooled CPUs.
The new ultra-low-latency memory kits will be available worldwide from G.Skill's partners starting in late August 2024. The company did not disclose the pricing of these modules, but since we are talking about premium products that boast unique specifications, they are likely to be priced accordingly.
MemoryMicrochip recently announced the availability of their second PCIe Gen 5 enterprise SSD controller - the Flashtec 5016. Like the 4016, this is also a 16-channel controller, but there are some key updates:
Microchip's enterprise SSD controllers provide a high level of flexibility to SSD vendors by providing them with significant horsepower and accelerators. The 5016 includes Cortex-A53 cores for SSD vendors to run custom applications relevant to SSD management. However, compared to the Gen4 controllers, there are two additional cores in the CPU cluster. The DRAM subsystem includes ECC support (both out-of-band and inline, as desired by the SSD vendor).
At FMS 2024, the company demonstrated an application of the neural network engines embedded in the Gen5 controllers. Controllers usually employ a 'read-retry' operation with altered read-out voltages for flash reads that do not complete successfully. Microchip implemented a machine learning approach to determine the read-out voltage based on the health history of the NAND block using the NN engines in the controller. This approach delivers tangible benefits for read latency and power consumption (thanks to a smaller number of errors on the first read).
The 4016 and 5016 come with a single-chip root of trust implementation for hardware security. A secure boot process with dual-signature authentication ensures that the controller firmware is not maliciously altered in the field. The company also brought out the advantages of their controller's implementation of SR-IOV, flexible data placement, and zoned namespaces along with their 'credit engine' scheme for multi-tenant cloud workloads. These aspects were also brought out in other demonstrations.
Microchip's press release included quotes from the usual NAND vendors - Solidigm, Kioxia, and Micron. On the customer front, Longsys has been using Flashtec controllers in their enterprise offerings along with YMTC NAND. It is likely that this collaboration will continue further using the new 5016 controller.
Storage
0 Comments