Tensor Memory Accelerator
The newly added Tensor Memory Accelerator (TMA) enables asynchronous transfers of multidimensional blocks of data between global and shared memory. An elected thread within a thread group is responsible for interacting with the TMA: it passes along a Copy Descriptor containing the information the TMA needs to correctly transfer a multidimensional block of data, or tensor. The remaining threads are free to execute other instructions while the TMA operation is underway.
Fourth-Generation Tensor Cores
Fourth-generation tensor cores further improve upon the efficiency of the previous generation. NVIDIA has also added support for an 8-bit floating-point datatype, FP8, in two flavors: E4M3 and E5M2. The number following the E and the number following the M give the number of exponent bits and mantissa bits, respectively, so E5M2 offers a wider dynamic range while E4M3 offers more precision. Few general-purpose computations fit naturally within the range and precision of FP8. Where FP8 is sufficient, however, one can expect substantial performance improvements over, e.g., FP16. NVIDIA expects their new DGX SuperPOD to deliver 1 exaFLOPS of sparse FP8 compute.
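To make the E/M naming concrete, the following sketch decodes FP8 bit patterns and reports the smallest and largest positive finite value of each flavor. The exponent bias of 2^(e-1) - 1 and the special-value rules (E5M2 following IEEE 754 conventions, E4M3 having no infinities and reserving only the all-ones pattern for NaN) are assumptions taken from the publicly described FP8 formats, not from the text above. The function names are hypothetical.

```python
import math

def decode_fp8(byte, exp_bits, man_bits):
    """Decode one 8-bit pattern laid out as sign | exponent | mantissa."""
    bias = (1 << (exp_bits - 1)) - 1           # assumed bias: 7 for E4M3, 15 for E5M2
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    exp_max = (1 << exp_bits) - 1
    man_max = (1 << man_bits) - 1

    if exp_bits == 5 and exp == exp_max:
        # Assumed E5M2 rule: IEEE-style infinities and NaNs
        return sign * math.inf if man == 0 else math.nan
    if exp_bits == 4 and exp == exp_max and man == man_max:
        # Assumed E4M3 rule: no infinities, a single NaN mantissa pattern
        return math.nan
    if exp == 0:
        # Subnormal numbers: no implicit leading one
        return sign * man * 2.0 ** (1 - bias - man_bits)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)

def finite_range(exp_bits, man_bits):
    """Smallest and largest positive finite value of a format."""
    values = [decode_fp8(b, exp_bits, man_bits) for b in range(128)]
    finite = [v for v in values if math.isfinite(v) and v > 0]
    return min(finite), max(finite)

print("E4M3:", finite_range(4, 3))
print("E5M2:", finite_range(5, 2))
```

Under these assumptions E4M3 tops out at 448, whereas E5M2 reaches 57344 at the cost of one mantissa bit, which is the range-versus-precision trade-off described above.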
DPX Instructions \cite{bloga}
Dynamic programming (DP) applies to problems with optimal substructure, that is, problems where optimal solutions to subproblems can be combined into an optimal solution to the problem itself. A simple example comes from the Fibonacci numbers. The n-th Fibonacci number is the sum of the two preceding Fibonacci numbers, so it can be found by recursively solving subproblems. Furthermore, these subproblems overlap: the same smaller Fibonacci numbers are needed by several larger ones. Other DP algorithms include Dijkstra's shortest path, Floyd-Warshall all-pairs shortest path and Smith-Waterman for sequence alignment.
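The overlap between Fibonacci subproblems can be made visible by counting how often a naive recursive implementation solves each one. The snippet below is a minimal illustration; `fib_naive` and the `calls` counter are names introduced here for the example.

```python
from collections import Counter

def fib_naive(n, calls):
    """Naive recursion; `calls` records how often each subproblem is solved."""
    calls[n] += 1
    if n < 2:
        return n
    return fib_naive(n - 1, calls) + fib_naive(n - 2, calls)

calls = Counter()
result = fib_naive(20, calls)

print(result)               # 6765, the 20th Fibonacci number
print(len(calls))           # 21 distinct subproblems (n = 0 .. 20)
print(sum(calls.values()))  # 21891 recursive calls in total
```

Only 21 distinct subproblems exist for n = 20, yet the naive recursion performs 21891 calls, because the same subproblems are solved over and over.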
DP problems benefit from two strategies: tabulation (building a solution bottom-up) and memoization (top-down). Both store the results of subproblems so that recomputation is avoided. The new DPX instruction set aims to speed up dynamic programming with specialized instructions that presumably exploit the characteristics of DP problems.
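Both strategies can be sketched for the Fibonacci numbers as follows. This is an illustrative CPU-side example only and says nothing about how the DPX instructions themselves are implemented; the function names are chosen for this example.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib_memo(n):
    """Memoization (top-down): recurse as usual, but cache every result."""
    if n < 2:
        return n
    return fib_memo(n - 1) + fib_memo(n - 2)

def fib_tab(n):
    """Tabulation (bottom-up): fill a table from the base cases up to n."""
    table = [0, 1] + [0] * max(0, n - 1)
    for i in range(2, n + 1):
        table[i] = table[i - 1] + table[i - 2]
    return table[n]

print(fib_memo(30), fib_tab(30))  # 832040 832040
```

Each subproblem is now solved exactly once in either variant, so the running time grows linearly in n rather than exponentially.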