Thread Block Clusters
The CUDA programming model has long comprised Threads, Thread Blocks, and Grids. The Hopper architecture adds another level to this hierarchy: Thread Block Clusters, which sit between Grids and Thread Blocks and give programmers increased control over data locality.
Thread Block Clusters extend the capabilities of, among other features, the CUDA Cooperative Groups API. Thread Blocks within a Cluster are guaranteed to be scheduled concurrently, enabling finer-grained cooperation across Thread Blocks running on different SMs. Going even further, Nvidia introduces a specialized SM-to-SM interconnect network that lets SMs in a Cluster access each other's shared memory directly instead of going through Global Memory.
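As a minimal sketch of how this looks in code (assuming CUDA 12+, compute capability 9.0, and a hypothetical device buffer `out`), the Cooperative Groups `cluster_group` handle exposes cluster-wide synchronization and the distributed shared memory described above: each block writes its own shared memory, the cluster synchronizes, and then each block reads its peer's shared memory over the SM-to-SM network.

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Kernel compiled with a fixed cluster size of 2 Thread Blocks.
__global__ void __cluster_dims__(2, 1, 1) cluster_kernel(int *out) {
    cg::cluster_group cluster = cg::this_cluster();
    __shared__ int smem[1];

    smem[0] = blockIdx.x;   // each block writes its own shared memory
    cluster.sync();         // all blocks in the cluster have written

    // Map the *other* block's shared memory into this block's address
    // space; the load travels over the SM-to-SM interconnect rather
    // than through Global Memory.
    unsigned peer = cluster.block_rank() ^ 1;
    int *peer_smem = cluster.map_shared_rank(smem, peer);
    if (threadIdx.x == 0)
        out[cluster.block_rank()] = peer_smem[0];

    cluster.sync();         // keep smem alive until the peer finishes reading
}
```

The trailing `cluster.sync()` is required: without it, a block could exit and invalidate its shared memory while its peer is still reading it remotely. Running this kernel requires a Hopper-class GPU; the Grid size must be a multiple of the cluster size (here, 2 blocks).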