The Bandwidth Benchmark

Introduction

TheBandwidthBenchmark is a collection of simple streaming kernels for measuring the maximum sustained main memory bandwidth of CPU and GPU systems. It also supports sweeping over the complete memory hierarchy using sequential or parallel throughput execution modes.

Dwarf Classification

TheBandwidthBenchmark implements the Streaming Kernels pattern, closely related to the Dense Linear Algebra dwarf from the Berkeley taxonomy. The kernels perform unit-stride accesses over contiguous arrays with minimal arithmetic, making them purely memory-bandwidth bound. The benchmark is inspired by John McCalpin’s STREAM benchmark and extends it with additional kernels covering different load/store ratios and write-allocate behavior.
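To illustrate why such kernels end up bandwidth bound, the sketch below compares the arithmetic intensity of a triad-style loop with the ridge point of a machine. The triad form `a[i] = b[i] + s * c[i]` follows the STREAM definition. The helper names and the peak and bandwidth figures are hypothetical placeholders chosen for illustration, not measurements or part of the benchmark.

```python
# Arithmetic intensity of a STREAM-style triad: a[i] = b[i] + s * c[i]
# Assumption: double precision, i.e. 8 bytes per array element.
BYTES_PER_DOUBLE = 8


def arithmetic_intensity(flops_per_iter, words_per_iter):
    """Flops performed per byte moved between memory and the core."""
    return flops_per_iter / (words_per_iter * BYTES_PER_DOUBLE)


def is_bandwidth_bound(intensity, peak_gflops, bandwidth_gbs):
    """A loop is bandwidth bound when its intensity lies below the ridge point."""
    ridge_point = peak_gflops / bandwidth_gbs  # flops per byte
    return intensity < ridge_point


# Triad: 2 flops (multiply + add), 2 loads + 1 store = 3 words per iteration.
triad_intensity = arithmetic_intensity(flops_per_iter=2, words_per_iter=3)

# Hypothetical machine: 1000 GFlop/s peak, 200 GB/s memory bandwidth.
print(f"triad intensity: {triad_intensity:.3f} flop/byte")
print("bandwidth bound:", is_bandwidth_bound(triad_intensity, 1000.0, 200.0))
```

With roughly 0.08 flop per byte, the triad sits far below the ridge point of any plausible machine, so its runtime is governed by data transfer rather than by arithmetic.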

Key Features

Streaming kernels: init, sum, copy, update, triad, daxpy, Schoenauer triad, Schoenauer daxpy – covering different load/store ratios, each with and without write-allocate
Execution modes: worksharing (default), plus sequential sweep and throughput sweep for profiling the full memory hierarchy
Parallelism: OpenMP for CPU, CUDA for NVIDIA GPUs
GPU tuning: Configurable thread block size and blocks per SM
Visualization: Gnuplot scripts for bandwidth vs. array size plots
Validation: Results are checked for correctness after each run
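The load/store and write-allocate combinations named in the kernel list can be tabulated as in the sketch below. It relies on the usual definitions of these kernels (for example copy as `a[i] = b[i]` and daxpy as `a[i] = a[i] + s * b[i]`). The loop bodies in the comments, the dictionary keys, and the helper `bytes_per_iteration` are assumptions made for illustration; the repository README is the authoritative source for the exact kernels and for the byte counts the benchmark reports.

```python
# Per-iteration traffic in array elements: (loads, stores, write_allocate).
# Assumed definitions. A write-allocate is expected whenever the stored
# array is not also read in the same iteration.
KERNELS = {
    "init":             (0, 1, True),   # a[i] = s
    "sum":              (1, 0, False),  # s += a[i]
    "copy":             (1, 1, True),   # a[i] = b[i]
    "update":           (1, 1, False),  # a[i] = a[i] * s
    "triad":            (2, 1, True),   # a[i] = b[i] + s * c[i]
    "daxpy":            (2, 1, False),  # a[i] = a[i] + s * b[i]
    "schoenauer_triad": (3, 1, True),   # a[i] = b[i] + c[i] * d[i]
    "schoenauer_daxpy": (3, 1, False),  # a[i] = a[i] + b[i] * c[i]
}


def bytes_per_iteration(kernel, streaming_stores=False, word_bytes=8):
    """Bytes moved per loop iteration, including write-allocate traffic."""
    loads, stores, write_allocate = KERNELS[kernel]
    # A write-allocate adds one hidden load of the destination cache line,
    # unless streaming (non-temporal) stores bypass the cache.
    hidden_loads = 1 if (write_allocate and not streaming_stores) else 0
    return (loads + stores + hidden_loads) * word_bytes


for name in KERNELS:
    regular = bytes_per_iteration(name)
    streaming = bytes_per_iteration(name, streaming_stores=True)
    print(f"{name:17s} {regular:3d} B/iter ({streaming} B with streaming stores)")
```

If a benchmark counts only the explicit loads and stores, a copy kernel that reports about two thirds of the update kernel's bandwidth is consistent with write-allocate traffic, whereas similar values for both suggest that streaming stores are in effect.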

Getting Started

Clone the TheBandwidthBenchmark repository and follow the build and usage instructions in the README. A collection of results is available on the wiki.

Performance Characteristics

All kernels are memory-bandwidth bound by design. The benchmark measures the sustained bandwidth achievable for different load-to-store ratios and helps identify whether the compiler generates streaming (non-temporal) stores. The sequential and throughput sweep modes reveal the bandwidth of each level of the memory hierarchy (L1, L2, L3, main memory). Varying the thread count shows how bandwidth scales across the cores within a memory domain, which is an important system characteristic. Key tuning dimensions are the problem size (which must be large enough to amortize OpenMP overhead), thread affinity, and whether streaming stores are enabled.
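The relationship between working-set size, memory level, and reported bandwidth can be sketched as follows. This is an illustration only: the function names, the cache capacities, and the factor-of-four size progression are hypothetical and do not describe how the benchmark chooses or reports its sweep points.

```python
def bandwidth_gbs(n_elements, bytes_per_iter, seconds):
    """Sustained bandwidth in GB/s for one kernel pass over n_elements."""
    return n_elements * bytes_per_iter / seconds / 1e9


def sweep_sizes(min_bytes, max_bytes, factor=2):
    """Yield geometrically growing working-set sizes, inclusive of both ends."""
    size = min_bytes
    while size <= max_bytes:
        yield size
        size *= factor


def memory_level(working_set_bytes, cache_sizes):
    """Name of the smallest level whose capacity holds the working set."""
    for name, capacity in cache_sizes:
        if working_set_bytes <= capacity:
            return name
    return "main memory"


KIB, MIB = 1024, 1024 * 1024

# Hypothetical per-core cache capacities, ordered from smallest to largest.
CACHES = [("L1", 32 * KIB), ("L2", 1 * MIB), ("L3", 32 * MIB)]

for size in sweep_sizes(16 * KIB, 256 * MIB, factor=4):
    print(f"{size // KIB:8d} KiB -> {memory_level(size, CACHES)}")

# Example: 1e9 iterations moving 24 bytes each in one second.
print(bandwidth_gbs(10**9, 24, 1.0), "GB/s")
```

Only sweep points whose working set clearly exceeds the last cache level reflect main memory bandwidth; smaller points measure the cache that holds the data.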

Citations

TODO

Credits & License

TheBandwidthBenchmark is developed by the Erlangen National High Performance Computing Center (NHR@FAU) at the University of Erlangen-Nuremberg.

Licensed under the MIT License.