Introduction

LBM-Bench is a Lattice-Boltzmann Method (LBM) benchmark suite for evaluating the performance of different LBM propagation and collision kernel implementations. It uses a D3Q19 lattice model with a two-relaxation-time (TRT) collision operator and provides multiple compute kernels that systematically vary the propagation pattern, data layout, and optimization strategy.

Dwarf Classification

LBM-Bench implements the Structured Grids dwarf from the Berkeley taxonomy. The Lattice-Boltzmann Method operates on a regular 3D grid where each lattice node stores 19 particle distribution functions (PDFs). Every timestep updates them through a collision step (a purely local computation) and a propagation step (a nearest-neighbor data exchange). This stencil-like access pattern with regular data dependencies is characteristic of the structured grid dwarf. The computational intensity is low relative to the data movement, so LBM kernels are typically memory-bandwidth bound.
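To make the 19 PDFs per node concrete, the following minimal sketch builds the standard D3Q19 velocity set with its textbook lattice weights. The names `VELOCITIES`, `WEIGHTS`, and `OPPOSITE` are illustrative, and the direction ordering is arbitrary; neither is taken from the LBM-Bench sources.

```python
from fractions import Fraction

# D3Q19 velocity set: 1 rest vector, 6 axis-aligned vectors and
# 12 face-diagonal vectors. The 8 cube corners are excluded.
# NOTE: this ordering is an assumption made for illustration only.
VELOCITIES = [
    (x, y, z)
    for x in (-1, 0, 1)
    for y in (-1, 0, 1)
    for z in (-1, 0, 1)
    if abs(x) + abs(y) + abs(z) <= 2
]


def weight(c):
    """Return the standard D3Q19 lattice weight for velocity vector c."""
    n = sum(abs(k) for k in c)
    return {0: Fraction(1, 3), 1: Fraction(1, 18), 2: Fraction(1, 36)}[n]


WEIGHTS = [weight(c) for c in VELOCITIES]

# Index of the direction pointing the opposite way. Propagation moves the
# PDF of direction i to the neighbor at offset VELOCITIES[i]; opposite
# pairs are also what an even/odd (symmetric/antisymmetric) split works on.
OPPOSITE = [VELOCITIES.index((-x, -y, -z)) for (x, y, z) in VELOCITIES]

if __name__ == "__main__":
    print(len(VELOCITIES), "directions, weights sum to", sum(WEIGHTS))
```

Exact fractions are used here only so that the sum of the weights can be checked without rounding; a real kernel would use floating-point constants.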

Key Features

  • Compute kernels (7 variants):
    • Push and pull propagation in both AoS and SoA data layouts
    • Cache-blocked push and pull variants (8x8x8 tiles)
    • AA (alternating access) pattern with halved memory footprint
  • Geometry types: box, channel, pipe, blocks, fully periodic fluid domain
  • Collision operator: Two-relaxation-time (TRT) with even/odd decomposition
  • Boundary conditions: Zou-He velocity and pressure boundaries
  • Parallelism: OpenMP for shared memory
  • Instrumentation: Optional LIKWID marker API for hardware counter profiling
  • Verification mode: Automatic comparison against the analytical channel flow solution using L2 error norms
  • Performance metrics: Runtime, memory bandwidth (GB/s), MFLUP/s (million fluid lattice updates per second)
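The performance metrics listed above follow directly from the run parameters. The sketch below shows the arithmetic; the function names are hypothetical, and the bytes-per-update figure used to convert MFLUP/s to GB/s is an assumption, since the exact accounting LBM-Bench uses for its bandwidth figure is not stated here.

```python
def mflups(fluid_nodes, timesteps, runtime_s):
    """Million fluid lattice updates per second."""
    return fluid_nodes * timesteps / runtime_s / 1.0e6


def bandwidth_gbs(mflups_value, bytes_per_update):
    """Memory bandwidth in GB/s implied by an update rate.

    bytes_per_update is an assumption of the caller, for example
    19 loads + 19 stores of 8-byte PDFs = 304 bytes.
    """
    return mflups_value * 1.0e6 * bytes_per_update / 1.0e9


if __name__ == "__main__":
    # Hypothetical run: 100^3 fluid nodes, 100 timesteps, 2 seconds.
    rate = mflups(100 ** 3, 100, 2.0)
    print(f"{rate:.1f} MFLUP/s")
    print(f"{bandwidth_gbs(rate, 2 * 19 * 8):.1f} GB/s")
```

Only fluid nodes count toward MFLUP/s, so for geometries with obstacles the fluid node count is smaller than the product of the domain dimensions.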

Getting Started

Clone the LBM-Bench repository and follow the build and usage instructions in the README.

Performance Characteristics

LBM kernels are typically memory-bandwidth bound: the collision and propagation steps perform few arithmetic operations relative to the 19 PDFs that must be loaded and stored per lattice node. The choice of propagation pattern (push vs. pull) and data layout (AoS vs. SoA) significantly affects cache utilization and SIMD vectorization. The SoA layout stores the PDFs of one direction contiguously across nodes, which favors vectorization, while the AoS layout stores all 19 PDFs of a node together, which favors spatial locality during collision. The cache-blocked variants traverse the domain in 8x8x8 tiles to improve cache reuse. The AA pattern eliminates the second PDF array by alternating between a step that collides in place and writes the results back with swapped directions, and a step that reads from and writes to neighboring nodes. This halves the memory footprint. Because every PDF that is written has already been read, the AA pattern also avoids write-allocate transfers, which on architectures with write-allocate caches can reduce the data traffic per lattice update by up to one third compared to a two-array kernel, with a corresponding gain in bandwidth-bound performance. Key tuning dimensions include the kernel variant, thread count and affinity, and domain dimensions.
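The layout and traffic arguments in this section can be checked with a small back-of-the-envelope model. This is a sketch under stated assumptions, not the accounting used by LBM-Bench: all function names are hypothetical, PDFs are assumed to be 8-byte doubles, and the two-array kernel is assumed to use regular stores that trigger a write-allocate transfer.

```python
Q = 19  # PDFs per lattice node in D3Q19


def index_aos(node, direction, n_nodes):
    """AoS: the 19 PDFs of one node are adjacent in memory."""
    return node * Q + direction


def index_soa(node, direction, n_nodes):
    """SoA: the PDFs of one direction are adjacent across nodes."""
    return direction * n_nodes + node


def pdf_footprint_bytes(n_nodes, n_arrays, bytes_per_pdf=8):
    """Memory for the PDF arrays (2 arrays for push/pull, 1 for AA)."""
    return n_nodes * Q * n_arrays * bytes_per_pdf


def bytes_per_update(write_allocate, bytes_per_pdf=8):
    """Data traffic per lattice update: load + store, plus an extra
    write-allocate load of the destination when it was not read before."""
    transfers = 3 if write_allocate else 2
    return transfers * Q * bytes_per_pdf


def max_mflups(bandwidth_gbs, bytes_update):
    """Bandwidth-bound upper limit on the update rate."""
    return bandwidth_gbs * 1.0e9 / bytes_update / 1.0e6


if __name__ == "__main__":
    n = 200 ** 3
    # Distance in elements between the same direction on adjacent nodes.
    print("AoS stride:", index_aos(1, 0, n) - index_aos(0, 0, n))
    print("SoA stride:", index_soa(1, 0, n) - index_soa(0, 0, n))
    print("two arrays:", pdf_footprint_bytes(n, 2) / 1e9, "GB")
    print("one array :", pdf_footprint_bytes(n, 1) / 1e9, "GB")
    for wa in (True, False):
        b = bytes_per_update(wa)
        print(b, "B/update ->", round(max_mflups(100.0, b), 1), "MFLUP/s")
```

With a hypothetical 100 GB/s of sustained memory bandwidth, the model gives a ceiling of roughly 219 MFLUP/s at 456 bytes per update and roughly 329 MFLUP/s at 304 bytes per update, a factor of 1.5. A two-array kernel that uses non-temporal stores would also avoid the write-allocate transfer, so the gap depends on the kernel and the architecture.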

Citations

M. Wittmann, V. Haag, T. Zeiser, H. Köstler, and G. Wellein: Lattice Boltzmann Benchmark Kernels as a Testbed for Performance Analysis. Computers & Fluids, Special Issue DSFD2017, 2018. DOI: 10.1016/j.compfluid.2018.03.030

Credits & License

LBM-Bench is developed by the Erlangen National High Performance Computing Center (NHR@FAU) at the University of Erlangen-Nuremberg.

Licensed under the MIT License.