This project demonstrates the performance impact of different data layouts on CPU cache utilization through a particle physics simulation implemented in C++.
Hypothesis: Data layout significantly impacts CPU cache behavior. Specifically, organizing data as a Structure of Arrays (SoA) should show measurably better cache performance than Array of Structures (AoS) when operations only access a subset of fields.
This benchmark is designed to validate this hypothesis using CodSpeed's walltime instrument, which provides hardware performance counters including cache hit/miss rates, memory bandwidth, and IPC (instructions per cycle).
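Both layouts build on a small 3D vector type declared in include/vec3.h. Its exact definition is not shown here; a minimal sketch consistent with the 12-byte size assumed below might look like this (the real header may differ):

```cpp
// Hypothetical sketch of the Vec3 helper in include/vec3.h.
struct Vec3 {
    float x, y, z;  // 3 * 4 bytes = 12 bytes, 4-byte alignment, no padding

    Vec3 operator+(const Vec3& other) const { return {x + other.x, y + other.y, z + other.z}; }
    Vec3 operator*(float s) const { return {x * s, y * s, z * s}; }
    float length_squared() const { return x * x + y * y + z * z; }
};
```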
```cpp
struct Particle {
    Vec3 position;   // 12 bytes
    Vec3 velocity;   // 12 bytes
    float mass;      // 4 bytes
};                   // = 28 bytes per particle (40 with padding)

std::vector<Particle> particles;
```

Memory layout: `[pos0, vel0, mass0, pos1, vel1, mass1, pos2, vel2, mass2, ...]`
When we update positions (reading only position and velocity), every cache line we load also contains mass data and padding we never touch, wasting bandwidth and cache space.
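A hedged sketch of an AoS position update makes this concrete (it assumes the Particle struct and Vec3 sketch above; the project's actual implementation in src/aos.cpp may differ):

```cpp
#include <vector>

// Hypothetical AoS position update: only position and velocity are read,
// but each iteration strides over a full Particle, so the unused mass
// (and any padding) is dragged through the cache as well.
void update_positions_aos(std::vector<Particle>& particles, float dt) {
    for (Particle& p : particles) {
        p.position = p.position + p.velocity * dt;
    }
}
```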
```cpp
class ParticleSystem {
public:
    std::vector<Vec3> positions;
    std::vector<Vec3> velocities;
    std::vector<float> masses;
};
```

Memory layout:

```
positions:  [pos0, pos1, pos2, ...]
velocities: [vel0, vel1, vel2, ...]
masses:     [mass0, mass1, mass2, ...]
```
When we update positions, every byte in the cache line is useful data, maximizing cache efficiency.
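The equivalent SoA loop, again as a hypothetical sketch rather than the project's exact code, streams linearly through the positions and velocities arrays and never touches masses:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical SoA position update using the ParticleSystem above:
// the loop reads two tightly packed arrays, so every byte brought
// into cache is actually used.
void update_positions_soa(ParticleSystem& ps, float dt) {
    for (std::size_t i = 0; i < ps.positions.size(); ++i) {
        ps.positions[i] = ps.positions[i] + ps.velocities[i] * dt;
    }
}
```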
Expected behavior for AoS:

- Higher L1/L2/L3 cache miss rates
- Lower memory bandwidth utilization
- More stalls waiting for memory

Expected behavior for SoA:

- Lower cache miss rates (better spatial locality)
- Higher effective memory bandwidth
- Better prefetcher efficiency
- CMake 3.12 or higher
- C++17 compatible compiler
- Build tools (make, gcc/clang)
```bash
# Create build directory
mkdir build && cd build

# Configure with CodSpeed simulation mode for profiling
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCODSPEED_MODE=simulation ..

# Build
make -j$(nproc)

# Run benchmarks
./particle_simulation_bench
```

CodSpeed supports different modes via the `CODSPEED_MODE` CMake option:

- `off` (default): Disables CodSpeed instrumentation
- `simulation`: Runs benchmarks on a simulated CPU with hardware counters
- `walltime`: For walltime reports
Example:

```bash
# Build for walltime profiling
cmake -DCODSPEED_MODE=walltime ..
make
```

When comparing the AoS and SoA versions with CodSpeed's walltime instrument, you should see:
- Cache Misses: SoA should show significantly fewer L1/L2/L3 cache misses
- Memory Operations: Better cache line utilization in SoA version
- Instructions Per Cycle (IPC): Higher IPC in SoA due to fewer memory stalls
- Wall Time: SoA should be faster, especially with larger datasets
Each version implements three operations (hedged SoA sketches of the last two follow this list):

- `update_positions`: `position = position + velocity * dt`. Tests spatial locality when accessing two arrays.
- `compute_kinetic_energy`: `sum(0.5 * mass * velocity²)`. Tests cache behavior when skipping position data.
- `apply_gravity`: `velocity = velocity + gravity * dt`. Tests cache behavior when accessing only one field.
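The two operations not sketched earlier might look roughly like this in the SoA layout (names, signatures, and the `length_squared` helper are illustrative assumptions, not the project's exact API):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical kinetic-energy reduction: reads velocities and masses,
// never touching the positions array.
float compute_kinetic_energy_soa(const ParticleSystem& ps) {
    float energy = 0.0f;
    for (std::size_t i = 0; i < ps.velocities.size(); ++i) {
        energy += 0.5f * ps.masses[i] * ps.velocities[i].length_squared();
    }
    return energy;
}

// Hypothetical gravity kick: touches only the velocities array.
void apply_gravity_soa(ParticleSystem& ps, Vec3 gravity, float dt) {
    for (Vec3& v : ps.velocities) {
        v = v + gravity * dt;
    }
}
```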
The benchmarks use 1,000,000 particles:
- AoS: ~40 MB (28 bytes per particle + padding)
- SoA: ~32 MB (separate arrays for positions, velocities, masses)
This size is large enough to exceed L3 cache on most systems, making cache efficiency differences clearly visible.
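For orientation, a Google Benchmark registration for one of these cases could look like the following sketch (the actual benchmarks/particle_simulation.cpp may differ; `make_random_particles` is a hypothetical setup helper, and `update_positions_soa` refers to the sketch above):

```cpp
#include <benchmark/benchmark.h>

static void BM_UpdatePositions_SoA(benchmark::State& state) {
    // Hypothetical helper that fills a ParticleSystem with 1,000,000 particles.
    ParticleSystem ps = make_random_particles(1'000'000);
    for (auto _ : state) {
        update_positions_soa(ps, 0.016f);
        // Prevent the compiler from optimizing the update away.
        benchmark::DoNotOptimize(ps.positions.data());
    }
}
BENCHMARK(BM_UpdatePositions_SoA);

BENCHMARK_MAIN();
```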
```
cache-patterns-example/
├── include/
│   ├── vec3.h                   # 3D vector utility
│   ├── aos.h                    # Array of Structures header
│   └── soa.h                    # Structure of Arrays header
├── src/
│   ├── aos.cpp                  # AoS implementation
│   └── soa.cpp                  # SoA implementation
├── benchmarks/
│   └── particle_simulation.cpp  # Google Benchmark benchmarks
├── CMakeLists.txt               # CMake build configuration
└── .github/workflows/
    └── codspeed.yml             # CI workflow for CodSpeed
```
The project includes a GitHub Actions workflow that automatically runs benchmarks on every push and pull request using CodSpeed. The workflow:
- Builds the project with CMake in RelWithDebInfo mode
- Compiles with CODSPEED_MODE=simulation for hardware counter profiling
- Runs benchmarks and reports performance metrics
- Detects performance regressions automatically
View the workflow configuration in `.github/workflows/codspeed.yml`.