@pitrou
Member

@pitrou pitrou commented Jan 19, 2026

Rationale for this change

Counting the set bits in a null bitmap is an operation that comes up often, so it is useful to get a more precise idea of its performance.

What changes are included in this PR?

  1. Add a benchmark for CountSetBits.
  2. Hand-unroll its inner loop for better performance, as the compiler may otherwise not respect the nested-loop unrolling hint.

Local results (AMD Zen 2):

------------------------------------------------------------------------------
Benchmark                    Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------
CountSetBits/16           5.27 ns         5.27 ns    133267114 bytes_per_second=2.82991Gi/s
CountSetBits/1024         35.1 ns         35.1 ns     19960309 bytes_per_second=27.178Gi/s
CountSetBits/131072       3703 ns         3702 ns       184743 bytes_per_second=32.9698Gi/s

Local results (Intel(R) Core(TM) Ultra 7 255H):

------------------------------------------------------------------------------
Benchmark                    Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------
CountSetBits/16           2.45 ns         2.45 ns    285392946 bytes_per_second=6.08012Gi/s
CountSetBits/1024         28.9 ns         28.9 ns     23618777 bytes_per_second=33.0086Gi/s
CountSetBits/131072       3490 ns         3489 ns       198472 bytes_per_second=34.9862Gi/s

Are these changes tested?

By running said benchmark manually (and by Continuous Benchmarking).

Are there any user-facing changes?

No.

@pitrou
Member Author

pitrou commented Jan 19, 2026

@wgtmac @zanmato1984

@pitrou
Member Author

pitrou commented Jan 19, 2026

Also @AntoinePrv FYI

@github-actions github-actions bot added the awaiting committer review label and removed the awaiting review label Jan 20, 2026
@raulcd
Member

raulcd commented Jan 21, 2026

@ursabot please benchmark

@pitrou
Member Author

pitrou commented Jan 21, 2026

Hmm, it seems performance is behind the expected theoretical throughput.

From Agner Fog's instruction tables, I see that AMD Zen 2 should be able to sustain 4 POPCNT operations/cycle (reciprocal throughput = 0.25), i.e. 32 bytes/cycle on 64-bit ints.
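To make the arithmetic explicit, here is a small sketch of the two quantities being compared. The function names are hypothetical, and the clock frequency is an input the reader has to supply (it is not stated in this thread). With a reciprocal throughput of 0.25 cycles per POPCNT and 8 bytes per operand, the peak is 8 / 0.25 = 32 bytes/cycle.

```cpp
#include <cassert>

// Peak bytes per cycle for an instruction with the given reciprocal
// throughput (cycles per instruction) and operand width in bytes.
inline double PeakBytesPerCycle(double reciprocal_throughput, int bytes_per_op) {
  assert(reciprocal_throughput > 0.0);
  return bytes_per_op / reciprocal_throughput;
}

// Converts a measured throughput in GiB/s into bytes per cycle.
// clock_ghz is an assumption supplied by the caller: the sustained core
// frequency during the benchmark run.
inline double ObservedBytesPerCycle(double gib_per_second, double clock_ghz) {
  assert(clock_ghz > 0.0);
  const double bytes_per_second = gib_per_second * 1024.0 * 1024.0 * 1024.0;
  const double cycles_per_second = clock_ghz * 1e9;
  return bytes_per_second / cycles_per_second;
}
```

Comparing `ObservedBytesPerCycle` for a benchmark's `bytes_per_second` counter against `PeakBytesPerCycle(0.25, 8)` gives a rough idea of how far a run is from the POPCNT limit, ignoring load bandwidth and loop overhead.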

@pitrou pitrou force-pushed the gh48897-countsetbits-benchmark branch from 08383d7 to 9921e9d Compare January 21, 2026 09:22
@pitrou
Member Author

pitrou commented Jan 21, 2026

Ok, the nested for-loop is un-nested by gcc 15.2.0...

@pitrou pitrou force-pushed the gh48897-countsetbits-benchmark branch from 9921e9d to d0f45cf Compare January 21, 2026 09:43
@pitrou pitrou changed the title from "GH-48897: [C++] Add benchmark for CountSetBits" to "GH-48897: [C++] Benchmark and optimize CountSetBits" Jan 21, 2026
@pitrou
Member Author

pitrou commented Jan 21, 2026

Updated benchmark numbers after I hand-unrolled the loop.

@pitrou
Member Author

pitrou commented Jan 21, 2026

@github-actions crossbow submit -g cpp

Contributor

@zanmato1984 zanmato1984 left a comment

+1

@github-actions

Revision: d0f45cf

Submitted crossbow builds: ursacomputing/crossbow @ actions-cdbe33a753

Task Status
example-cpp-minimal-build-static GitHub Actions
example-cpp-minimal-build-static-system-dependency GitHub Actions
example-cpp-tutorial GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-debian-experimental-cpp-gcc-15 GitHub Actions
test-fedora-42-cpp GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-bundled GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-bundled-offline GitHub Actions
test-ubuntu-24.04-cpp-gcc-13-bundled GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions
test-ubuntu-24.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-24.04-cpp-thread-sanitizer GitHub Actions

@pitrou pitrou merged commit ed35594 into apache:main Jan 21, 2026
54 of 55 checks passed
@pitrou pitrou removed the awaiting committer review label Jan 21, 2026
@pitrou pitrou deleted the gh48897-countsetbits-benchmark branch January 21, 2026 10:35
@rok
Member

rok commented Jan 21, 2026

@ursabot please benchmark

@pitrou
Member Author

pitrou commented Jan 21, 2026

@rok I have deleted the branch, so I'm not sure that can work?

@rok
Member

rok commented Jan 21, 2026

I see the event on Kubernetes, but the GitHub API token had expired so it couldn't post back.
Edit: you're probably right about the closed PR though.

@rok
Member

rok commented Jan 21, 2026

Trying on #48907

@conbench-apache-arrow

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit ed35594.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 10 possible false positives for unstable benchmarks that are known to sometimes produce them.
