@adriangb adriangb commented Jan 21, 2026

Motivation

I wanted this to debug #19858 and thought this would be helpful. Opening a PR in case others think this would be useful. I do think we should be cautious of "config overload" so I'm not going to push to merge this unless others think they'd actually use this outside of the context of debugging.

Summary

  • Adds hash_join_map_pushdown config option to control hash table reference pushdown
  • Adds hash_join_bounds_pushdown config option to control min/max bounds pushdown

Details

This PR adds two new configuration options to independently control hash join dynamic filter pushdown behavior:

  • hash_join_map_pushdown (default: true): Controls whether to push down hash table references for membership checks when InList thresholds are exceeded. When false, no membership filter is created if the build side is too large for InList pushdown.

  • hash_join_bounds_pushdown (default: true): Controls whether to push down min/max bounds for join key columns. When false, only membership filters (InList or Map) are pushed down.

This enables flexible combinations:

  • InList only (no bounds, no map)
  • Bounds only (no InList, no map)
  • Map only (no bounds)
  • InList + bounds (no map)
  • Any combination thereof

The primary use case is allowing users to use InList + bounds without the map fallback, which can be beneficial for certain workloads where the hash table reference overhead outweighs its benefits.
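To make the combinations above concrete, here is a minimal sketch of how the two flags could resolve into the parts of a dynamic filter. It is a simplified model, not the code in this PR: `HashJoinPushdownOptions`, `Membership`, `FilterParts`, `resolve_filter_parts`, and the `fits_in_list` input are all hypothetical names introduced for illustration.

```rust
/// Hypothetical mirror of the two options described above (both default to true).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct HashJoinPushdownOptions {
    pub hash_join_map_pushdown: bool,
    pub hash_join_bounds_pushdown: bool,
}

impl Default for HashJoinPushdownOptions {
    fn default() -> Self {
        Self {
            hash_join_map_pushdown: true,
            hash_join_bounds_pushdown: true,
        }
    }
}

/// Which membership filter, if any, is pushed to the probe side.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Membership {
    InList,
    Map,
    None,
}

/// The parts of the dynamic filter that end up being pushed down.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct FilterParts {
    pub membership: Membership,
    pub bounds: bool,
}

/// Resolve the flags into filter parts.
/// `fits_in_list` is an assumed input: true when the build side is
/// within the InList thresholds.
pub fn resolve_filter_parts(
    opts: HashJoinPushdownOptions,
    fits_in_list: bool,
) -> FilterParts {
    let membership = if fits_in_list {
        // Small build side: InList is used regardless of the map flag.
        Membership::InList
    } else if opts.hash_join_map_pushdown {
        // Large build side: fall back to the hash table reference.
        Membership::Map
    } else {
        // Large build side with the map fallback off: no membership filter.
        Membership::None
    };
    FilterParts {
        membership,
        bounds: opts.hash_join_bounds_pushdown,
    }
}
```

Under this model, the "InList + bounds without the map fallback" case corresponds to `hash_join_map_pushdown: false` with bounds left on: a large build side then yields bounds only.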

Test plan

  • Existing hash_join tests pass (782 tests)
  • CI validation

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings January 21, 2026 19:58
@github-actions github-actions bot added common Related to common crate physical-plan Changes to the physical-plan crate labels Jan 21, 2026

Copilot AI left a comment

Pull request overview

This PR adds two new configuration options to control hash join dynamic filter pushdown behavior independently. The hash_join_map_pushdown config controls whether to use hash table references for membership checks when InList thresholds are exceeded, while hash_join_bounds_pushdown controls whether to push down min/max bounds for join key columns. This provides fine-grained control over the types of dynamic filters used in hash joins.

Changes:

  • Added hash_join_map_pushdown boolean config option (default: true) to control hash table reference pushdown
  • Added hash_join_bounds_pushdown boolean config option (default: true) to control min/max bounds pushdown
  • Introduced PushdownStrategy::Disabled enum variant to represent when map pushdown is disabled
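As an illustration of the third change, the following sketch shows where a `Disabled` variant could fit in the strategy selection. The enum here is a simplified stand-in, not the definition from the PR: the payload types, the `choose_strategy` function, and the `max_in_list_len` threshold are assumptions made for the example.

```rust
/// Simplified, hypothetical stand-in for the strategy enum mentioned above.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum PushdownStrategy {
    /// Build side is small enough: push the distinct key values.
    InList(Vec<i64>),
    /// Build side is too large: push a hash table reference
    /// (represented here by a unit variant).
    Map,
    /// Build side is too large and map pushdown is off:
    /// no membership filter is created.
    Disabled,
}

/// Pick a strategy from the build-side keys.
/// `max_in_list_len` is an assumed threshold on the number of distinct keys.
pub fn choose_strategy(
    build_keys: &[i64],
    max_in_list_len: usize,
    map_pushdown: bool,
) -> PushdownStrategy {
    // Deduplicate so the threshold applies to distinct values.
    let mut distinct = build_keys.to_vec();
    distinct.sort_unstable();
    distinct.dedup();

    if distinct.len() <= max_in_list_len {
        PushdownStrategy::InList(distinct)
    } else if map_pushdown {
        PushdownStrategy::Map
    } else {
        PushdownStrategy::Disabled
    }
}
```

A separate variant, rather than reusing an "empty" state, keeps "nothing to filter on" distinguishable from "filtering was turned off" at the call site.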

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| datafusion/common/src/config.rs | Adds two new configuration options for controlling hash join dynamic filter pushdown behavior |
| datafusion/physical-plan/src/joins/hash_join/exec.rs | Reads the new config options and applies them when determining pushdown strategy and initializing SharedBuildAccumulator |
| datafusion/physical-plan/src/joins/hash_join/shared_bounds.rs | Adds bounds_pushdown_enabled field to SharedBuildAccumulator, introduces Disabled pushdown strategy variant, and conditionally creates bounds predicates based on config |
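
The conditional creation of bounds predicates described for shared_bounds.rs can be pictured with a small sketch. This is not the accumulator code: `build_filter` is a hypothetical helper that renders predicates as strings, and only the `bounds_pushdown_enabled` name is taken from the description above.

```rust
/// Assemble a filter expression (rendered as text) from optional parts.
/// Hypothetical helper: the bounds term is emitted only when
/// `bounds_pushdown_enabled` is true, and the membership term only when
/// one was produced.
pub fn build_filter(
    column: &str,
    bounds: Option<(i64, i64)>,
    membership: Option<&str>,
    bounds_pushdown_enabled: bool,
) -> Option<String> {
    let mut parts: Vec<String> = Vec::new();

    if bounds_pushdown_enabled {
        if let Some((min, max)) = bounds {
            parts.push(format!("{0} >= {1} AND {0} <= {2}", column, min, max));
        }
    }

    if let Some(m) = membership {
        parts.push(m.to_string());
    }

    if parts.is_empty() {
        // Nothing to push down: neither bounds nor membership.
        None
    } else {
        Some(parts.join(" AND "))
    }
}
```

With both parts disabled or absent the helper returns `None`, which matches the case where no filter is pushed down at all.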

Comment on lines +1118 to +1120
/// When true, pushes down hash table references for membership checks in hash joins
/// when the build side is too large for InList pushdown.
/// When false, no membership filter is created when InList thresholds are exceeded.
Copilot AI Jan 21, 2026

The documentation states "When false, no membership filter is created when InList thresholds are exceeded." This could be clearer. Consider rephrasing to: "When false, no hash table reference (Map) is used for membership checks. If the build side exceeds InList thresholds, no membership filter is created (only bounds, if enabled)." This makes it clearer that InList filters can still be created when the build side is small enough, and that this config specifically controls the Map fallback behavior.

Suggested change
- /// When true, pushes down hash table references for membership checks in hash joins
- /// when the build side is too large for InList pushdown.
- /// When false, no membership filter is created when InList thresholds are exceeded.
+ /// When true, pushes down hash table references (Map) for membership checks in hash joins
+ /// when the build side is too large for InList pushdown.
+ /// When false, no hash table reference (Map) is used for membership checks. If the build
+ /// side exceeds InList thresholds, no membership filter is created (only bounds, if enabled).

Comment on lines +1122 to +1128
pub hash_join_map_pushdown: bool, default = true

/// When true, pushes down min/max bounds for join key columns.
/// This enables statistics-based pruning (e.g., Parquet row group skipping).
/// When false, only membership filters (InList or Map) are pushed down.
/// Default: true
pub hash_join_bounds_pushdown: bool, default = true
Copilot AI Jan 21, 2026

The new configuration options hash_join_map_pushdown and hash_join_bounds_pushdown lack test coverage. While the PR description mentions existing hash join tests pass, there are no specific tests that validate the behavior when these options are set to false. Consider adding tests similar to the existing test_hashjoin_hash_table_pushdown_partitioned that verify:

  1. When hash_join_map_pushdown is false and build side exceeds InList thresholds, no membership filter is created (only bounds if enabled)
  2. When hash_join_bounds_pushdown is false, bounds predicates are not included in the filter
  3. When both are false but dynamic filter pushdown is enabled, appropriate behavior occurs
  4. Various combinations of these flags work correctly in both Partitioned and CollectLeft modes
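
One way to cover the combinations listed above is a table-driven test that enumerates every flag setting for both join modes. The sketch below only generates the matrix; `Mode`, `Case`, and `test_matrix` are hypothetical names, and the assertions against actual plans would need to be written against the real test utilities.

```rust
/// Join modes named in the review comment.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mode {
    Partitioned,
    CollectLeft,
}

/// One cell of the hypothetical test matrix.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Case {
    pub name: String,
    pub mode: Mode,
    pub map_pushdown: bool,
    pub bounds_pushdown: bool,
}

/// Enumerate all 2 x 2 x 2 combinations of mode and flags.
pub fn test_matrix() -> Vec<Case> {
    let mut cases = Vec::new();
    for mode in [Mode::Partitioned, Mode::CollectLeft] {
        for map_pushdown in [true, false] {
            for bounds_pushdown in [true, false] {
                cases.push(Case {
                    // A readable name makes a failing combination easy to spot.
                    name: format!(
                        "{:?}_map_{}_bounds_{}",
                        mode, map_pushdown, bounds_pushdown
                    ),
                    mode,
                    map_pushdown,
                    bounds_pushdown,
                });
            }
        }
    }
    cases
}
```

Each case would then be run twice, once with a build side under the InList thresholds and once over them, so that the map fallback path is exercised.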

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Jan 21, 2026