@Taepper Taepper commented Jun 27, 2025

See #38584 for the original PR. Its description is quoted below.

Rationale for this change

Support null placement (e.g. NULLS FIRST) on individual sort keys:

order by i nulls first, j, k nulls first;

Null placement can currently be set only for all sort keys at once, not for an individual sort key, so this PR adds a NullPlacement field to SortKey. The underlying sorting framework already handles this well: the change mostly consists of passing each SortKey's null_placement through to it.
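
To make the idea concrete, here is a minimal sketch of a multi-key comparison where each key carries its own null placement. It does not use Arrow at all: `Row`, `SortKey`, `RowLess`, and `SortExample` are stand-ins invented for this illustration, and `std::nullopt` models a null value. The key without a NULLS clause is assumed to default to nulls last.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Stand-in types for illustration only; these are NOT the Arrow definitions.
enum class NullPlacement { AtStart, AtEnd };

struct SortKey {
  std::size_t column;            // index of the column to compare
  NullPlacement null_placement;  // where nulls go for this key only
};

using Row = std::vector<std::optional<int>>;  // std::nullopt models a null

// Returns true if row `a` must come before row `b` (ascending on every key).
bool RowLess(const Row& a, const Row& b, const std::vector<SortKey>& keys) {
  for (const SortKey& key : keys) {
    const std::optional<int>& x = a[key.column];
    const std::optional<int>& y = b[key.column];
    if (!x && !y) continue;  // both null: tie on this key
    if (!x || !y) {
      // Exactly one side is null, so this key's placement decides.
      const bool a_is_null = !x;
      return a_is_null == (key.null_placement == NullPlacement::AtStart);
    }
    if (*x != *y) return *x < *y;
  }
  return false;  // equal on every key
}

// Sorts rows like: ORDER BY i NULLS FIRST, j, k NULLS FIRST
std::vector<Row> SortExample(std::vector<Row> rows) {
  const std::vector<SortKey> keys = {
      {0, NullPlacement::AtStart},
      {1, NullPlacement::AtEnd},  // assumed default for the key without a clause
      {2, NullPlacement::AtStart},
  };
  std::stable_sort(rows.begin(), rows.end(), [&](const Row& a, const Row& b) {
    return RowLess(a, b, keys);
  });
  return rows;
}
```

With a single placement shared by all keys, the ordering in `SortExample` could not be expressed, which is the gap the PR description points at.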

What changes are included in this PR?

1. SortKey structure, NullPlacement propagation, sorting logic, Ordering, and related tests.
2. Substrait-related changes.
3. c_glib-related changes.
4. SelectK-related changes.
5. RankOptions-related changes.

Are these changes tested?

Yes. I changed the tests in vector_sort_test.cc and performed additional tests.

Are there any user-facing changes?

Yes. Null placement can now be specified per sort key, as in PostgreSQL.

This PR includes breaking changes to public APIs. (If there are any breaking changes to public APIs, please explain which changes are breaking. If not, you can remove this.)

I amended the original PR to be less breaking in public APIs.

Still, Ordering, SortOptions, RankOptions, and RankQuantileOptions now accept a std::optional<NullPlacement> instead of a NullPlacement, which led to some changes in downstream APIs and bindings. I also need some help with fixing the c_glib bindings.
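
As a rough illustration of what an optional options-level value could look like to callers, here is a small sketch. The types are stand-ins, not the Arrow classes, and the precedence rule in `EffectivePlacement` (an options-level value, when set, overrides each key) is an assumption made for this example; the description above does not state it.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Stand-in types for illustration only; these are NOT the Arrow definitions.
enum class NullPlacement { AtStart, AtEnd };

struct SortKey {
  NullPlacement null_placement = NullPlacement::AtEnd;
};

struct SortOptions {
  std::vector<SortKey> sort_keys;
  // std::nullopt means "not specified at the options level".
  std::optional<NullPlacement> null_placement;
};

// ASSUMPTION: an options-level value, when present, overrides every key.
// Otherwise the key's own placement is used.
NullPlacement EffectivePlacement(const SortOptions& options, const SortKey& key) {
  return options.null_placement.value_or(key.null_placement);
}

// Usage sketch: one caller sets a single value, another relies on the keys.
inline void EffectivePlacementDemo() {
  SortOptions per_key;
  per_key.sort_keys = {SortKey{NullPlacement::AtStart}, SortKey{}};
  assert(EffectivePlacement(per_key, per_key.sort_keys[0]) == NullPlacement::AtStart);
  assert(EffectivePlacement(per_key, per_key.sort_keys[1]) == NullPlacement::AtEnd);

  SortOptions global = per_key;
  global.null_placement = NullPlacement::AtEnd;
  assert(EffectivePlacement(global, global.sort_keys[0]) == NullPlacement::AtEnd);
}
```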

Light-City and others added 30 commits November 9, 2023 09:57
1. Reconstruct the SortKey structure and add NullPlacement.

2. Remove NullPlacement from SortOptions.

3. Fix selectk not displaying non-empty results in null AtEnd scenario.

When limit k is greater than the actual table data and the table contains Null/NaN, the data cannot be obtained and only non-empty results are available.
Therefore, we support returning non-null and supporting the order of setting Null for each SortKey.

4. Add relevant unit tests and change the interface implemented by multiple versions.
…8558

# Conflicts:
#	c_glib/arrow-glib/compute.cpp
#	c_glib/arrow-glib/compute.h
#	cpp/src/arrow/compute/kernels/vector_rank.cc
#	cpp/src/arrow/compute/kernels/vector_select_k.cc
#	cpp/src/arrow/compute/kernels/vector_sort.cc
#	cpp/src/arrow/compute/kernels/vector_sort_internal.h
#	python/pyarrow/_acero.pyx
#	python/pyarrow/_compute.pyx
#	python/pyarrow/array.pxi
#	python/pyarrow/tests/test_compute.py
#	python/pyarrow/tests/test_table.py
# Conflicts:
#	cpp/src/arrow/compute/api_vector.cc
#	cpp/src/arrow/compute/api_vector.h
#	cpp/src/arrow/compute/kernels/vector_rank.cc
#	cpp/src/arrow/compute/kernels/vector_select_k.cc
#	cpp/src/arrow/compute/kernels/vector_sort.cc
#	cpp/src/arrow/compute/kernels/vector_sort_internal.h
#	cpp/src/arrow/compute/kernels/vector_sort_test.cc
#	cpp/src/arrow/compute/ordering.cc
#	cpp/src/arrow/compute/ordering.h
Comment on lines +4405 to +4406
options->null_placement = garrow_optional_null_placement_to_raw(
  static_cast<GArrowOptionalNullPlacement>(g_value_get_enum(value)));
Member

Suggested change
- options->null_placement = garrow_optional_null_placement_to_raw(
-   static_cast<GArrowOptionalNullPlacement>(g_value_get_enum(value)));
+ {
+   auto null_placement = static_cast<GArrowOptionalNullPlacement>(g_value_get_enum(value));
+   if (null_placement == GARROW_OPTIONAL_NULL_PLACEMENT_UNSPECIFIED) {
+     options->null_placement = std::nullopt;
+   } else {
+     options->null_placement = null_placement
+   }
+ }

Author

This would be missing a static_cast:

    {
      auto null_placement = static_cast<GArrowOptionalNullPlacement>(g_value_get_enum(value));
      if (null_placement == GARROW_OPTIONAL_NULL_PLACEMENT_UNSPECIFIED) {
        options->null_placement = std::nullopt;
      } else {
        options->null_placement = static_cast<arrow::compute::NullPlacement>(null_placement);
      }
    }

Are you sure this is the cleaner way to go? Should I replace all usage of the helper function garrow_optional_null_placement_to_raw like this?

Member

Ah, we have multiple places that use these helper functions. I missed them. Sorry.

Could you export these helper functions from compute.hpp instead of defining them in an anonymous namespace? And could you simplify these helper functions?

Author

Will do

Author

Okay, I did this, but it led to the following change:

The return type is a C++ object and not a pointer, as it is for most other exported functions. Therefore, its definition must be outside the G_BEGIN_DECLS/G_END_DECLS block (these define extern "C"). I moved it there, but it feels a little out of place: no other definition before the G_BEGIN_DECLS line is exported.

Member

Could you move them after the G_END_DECLS? We place exported C++ functions after G_END_DECLS.
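
A minimal sketch of that header layout follows. It is not the real compute.hpp: the `SKETCH_*` macros stand in for G_BEGIN_DECLS/G_END_DECLS so the example compiles without GLib, both function names are hypothetical, and treating a negative value as "unspecified" is an assumption made only for this sketch.

```cpp
#include <cassert>
#include <optional>

// Local stand-ins so the sketch compiles without GLib; real headers get
// G_BEGIN_DECLS / G_END_DECLS from <glib.h>.
#define SKETCH_BEGIN_DECLS extern "C" {
#define SKETCH_END_DECLS }

SKETCH_BEGIN_DECLS

// C-compatible API: only C types appear inside the extern "C" block.
// Hypothetical function; a negative value means "unspecified" here.
int sketch_null_placement_is_specified(int placement);

SKETCH_END_DECLS

// Exported C++ helper, declared after the extern "C" block because it
// returns a C++ class type (std::optional). Hypothetical name.
std::optional<int> sketch_null_placement_to_raw(int placement);

int sketch_null_placement_is_specified(int placement) {
  return placement >= 0 ? 1 : 0;
}

std::optional<int> sketch_null_placement_to_raw(int placement) {
  if (placement < 0) return std::nullopt;
  return placement;
}

// Usage sketch exercising both the C-style and the C++ entry point.
inline void HeaderLayoutDemo() {
  assert(sketch_null_placement_is_specified(-1) == 0);
  assert(!sketch_null_placement_to_raw(-1).has_value());
  assert(sketch_null_placement_to_raw(1) == 1);
}
```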

Author

Oh, this was embarrassing to miss. Thanks for the pointer!

Taepper commented Jan 23, 2026

Thank you for triggering the CI!

I fixed the remaining c_glib issues (sorry for not merging in an up-to-date origin/main before) and addressed all your comments except the last two.

I pushed them onto a separate branch, but they are not required for passing the CI:
Taepper@8aa6733

Taepper commented Jan 23, 2026

If you dislike them, I could remove the helper functions entirely as seen in:
Taepper@1c15e9b

Taepper commented Jan 23, 2026

I see that there is an unintended API breakage for Ruby:

/Users/runner/work/arrow/arrow/c_glib/test/test-select-k-options.rb:33:in `test_sort_keys'
     30: 
     31:   def test_sort_keys
     32:     sort_keys = [
  => 33:       Arrow::SortKey.new("column1", :ascending),
     34:       Arrow::SortKey.new("column2", :descending),
     35:     ]
     36:     @options.sort_keys = sort_keys
Error: ArgumentError: wrong arguments: Arrow::SortKey#initialize("column1", :ascending):

A SortKey without a specified null placement should be constructible with a default NullPlacement of AtEnd.
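
In C++ terms, the backward-compatible shape described here could look like the sketch below, where the third constructor argument is defaulted so existing two-argument calls keep compiling. The types are stand-ins written for this illustration, not the Arrow or GLib definitions.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Stand-in types for illustration only; these are NOT the Arrow definitions.
enum class SortOrder { Ascending, Descending };
enum class NullPlacement { AtStart, AtEnd };

struct SortKey {
  // The defaulted third argument keeps two-argument call sites valid.
  explicit SortKey(std::string target, SortOrder order = SortOrder::Ascending,
                   NullPlacement null_placement = NullPlacement::AtEnd)
      : target(std::move(target)), order(order), null_placement(null_placement) {}

  std::string target;
  SortOrder order;
  NullPlacement null_placement;
};

// Usage sketch: the old call shape and the new one side by side.
inline void DefaultPlacementDemo() {
  SortKey two_args("column1", SortOrder::Ascending);
  assert(two_args.null_placement == NullPlacement::AtEnd);

  SortKey three_args("column2", SortOrder::Descending, NullPlacement::AtStart);
  assert(three_args.null_placement == NullPlacement::AtStart);
}
```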

Comment on lines 178 to 185
# for recreatable, we should remove suffix
if target.end_with?("_at_start")
  suffix_length = "_at_start".length
  target = target[0..-(suffix_length + 1)]
elsif target.end_with?("_at_end")
  suffix_length = "_at_end".length
  target = target[0..-(suffix_length + 1)]
end
Member

Hmm, how about ^ for :at_start and $ for :at_end?

For example:

  • -column -> ["column", :descending, null_placement]
  • +column -> ["column", :ascending, null_placement]
  • -^column -> ["column", :descending, :at_start]
  • -$column -> ["column", :descending, :at_end]
  • +^column -> ["column", :ascending, :at_start]
  • +$column -> ["column", :ascending, :at_end]
  • ^column -> ["column", order, :at_start]
  • $column -> ["column", order, :at_end]
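
The prefix rules in this list can be sketched as a small parser. This is an illustration in C++ rather than the Ruby code under review; `ParsedSortKey` and `ParseSortKey` are hypothetical names, and an empty optional stands for the cases where the list falls back to a separately supplied `order` or `null_placement`.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>

// Stand-in types for illustration only.
enum class SortOrder { Ascending, Descending };
enum class NullPlacement { AtStart, AtEnd };

struct ParsedSortKey {
  std::string column;
  std::optional<SortOrder> order;               // empty: no +/- prefix
  std::optional<NullPlacement> null_placement;  // empty: no ^/$ prefix
};

// Parses "[+-][^$]column": the order marker first, then the null marker.
ParsedSortKey ParseSortKey(const std::string& text) {
  ParsedSortKey key;
  std::size_t pos = 0;
  if (pos < text.size() && (text[pos] == '+' || text[pos] == '-')) {
    key.order = text[pos] == '+' ? SortOrder::Ascending : SortOrder::Descending;
    ++pos;
  }
  if (pos < text.size() && (text[pos] == '^' || text[pos] == '$')) {
    key.null_placement =
        text[pos] == '^' ? NullPlacement::AtStart : NullPlacement::AtEnd;
    ++pos;
  }
  key.column = text.substr(pos);
  return key;
}

// Usage sketch covering two rows of the list above.
inline void ParseSortKeyDemo() {
  const ParsedSortKey a = ParseSortKey("-^column");
  assert(a.column == "column");
  assert(a.order == SortOrder::Descending);
  assert(a.null_placement == NullPlacement::AtStart);

  const ParsedSortKey b = ParseSortKey("$column");
  assert(!b.order.has_value());
  assert(b.null_placement == NullPlacement::AtEnd);
}
```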

Author

Sounds good to me

Author

Should we omit the $ in the to_s() function (as :at_end is the default value)? Otherwise, it might be confusing for users who have never used the null_placement option.

Member

It's a good idea. Let's do it.

Author

Hmm, maybe this was a bad idea. I just noticed that omitting the $ means sort keys no longer round-trip safely. E.g.:

Arrow::SortKey.new(
  # Inner key will be printed as "+^count"
  Arrow::SortKey.new(
    "^count",
    Arrow::SortOrder::ASCENDING,
    Arrow::NullPlacement::AtEnd
  ).to_s
)

will have a null_placement of AtStart, although the inner key used AtEnd.
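
The same round-trip problem can be reproduced in a compact sketch. This is C++ written for illustration, not the red-arrow implementation: `Key`, `Format`, and `Parse` are hypothetical, with `Format` omitting the `$` marker for AtEnd as proposed in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class NullPlacement { AtStart, AtEnd };  // stand-in for illustration

struct Key {
  std::string column;
  bool ascending;
  NullPlacement null_placement;
};

// Formats a key, omitting '$' because AtEnd is treated as the default.
std::string Format(const Key& key) {
  std::string out(1, key.ascending ? '+' : '-');
  if (key.null_placement == NullPlacement::AtStart) out += '^';
  return out + key.column;
}

// Parses "[+-][^$]column"; a missing marker means ascending / AtEnd.
Key Parse(const std::string& text) {
  Key key{"", true, NullPlacement::AtEnd};
  std::size_t pos = 0;
  if (pos < text.size() && (text[pos] == '+' || text[pos] == '-')) {
    key.ascending = text[pos] == '+';
    ++pos;
  }
  if (pos < text.size() && (text[pos] == '^' || text[pos] == '$')) {
    key.null_placement =
        text[pos] == '^' ? NullPlacement::AtStart : NullPlacement::AtEnd;
    ++pos;
  }
  key.column = text.substr(pos);
  return key;
}

// A column literally named "^count" does not survive Format then Parse.
inline void RoundTripDemo() {
  const Key original{"^count", true, NullPlacement::AtEnd};
  const std::string printed = Format(original);
  assert(printed == "+^count");

  const Key reparsed = Parse(printed);
  assert(reparsed.column == "count");                          // name changed
  assert(reparsed.null_placement == NullPlacement::AtStart);   // placement changed
}
```

In this sketch, always emitting the marker (printing `+$^count`) would parse back to the original key, at the cost of showing `$` to users who never set a null placement.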

Taepper commented Jan 23, 2026

Just to confirm, I just noticed that the breakage I pasted above:

/Users/runner/work/arrow/arrow/c_glib/test/test-select-k-options.rb:33:in `test_sort_keys'
     30: 
     31:   def test_sort_keys
     32:     sort_keys = [
  => 33:       Arrow::SortKey.new("column1", :ascending),
     34:       Arrow::SortKey.new("column2", :descending),
     35:     ]
     36:     @options.sort_keys = sort_keys
Error: ArgumentError: wrong arguments: Arrow::SortKey#initialize("column1", :ascending):

Is not in the red-arrow ruby gem, but only in the GI bindings. Is this tolerable? I.e. are they only used for internal tests as the one I pasted, or are they part of arrow's published api?

Sorry for my lack of knowledge in this regard

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Jan 23, 2026
kou commented Jan 23, 2026

> Is not in the red-arrow ruby gem, but only in the GI bindings. Is this tolerable?

Yes. You're right.

> I.e. are they only used for internal tests as the one I pasted, or are they part of arrow's published api?

They are part of the published API, like c_glib/*/*.{h,hpp}. But this API breakage is acceptable because the next release will be a major version bump. In semantic versioning, a major version bump is where API breakages are allowed.

> Sorry for my lack of knowledge in this regard

No problem. It's (a bit?) confusing...

* NaNs will come before nulls.
* @GARROW_OPTIONAL_NULL_PLACEMENT_UNSPECIFIED:
* Do not specify null placement.
* Null placement should instead
Member

Is this garbage?

Suggested change
- * Null placement should instead

* Do not specify null placement.
* Null placement should instead
*
* They are corresponding to `std::optional<arrow::compute::NullPlacement>` values.
Member

Suggested change
- * They are corresponding to `std::optional<arrow::compute::NullPlacement>` values.
+ * They are corresponding to `arrow::compute::NullPlacement` values except
+ * `GARROW_OPTIONAL_NULL_PLACEMENT_UNSPECIFIED`.
+ * `GARROW_OPTIONAL_NULL_PLACEMENT_UNSPECIFIED` is used to specify
+ * `std::nullopt`.

Comment on lines 529 to 531
GARROW_OPTIONAL_NULL_PLACEMENT_AT_START,
GARROW_OPTIONAL_NULL_PLACEMENT_AT_END,
GARROW_OPTIONAL_NULL_PLACEMENT_UNSPECIFIED,
Member

How about using -1, which will never be used in arrow::compute::NullPlacement?

Suggested change
- GARROW_OPTIONAL_NULL_PLACEMENT_AT_START,
- GARROW_OPTIONAL_NULL_PLACEMENT_AT_END,
- GARROW_OPTIONAL_NULL_PLACEMENT_UNSPECIFIED,
+ GARROW_OPTIONAL_NULL_PLACEMENT_UNSPECIFIED = -1,
+ GARROW_OPTIONAL_NULL_PLACEMENT_AT_START,
+ GARROW_OPTIONAL_NULL_PLACEMENT_AT_END,

Author

Good idea!
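
With -1 reserved for the unspecified case, the conversion helpers can shrink to one branch plus a cast. The sketch below uses stand-in enums (`SketchOptionalNullPlacement` and a local `NullPlacement`) and hypothetical helper names rather than the real c_glib or Arrow definitions; it assumes the remaining enumerators share their numeric values with the C++ enum.

```cpp
#include <cassert>
#include <optional>

// Stand-in for the C++ enum; its enumerators have the values 0 and 1.
enum class NullPlacement { AtStart, AtEnd };

// Stand-in for the GLib-side enum, with -1 reserved for "unspecified".
typedef enum {
  SKETCH_OPTIONAL_NULL_PLACEMENT_UNSPECIFIED = -1,
  SKETCH_OPTIONAL_NULL_PLACEMENT_AT_START,  // 0, same as AtStart
  SKETCH_OPTIONAL_NULL_PLACEMENT_AT_END     // 1, same as AtEnd
} SketchOptionalNullPlacement;

// Hypothetical helper: GLib-side value to the optional C++ value.
std::optional<NullPlacement> ToRaw(SketchOptionalNullPlacement placement) {
  if (placement == SKETCH_OPTIONAL_NULL_PLACEMENT_UNSPECIFIED) {
    return std::nullopt;
  }
  return static_cast<NullPlacement>(placement);  // 1:1 for all other values
}

// Hypothetical helper: optional C++ value back to the GLib-side value.
SketchOptionalNullPlacement FromRaw(const std::optional<NullPlacement>& placement) {
  if (!placement) return SKETCH_OPTIONAL_NULL_PLACEMENT_UNSPECIFIED;
  return static_cast<SketchOptionalNullPlacement>(*placement);
}

// Usage sketch: every value survives a round trip through both helpers.
inline void OptionalPlacementDemo() {
  assert(!ToRaw(SKETCH_OPTIONAL_NULL_PLACEMENT_UNSPECIFIED).has_value());
  assert(ToRaw(SKETCH_OPTIONAL_NULL_PLACEMENT_AT_END) == NullPlacement::AtEnd);
  assert(FromRaw(std::nullopt) == SKETCH_OPTIONAL_NULL_PLACEMENT_UNSPECIFIED);
  assert(FromRaw(NullPlacement::AtStart) == SKETCH_OPTIONAL_NULL_PLACEMENT_AT_START);
}
```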

…RROW_OPTIONAL_NULL_PLACEMENT_UNSPECIFIED correspond to -1. All other (possibly future) values will have a 1:1 mapping
…unctions in c_glib/arrow-glib/compute.{hpp,cpp}
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jan 24, 2026