-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-38558: [C++] Add support for null sort option per sort key #46926
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
1.Reconstruct the SortKey structure and add NullPlacement. 2.Remove NullPlacement from SortOptions 3.Fix selectk not displaying non-empty results in null AtEnd scenario. When limit k is greater than the actual table data and the table contains Null/NaN, the data cannot be obtained and only non-empty results are available. Therefore, we support returning non-null and supporting the order of setting Null for each SortKey. 4.Add relevant unit tests and change the interface implemented by multiple versions
…8558 # Conflicts: # c_glib/arrow-glib/compute.cpp # c_glib/arrow-glib/compute.h # cpp/src/arrow/compute/kernels/vector_rank.cc # cpp/src/arrow/compute/kernels/vector_select_k.cc # cpp/src/arrow/compute/kernels/vector_sort.cc # cpp/src/arrow/compute/kernels/vector_sort_internal.h # python/pyarrow/_acero.pyx # python/pyarrow/_compute.pyx # python/pyarrow/array.pxi # python/pyarrow/tests/test_compute.py # python/pyarrow/tests/test_table.py
# Conflicts: # cpp/src/arrow/compute/api_vector.cc # cpp/src/arrow/compute/api_vector.h # cpp/src/arrow/compute/kernels/vector_rank.cc # cpp/src/arrow/compute/kernels/vector_select_k.cc # cpp/src/arrow/compute/kernels/vector_sort.cc # cpp/src/arrow/compute/kernels/vector_sort_internal.h # cpp/src/arrow/compute/kernels/vector_sort_test.cc # cpp/src/arrow/compute/ordering.cc # cpp/src/arrow/compute/ordering.h
…most likely human-error while merging)
| options->null_placement = garrow_optional_null_placement_to_raw( | ||
| static_cast<GArrowOptionalNullPlacement>(g_value_get_enum(value))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| options->null_placement = garrow_optional_null_placement_to_raw( | |
| static_cast<GArrowOptionalNullPlacement>(g_value_get_enum(value))); | |
| { | |
| auto null_placement = static_cast<GArrowOptionalNullPlacement>(g_value_get_enum(value)); | |
| if (null_placement == GARROW_OPTIONAL_NULL_PLACEMENT_UNSPECIFIED) { | |
| options->null_placement = std::nullopt; | |
| } else { | |
| options->null_placement = null_placement | |
| } | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be missing a static_cast:
{
auto null_placement = static_cast<GArrowOptionalNullPlacement>(g_value_get_enum(value));
if (null_placement == GARROW_OPTIONAL_NULL_PLACEMENT_UNSPECIFIED) {
options->null_placement = std::nullopt;
} else {
options->null_placement = static_cast<arrow::compute::NullPlacement>(null_placement);
}
}
Are you sure this is the cleaner way to go? Should I replace all usage of the helper function garrow_optional_null_placement_to_raw like this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, we have multiple places that use these helper functions. I missed them. Sorry.
Could you export these helper functions from compute.hpp instead of defining them in an anonymous namespace? And could you simplify these helper functions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I did this, but this lead to the following change:
The return type is a cpp object and not a pointer as for most other exported functions. Therefore, its definition must be outside the G_BEGIN_DECLS/G_END_DECLS (these define extern "C"). I moved it there, but it makes it feel a little out of place. No other definition before the line G_BEGIN_DECLS is exported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you move them after the G_END_DECLS
arrow/c_glib/arrow-glib/compute.cpp
Line 10666 in 75ef031
| G_END_DECLS |
G_END_DECLS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, this was embarrassing to miss. Thanks for the pointer!
|
Thank you for triggering the CI! I fixed the remaining I pushed them onto a separate branch, but they are not required for passing the CI: |
|
If you dislike them, I could remove the helper functions entirely as seen in: |
|
I see that there is an unintended API breakage for ruby:
|
ruby/red-arrow/lib/arrow/sort-key.rb
Outdated
| # for recreatable, we should remove suffix | ||
| if target.end_with?("_at_start") | ||
| suffix_length = "_at_start".length | ||
| target = target[0..-(suffix_length + 1)] | ||
| elsif target.end_with?("_at_end") | ||
| suffix_length = "_at_end".length | ||
| target = target[0..-(suffix_length + 1)] | ||
| end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, how about ^ for :at_start and $ for :at_end?
For example:
-column->["column", :descending, null_placement]+column->["column", :ascending, null_placement]-^column-> ["column", :descending, :at_start]`-$column-> ["column", :descending, :at_end]`+^column-> ["column", :ascending, :at_start]`+$column-> ["column", :ascending, :at_end]`^column-> ["column", order, :at_start]`$column-> ["column", order, :at_end]`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we omit the $ in the to_s() function (as :at_end is the default value)? Otherwise, it might be confusing for users that never used the null_placement option
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a good idea. Let's do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, maybe this was a bad idea, I just noticed that this makes them no longer safely round-trip. E.g.:
Arrow::SortKey.new(
# Inner key will printed as "+^count"
Arrow::SortKey.new(
"^count",
Arrow::SortOrder::ASCENDING,
Arrow::NullPlacement::AtEnd
).to_s
)
will have a null_placement of AtStart
|
Just to confirm, I just noticed that the breakage I pasted above: Is not in the red-arrow ruby gem, but only in the GI bindings. Is this tolerable? I.e. are they only used for internal tests as the one I pasted, or are they part of arrow's published api? Sorry for my lack of knowledge in this regard |
Yes. You're right.
They are part of published API like
No problem. It's (a bit?) confusing... |
| options->null_placement = garrow_optional_null_placement_to_raw( | ||
| static_cast<GArrowOptionalNullPlacement>(g_value_get_enum(value))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you move them after the G_END_DECLS
arrow/c_glib/arrow-glib/compute.cpp
Line 10666 in 75ef031
| G_END_DECLS |
G_END_DECLS.
c_glib/arrow-glib/compute.h
Outdated
| * NaNs will come before nulls. | ||
| * @GARROW_OPTIONAL_NULL_PLACEMENT_UNSPECIFIED: | ||
| * Do not specify null placement. | ||
| * Null placement should instead |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this garbage?
| * Null placement should instead |
c_glib/arrow-glib/compute.h
Outdated
| * Do not specify null placement. | ||
| * Null placement should instead | ||
| * | ||
| * They are corresponding to `std::optional<arrow::compute::NullPlacement>` values. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| * They are corresponding to `std::optional<arrow::compute::NullPlacement>` values. | |
| * They are corresponding to `arrow::compute::NullPlacement` values except | |
| * `GARROW_OPTIONAL_NULL_PLACEMENT_UNSPECIFIED`. | |
| * `GARROW_OPTIONAL_NULL_PLACEMENT_UNSPECIFIED` is used to specify | |
| * `std::nullopt`. |
c_glib/arrow-glib/compute.h
Outdated
| GARROW_OPTIONAL_NULL_PLACEMENT_AT_START, | ||
| GARROW_OPTIONAL_NULL_PLACEMENT_AT_END, | ||
| GARROW_OPTIONAL_NULL_PLACEMENT_UNSPECIFIED, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about using -1 that will be never used in arrow::compute::NullPlacement?
| GARROW_OPTIONAL_NULL_PLACEMENT_AT_START, | |
| GARROW_OPTIONAL_NULL_PLACEMENT_AT_END, | |
| GARROW_OPTIONAL_NULL_PLACEMENT_UNSPECIFIED, | |
| GARROW_OPTIONAL_NULL_PLACEMENT_UNSPECIFIED = -1, | |
| GARROW_OPTIONAL_NULL_PLACEMENT_AT_START, | |
| GARROW_OPTIONAL_NULL_PLACEMENT_AT_END, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea!
…RROW_OPTIONAL_NULL_PLACEMENT_UNSPECIFIED correspond to -1. All other (possibly future) values will have a 1:1 mapping
…unctions in c_glib/arrow-glib/compute.{hpp,cpp}
See #38584 for original PR. Will be quoted for this PR description.
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?
This PR includes breaking changes to public APIs. (If there are any breaking changes to public APIs, please explain which changes are breaking. If not, you can remove this.)
I amended the original PR to be less breaking in public APIs.
Still Ordering, SortOptions, RankOptions, and RankQuantileOptions now accept a
std::optional<NullPlacement>instead of NullPlacement, which did lead to some changes in downstream APIs and bindings.I also need some help with fixing thec_glibbindings.