@guillaume-rochette-oxb guillaume-rochette-oxb commented Jan 23, 2026

Rationale for this change

Hi everyone,
This PR resolves issue #48962.
It adds a function that dynamically restacks (resizes) a stream of pa.RecordBatch objects with respect to minimum and maximum numbers of rows and bytes.
This means that batches that are too large are split into smaller ones and, conversely, batches that are too small are concatenated into bigger ones.
That way, resource usage becomes more predictable for parallel processing tasks.
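
To illustrate the idea only, here is a minimal sketch of the split-and-concatenate logic on row counts. It is not the code from this PR: the name `restack_rows` is hypothetical, plain Python lists stand in for record batches, and the byte limits are left out. With pyarrow, the same loop would presumably slice batches with `RecordBatch.slice()` and measure them with `num_rows` and `nbytes`.

```python
from typing import Iterable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def restack_rows(
    batches: Iterable[Sequence[T]], min_rows: int, max_rows: int
) -> Iterator[List[T]]:
    """Hypothetical sketch: regroup rows so each output has min_rows..max_rows rows.

    Lists of rows stand in for record batches. Only the final output may
    hold fewer than min_rows rows, since the stream may simply run out.
    """
    if not 1 <= min_rows <= max_rows:
        raise ValueError("expected 1 <= min_rows <= max_rows")

    buffer: List[T] = []
    for batch in batches:
        # Concatenate: small batches accumulate in the buffer.
        buffer.extend(batch)
        # Split: emit full chunks while the buffer exceeds the maximum.
        while len(buffer) > max_rows:
            yield buffer[:max_rows]
            buffer = buffer[max_rows:]
        # Emit as soon as the buffer is large enough on its own.
        if len(buffer) >= min_rows:
            yield buffer
            buffer = []
    # Flush whatever is left at the end of the stream.
    if buffer:
        yield buffer


# Example: uneven input batches of 2, 1, 10 and 1 rows.
stream = [[1, 2], [3], list(range(4, 14)), [14]]
restacked = list(restack_rows(stream, min_rows=3, max_rows=4))
sizes = [len(chunk) for chunk in restacked]
```

The sketch keeps a single buffer and preserves row order, so every input row appears exactly once in the output.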

What changes are included in this PR?

The function restack_batches() and its unit tests 😃

Are these changes tested?

Yes 🫡

Are there any user-facing changes?

No 🙅

@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@guillaume-rochette-oxb guillaume-rochette-oxb changed the title Added restack.py and test_restack.py GH-${48962}: [Python] ${SUMMARY} Add restack_batches() Jan 23, 2026
@guillaume-rochette-oxb guillaume-rochette-oxb changed the title GH-${48962}: [Python] ${SUMMARY} Add restack_batches() GH-48962: [Python] Add restack_batches() Jan 23, 2026
@github-actions

⚠️ GitHub issue #48962 has been automatically assigned in GitHub to PR creator.
