Conversation

@alxmrs alxmrs commented Jan 15, 2026

Fixes #17, #93.

…rom arrow streams, which are lazy, to implement the table registration, which, in the default df library, is not lazy enough.
…(XarrayContext). This leads to two failed tests, though that may be caused by errors in the tests themselves.

@alxmrs alxmrs left a comment


A few notes.

The previous implementation stored a single Arrow stream that could only
be consumed once, causing subsequent queries on the same table to return
empty results. This broke filters and aggregations.

Changes:
- Modify Rust PyArrowStreamPartition to accept a factory function instead
  of a stream object. The factory is called on each execute() to create
  a fresh stream, allowing multiple queries on the same table.
- Update LazyArrowStreamTable to take a factory and schema instead of
  consuming a stream directly.
- Update Python read_xarray_table to create a factory function that
  produces fresh XarrayRecordBatchReader instances.
- Update tests to use the new factory-based API via read_xarray_table.

This enables proper lazy evaluation while supporting multiple queries
on registered tables, fixing the failing filter and aggregation tests.
XarrayRecordBatchReader already implements __arrow_c_stream__, so there's
no need to wrap it in pa.RecordBatchReader.from_stream(). The Rust code
can consume it directly via ArrowArrayStreamReader::from_pyarrow_bound.
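The single-consumption bug and the factory fix described above can be sketched in plain Python (the names here are illustrative stand-ins, not the actual crate or package API):

```python
# Sketch of the bug: a one-shot stream is exhausted by the first query.
class OneShotStream:
    def __init__(self, batches):
        self._it = iter(batches)

    def read_next_batch(self):
        return next(self._it, None)  # None signals end-of-stream

def run_query(stream):
    out = []
    while (batch := stream.read_next_batch()) is not None:
        out.append(batch)
    return out

stream = OneShotStream([[1, 2], [3, 4]])
assert run_query(stream) == [[1, 2], [3, 4]]  # first query works
assert run_query(stream) == []                # second query: empty results

# Fix: register a factory instead; each execute() builds a fresh stream.
make_stream = lambda: OneShotStream([[1, 2], [3, 4]])
assert run_query(make_stream()) == [[1, 2], [3, 4]]
assert run_query(make_stream()) == [[1, 2], [3, 4]]  # repeatable
```

This is the same shape as the change: `read_xarray_table` hands the Rust side a callable that produces a fresh reader, rather than a reader that can only be drained once.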
@alxmrs alxmrs requested a review from maximedion2 January 25, 2026 06:40
```rust
    &self.schema
}

fn execute(&self, _ctx: Arc<TaskContext>) -> SendableRecordBatchStream {
```

CC: @maximedion2 I could use your expertise and attention here. This is new since the last review.


So, I've checked a couple of times today; it seems you've been changing this part of the code a few times? Are you still working on this?

The Python <> Rust part is confusing me a bit; I still have to wrap my head around grabbing the GIL everywhere, but even setting that aside, I'm not sure I completely follow. There are two concepts here: a partition and a batch. In DataFusion, from what I understand, partitions run in parallel, while batches are streamed sequentially within a given partition. For example, if you have 2 folders with 10 parquet files in each, you can spin up 2 partitions, each of which will produce 10 batches. The purpose of partitions is to allow parallelism; the purpose of batches is to allow work to progress down the execution graph while earlier steps are still running, i.e. as long as you don't need to materialize all the data, your first batch can go through operations while the second batch is read, the third batch is read, etc.
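A toy model of that partition/batch split (pure Python, illustrative only; the folder and batch names are made up):

```python
from concurrent.futures import ThreadPoolExecutor

# Two "folders", each modeled as a partition producing batches in order.
partitions = [
    ["p0-batch0", "p0-batch1"],
    ["p1-batch0", "p1-batch1"],
]

def run_partition(batches):
    # Batches stream sequentially within one partition; downstream
    # operators could start on batch 0 while batch 1 is still being read.
    results = []
    for batch in batches:
        results.append(batch.upper())  # stand-in for per-batch work
    return results

# Partitions run in parallel with each other.
with ThreadPoolExecutor(max_workers=2) as pool:
    out = list(pool.map(run_partition, partitions))

assert out == [["P0-BATCH0", "P0-BATCH1"], ["P1-BATCH0", "P1-BATCH1"]]
```

The point of the sketch: parallelism lives at the partition level, and ordering guarantees live inside each partition, which is why a single-partition `PartitionStream` that only ever reads batches sequentially raises the question of what the mutex is protecting.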

Here, from what I understand, you are producing a single partition; that's the purpose of PartitionStream. That partition will produce multiple batches, sequentially, so I don't think I understand why there's a mutex.

The "state" pattern you are using is what I initially did for my zarr stream, see here https://github.com/datafusion-contrib/arrow-zarr/blob/fa08cf93f369159ce09e843d68ab90fd64a3d3b3/src/zarr_store_opener/zarr_data_stream.rs#L790. It's quite complicated; the state has to hold the futures to keep track of them, etc. Kyle refactored this part to what it looks like now (see this https://github.com/datafusion-contrib/arrow-zarr/blob/e61b8df47fc3dfd1a95b305c82420276b9efab86/src/zarr_store_opener/zarr_data_stream.rs#L757-L789). We had a discussion on the PR, but in the end I did like what he did: a simpler, cleaner approach, I think.

I think, given that the batch production logic here is fairly straightforward, there's probably a more direct way to generate the record batch stream. I'll have to look more into this; like I said, I'm still not completely comfortable with the back and forth with Python, haha.


Hmm, looking more at this, you might not be able to directly convert your reader into a stream because it's a Python class, but maybe you can still use the try_stream macro, in which you call read_next_batch through the Python interface? Your reader already handles returning None when it's done reading, right? So that might be a short and clean approach. Not 100% sure whether the calls to the Python interpreter would break this, but maybe worth a try.
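The suggestion above hinges on one contract: read_next_batch yields batches until it returns None, so a loop on the consuming side terminates cleanly. The Python side of that contract might look like this (a sketch with hypothetical names; the real XarrayRecordBatchReader presumably satisfies the same shape):

```python
# Hypothetical reader obeying the "None when done" contract that the
# suggested try_stream loop on the Rust side would rely on.
class FakeRecordBatchReader:
    def __init__(self, n_batches):
        self._remaining = n_batches

    def read_next_batch(self):
        if self._remaining == 0:
            return None  # end of stream; the consumer's loop stops here
        self._remaining -= 1
        return {"batch_index": self._remaining}

def drain(reader):
    # Mirrors the proposed loop: keep yielding until read_next_batch
    # returns None.
    while (batch := reader.read_next_batch()) is not None:
        yield batch

batches = list(drain(FakeRecordBatchReader(3)))
assert len(batches) == 3
assert list(drain(FakeRecordBatchReader(0))) == []  # empty reader is fine
```

As long as each execute() gets a fresh reader from the factory, this loop shape composes with the multiple-query fix above.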

Successfully merging this pull request may close these issues.

Opening large Zarr datasets should be lazy (and fast)