-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Support JSON arrays reader/parse for datafusion #19924
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR extends DataFusion’s JSON support to handle files where the top-level value is a JSON array ([{...}, {...}]), in addition to the existing newline-delimited JSON (NDJSON) format.
Changes:
- Adds a
format_array: booloption toJsonOptions(config, protobuf, and JSON (de)serialization) and wires it throughJsonFormat/JsonSourceinto the JSON execution path. - Implements array-aware schema inference and reading in
datasource-json, including helper functions to infer schemas and read array JSON intoRecordBatches, plus updates examples and tests (unit tests and sqllogictests). - Adds new test data (
json_array.json,json_empty_array.json) and SQLLogic tests to validate array format behavior and the newOPTIONS ('format.format_array' 'true')flag.
Reviewed changes
Copilot reviewed 13 out of 17 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
datafusion/sqllogictest/test_files/json.slt |
Adds end-to-end SQL tests for JSON array input, including a failure case when format_array is not set. |
datafusion/proto/src/logical_plan/file_formats.rs |
Extends JsonOptionsProto ↔ JsonOptions mapping to include the format_array flag for logical plan file format serialization. |
datafusion/proto/src/generated/datafusion_proto_common.rs |
Regenerated protobuf Rust bindings to add format_array to JsonOptions. |
datafusion/proto-common/src/to_proto/mod.rs |
Includes format_array when converting JsonOptions to protobuf for common proto utilities. |
datafusion/proto-common/src/generated/prost.rs |
Regenerated prost definitions to add the format_array field to JsonOptions. |
datafusion/proto-common/src/generated/pbjson.rs |
Extends JSON (de)serialization of JsonOptions to handle the format_array field. |
datafusion/proto-common/src/from_proto/mod.rs |
Maps the protobuf format_array flag back into JsonOptions. |
datafusion/proto-common/proto/datafusion_common.proto |
Adds bool format_array = 4; to the JsonOptions message definition. |
datafusion/datasource-json/src/source.rs |
Threads format_array through JsonOpener/JsonSource and adds read_json_array_to_batches; array files are read by loading the full array, converting to NDJSON, and delegating to Arrow’s JSON reader. |
datafusion/datasource-json/src/file_format.rs |
Updates JsonFormat docs and behavior to support array mode, adds with_format_array, implements infer_json_schema_from_json_array, and passes format_array into JsonSource. |
datafusion/datasource-json/Cargo.toml |
Adds serde_json as a dependency to support JSON array parsing. |
datafusion/core/tests/data/json_empty_array.json |
Provides an empty JSON array test file used by schema inference tests. |
datafusion/core/tests/data/json_array.json |
Provides a sample JSON array file used by sqllogictests. |
datafusion/core/src/datasource/file_format/json.rs |
Adds tests covering array-format JSON: schema inference, empty array behavior, inference limit, data reading, and projection handling. |
datafusion/common/src/config.rs |
Extends JsonOptions config namespace with a documented format_array flag and describes NDJSON vs array formats. |
datafusion-examples/examples/custom_data_source/csv_json_opener.rs |
Updates the custom JsonOpener::new example to pass the new format_array parameter (set to false for NDJSON). |
Cargo.lock |
Updates lockfile to account for the new serde_json dependency in datasource-json. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Which issue does this PR close?
Closes #19920
Rationale for this change
DataFusion currently only supports line-delimited JSON (NDJSON) format. Many data sources provide JSON in array format
[{...}, {...}], which cannot be parsed by the existing implementation.What changes are included in this PR?
format_arrayoption toJsonOptionsto support JSON array formatOPTIONS ('format.format_array' 'true')Are these changes tested?
Yes:
Are there any user-facing changes?
Yes. Users can now read JSON array format files by specifying the
format.format_arrayoption: