@zhuqi-lucas
Contributor

Which issue does this PR close?

Closes #19920

Rationale for this change

DataFusion currently supports only line-delimited JSON (NDJSON). Many data sources instead provide JSON as a top-level array, [{...}, {...}], which the existing implementation cannot parse.
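
To make the distinction concrete, here is a minimal Python sketch (standard library only, not DataFusion code) of why a line-oriented reader handles NDJSON but fails on a top-level array. The sample strings and the helper name `parse_line_delimited` are assumptions made for illustration.

```python
import json


def parse_line_delimited(text):
    """Parse each non-empty line as one standalone JSON value (NDJSON-style)."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


ndjson_text = '{"a": 1}\n{"a": 2}\n'
array_text = '[\n  {"a": 1},\n  {"a": 2}\n]\n'

# NDJSON: every line is a complete JSON object, so line-by-line parsing works.
ndjson_rows = parse_line_delimited(ndjson_text)

# Array format: a line such as "[" is not a complete JSON value,
# so the line-oriented parser raises an error.
try:
    parse_line_delimited(array_text)
    array_parse_failed = False
except json.JSONDecodeError:
    array_parse_failed = True

# Parsing the whole document at once recovers the same rows.
array_rows = json.loads(array_text)
```

Both inputs describe the same two records; only the framing differs, which is why a separate option is needed to tell the reader which framing to expect.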

What changes are included in this PR?

  • Add format_array option to JsonOptions to support JSON array format
  • Implement schema inference for JSON array format
  • Implement data reading for JSON array format
  • Add protobuf serialization support for the new option
  • SQL support via OPTIONS ('format.format_array' 'true')

Are these changes tested?

Yes:

  • Unit tests for schema inference (normal, empty array, with limit)
  • Unit tests for data reading and projection
  • SQLLogicTest for end-to-end SQL validation
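
The three inference cases listed above (normal, empty array, with limit) can be pictured with a small Python sketch. This is not the PR's Rust implementation: the function name `infer_fields`, the "first non-null type wins" rule, and the type names are simplifying assumptions, and Arrow's real inference coerces types rather than keeping the first one seen.

```python
import json


def infer_fields(array_text, limit=None):
    """Return {field_name: type_name} using only the first `limit` elements."""
    records = json.loads(array_text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    fields = {}
    # A limit of None scans every element; an integer scans only a prefix.
    for record in records[:limit]:
        for name, value in record.items():
            # Keep the first non-null type seen for each field (assumption).
            if value is not None:
                fields.setdefault(name, type(value).__name__)
    return fields


normal = infer_fields('[{"a": 1, "b": "x"}, {"a": 2, "c": 1.5}]')
empty = infer_fields('[]')
limited = infer_fields('[{"a": 1}, {"b": "x"}]', limit=1)
```

With a limit of 1, the field `b` is never seen, which is the kind of behavior a "with limit" test would need to pin down. What DataFusion itself returns for an empty array is defined by the PR's tests, not by this sketch.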

Are there any user-facing changes?

Yes. Users can now read JSON array format files by specifying the format.format_array option:

CREATE EXTERNAL TABLE my_table
STORED AS JSON
OPTIONS ('format.format_array' 'true')
LOCATION 'path/to/array.json';

Copilot AI review requested due to automatic review settings January 21, 2026 08:40
@github-actions github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) common Related to common crate proto Related to proto crate datasource Changes to the datasource crate labels Jan 21, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR extends DataFusion’s JSON support to handle files where the top-level value is a JSON array ([{...}, {...}]), in addition to the existing newline-delimited JSON (NDJSON) format.

Changes:

  • Adds a format_array: bool option to JsonOptions (config, protobuf, and JSON (de)serialization) and wires it through JsonFormat/JsonSource into the JSON execution path.
  • Implements array-aware schema inference and reading in datasource-json, including helper functions to infer schemas and read array JSON into RecordBatches, plus updates examples and tests (unit tests and sqllogictests).
  • Adds new test data (json_array.json, json_empty_array.json) and SQLLogic tests to validate array format behavior and the new OPTIONS ('format.format_array' 'true') flag.

Reviewed changes

Copilot reviewed 13 out of 17 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| datafusion/sqllogictest/test_files/json.slt | Adds end-to-end SQL tests for JSON array input, including a failure case when format_array is not set. |
| datafusion/proto/src/logical_plan/file_formats.rs | Extends the JsonOptionsProto/JsonOptions mapping to include the format_array flag for logical plan file format serialization. |
| datafusion/proto/src/generated/datafusion_proto_common.rs | Regenerated protobuf Rust bindings to add format_array to JsonOptions. |
| datafusion/proto-common/src/to_proto/mod.rs | Includes format_array when converting JsonOptions to protobuf for common proto utilities. |
| datafusion/proto-common/src/generated/prost.rs | Regenerated prost definitions to add the format_array field to JsonOptions. |
| datafusion/proto-common/src/generated/pbjson.rs | Extends JSON (de)serialization of JsonOptions to handle the format_array field. |
| datafusion/proto-common/src/from_proto/mod.rs | Maps the protobuf format_array flag back into JsonOptions. |
| datafusion/proto-common/proto/datafusion_common.proto | Adds bool format_array = 4; to the JsonOptions message definition. |
| datafusion/datasource-json/src/source.rs | Threads format_array through JsonOpener/JsonSource and adds read_json_array_to_batches; array files are read by loading the full array, converting to NDJSON, and delegating to Arrow’s JSON reader. |
| datafusion/datasource-json/src/file_format.rs | Updates JsonFormat docs and behavior to support array mode, adds with_format_array, implements infer_json_schema_from_json_array, and passes format_array into JsonSource. |
| datafusion/datasource-json/Cargo.toml | Adds serde_json as a dependency to support JSON array parsing. |
| datafusion/core/tests/data/json_empty_array.json | Provides an empty JSON array test file used by schema inference tests. |
| datafusion/core/tests/data/json_array.json | Provides a sample JSON array file used by sqllogictests. |
| datafusion/core/src/datasource/file_format/json.rs | Adds tests covering array-format JSON: schema inference, empty array behavior, inference limit, data reading, and projection handling. |
| datafusion/common/src/config.rs | Extends JsonOptions config namespace with a documented format_array flag and describes NDJSON vs array formats. |
| datafusion-examples/examples/custom_data_source/csv_json_opener.rs | Updates the custom JsonOpener::new example to pass the new format_array parameter (set to false for NDJSON). |
| Cargo.lock | Updates lockfile to account for the new serde_json dependency in datasource-json. |
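
The read path described for source.rs (load the full array, convert it to NDJSON, hand the result to a line-oriented reader) can be sketched in Python as follows. This is an illustration of the general technique only, not the PR's Rust code; `json_array_to_ndjson` is a hypothetical name.

```python
import json


def json_array_to_ndjson(array_text):
    """Re-serialize a top-level JSON array as one JSON record per line."""
    # The whole array is parsed into memory before anything is emitted.
    records = json.loads(array_text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    # json.dumps without indent emits no raw newlines; newlines inside
    # string values are escaped, so each record stays on a single line.
    return "".join(
        json.dumps(record, separators=(",", ":")) + "\n" for record in records
    )


ndjson = json_array_to_ndjson('[{"a": 1, "b": "x\\ny"}, {"a": 2}]')
lines = ndjson.splitlines()
rows = [json.loads(line) for line in lines]
```

The trade-off of this approach is memory: because the full array is loaded before conversion, peak usage grows with file size, unlike streaming NDJSON line by line.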

Development

Successfully merging this pull request may close these issues.

Support JSON arrays reader/parse for datafusion
