
Conversation

@cht42 cht42 commented Jan 20, 2026

Which issue does this PR close?

Rationale for this change

Implement Spark random functions.

What changes are included in this PR?

New Spark random functions.

Are these changes tested?

Yes, covered by sqllogictest (.slt) tests.

Are there any user-facing changes?

Yes.

@github-actions github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) spark labels Jan 20, 2026

@Jefffrey Jefffrey left a comment

Haven't fully reviewed the PR yet, but this reminds me of

Is this something we'll need to be concerned about, i.e. how the seed is treated across record batches?

ColumnarValue::Scalar(ScalarValue::Int64(None)) => 0,
_ => {
    return exec_err!(
        "`{}` function expects an Int64 seed argument",

Suggested change
-    "`{}` function expects an Int64 seed argument",
+    "`{}` function expects a constant Int64 seed argument",


cht42 commented Jan 21, 2026

> Haven't fully reviewed the PR yet, but this reminds me of
>
> Is this something we'll need to be concerned about, i.e. how the seed is treated across record batches?

oh yea, that will be an issue...

I'm curious whether the RecordBatch concept in DataFusion is a direct equivalent of a partition in Spark. What I mean is: can we expect the same determinism across record batches as across partitions in Spark? If not, we can use some internal state in the UDF to avoid the same seed across batches (an AtomicU64 we would increment on every invocation?).
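A rough sketch of that stop-gap (struct and method names here are hypothetical, not the PR's code; for reference, Spark itself seeds each partition's RNG with `seed + partitionIndex`):

```rust
// Sketch only: mix an ever-increasing counter into the user seed so two
// record batches never start from the same RNG state.
use std::sync::atomic::{AtomicU64, Ordering};

struct SparkRand {
    batch_counter: AtomicU64,
}

impl SparkRand {
    fn effective_seed(&self, user_seed: i64) -> u64 {
        // fetch_add returns the previous value, so every invocation sees a
        // distinct counter even under concurrent calls.
        let batch = self.batch_counter.fetch_add(1, Ordering::Relaxed);
        // The mixing scheme is illustrative; any injective combination works.
        (user_seed as u64).wrapping_add(batch)
    }
}
```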

}

fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
    let [seed] = take_function_args(self.name(), args.args)?;

If no seed is provided, we default to the DataFusion random implementation via the simplify call.
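Roughly the shape of that fallback (the `datafusion_functions::math::random()` path is an assumption about the crate layout; the mechanism is just returning `ExprSimplifyResult::Simplified` with a call to the built-in function):

```rust
use datafusion_common::Result;
use datafusion_expr::{Expr, ExprSimplifyResult};

// If no seed argument was given, rewrite the call to DataFusion's built-in
// random(); otherwise leave the expression untouched for invoke_with_args.
fn simplify_no_seed(args: Vec<Expr>) -> Result<ExprSimplifyResult> {
    if args.is_empty() {
        let rand_expr = datafusion_functions::math::random().call(vec![]);
        return Ok(ExprSimplifyResult::Simplified(rand_expr));
    }
    Ok(ExprSimplifyResult::Original(args))
}
```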

}
_ => {
    return exec_err!(
        "`{}` function expects a positive Int32 length argument",

Spark (v4.0.1) behaves a little differently:

Length>0 (OK):

spark-sql (default)> SELECT randstr(3, 0) AS result;
ceV
Time taken: 0.913 seconds, Fetched 1 row(s)

Length=0 (OK, empty string):

spark-sql (default)> SELECT randstr(0, 0) AS result;

Time taken: 0.043 seconds, Fetched 1 row(s)

Length<0 (NOK):

spark-sql (default)> SELECT randstr(-1, 0) AS result;
[INVALID_PARAMETER_VALUE.LENGTH] The value of parameter(s) `length` in `randstr` is invalid: Expects `length` greater than or equal to 0, but got -1. SQLSTATE: 22023
org.apache.spark.SparkRuntimeException: [INVALID_PARAMETER_VALUE.LENGTH] The value of parameter(s) `length` in `randstr` is invalid: Expects `length` greater than or equal to 0, but got -1. SQLSTATE: 22023


Updated to allow length = 0.
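The resulting check, roughly (function name and message text here are illustrative, not the PR's exact code):

```rust
use datafusion_common::{exec_err, Result};

// Match the Spark behaviour shown above: length 0 is valid (empty string),
// negative lengths are an error.
fn validate_randstr_length(len: i32) -> Result<usize> {
    if len < 0 {
        return exec_err!(
            "`randstr` expects `length` greater than or equal to 0, but got {len}"
        );
    }
    Ok(len as usize)
}
```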

}

fn invoke_with_args(&self, _args: ScalarFunctionArgs) -> Result<ColumnarValue> {
    internal_err!("`invoke_with_args` is not implemented for {}", self.name())

Is it safe to always depend on simplify()?
I think other UDFs implement both invoke_with_args() and simplify(); that way the function still works if the PhysicalExprSimplifier is not used (e.g. the optimizer is disabled) or if the UDF is called directly.
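For illustration, an unseeded invoke_with_args() might look roughly like this (using rand's thread-local RNG, similar to what DataFusion's built-in random() does; not the PR's actual code):

```rust
use std::sync::Arc;

use arrow::array::Float64Array;
use datafusion_common::Result;
use datafusion_expr::ColumnarValue;
use rand::Rng;

// Produce one random Float64 per row, so the UDF still works when the
// optimizer never calls simplify().
fn invoke_unseeded(number_rows: usize) -> Result<ColumnarValue> {
    let mut rng = rand::thread_rng();
    let values: Float64Array = (0..number_rows).map(|_| Some(rng.gen::<f64>())).collect();
    Ok(ColumnarValue::Array(Arc::new(values)))
}
```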



Ok(ExprSimplifyResult::Simplified(
    min.clone()
        .add((max.sub(min)).mul(rand_expr))

Would it be possible for min to be a volatile Expr, e.g. random()?
In that case the two usages above would evaluate to two different values.


min should be a literal; will update the logic to verify that.
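A minimal guard for that (the surrounding simplify() body is elided; the helper name is hypothetical):

```rust
use datafusion_expr::Expr;

// Only rewrite `min + (max - min) * rand()` when both bounds are literals;
// a volatile `min` (e.g. random()) would otherwise be evaluated twice with
// different results.
fn bounds_are_constant(min: &Expr, max: &Expr) -> bool {
    matches!(min, Expr::Literal(..)) && matches!(max, Expr::Literal(..))
}
```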

SELECT random(0::integer);
----
0.324575268031407

@martin-g martin-g Jan 21, 2026

Suggested change
query B
SELECT random() > 0;
----
true

To test random() without a seed.

Update: Actually random() is used below, so it is tested. I guess it works due to the simplification, but it would fail to read the seed when executed directly, i.e. without simplification.

@Jefffrey

> I'm curious whether the RecordBatch concept in DataFusion is a direct equivalent of a partition in Spark. What I mean is: can we expect the same determinism across record batches as across partitions in Spark?

Might need someone from Comet or Sail to chip in; they might be more familiar with how concepts map between DataFusion and Spark.

> If not, we can use some internal state in the UDF to avoid the same seed across batches (an AtomicU64 we would increment on every invocation?).

This could be a good stop-gap solution in the meantime 🤔


Development

Successfully merging this pull request may close these issues.

[datafusion-spark] implement spark random functions
