feat(spark): add spark random functions #19908

cht42 · 2026-01-20T13:49:13Z

Which issue does this PR close?

Rationale for this change

Implement Spark random functions:

What changes are included in this PR?

New Spark random functions.

Are these changes tested?

Yes, in the sqllogictest (.slt) files.

Are there any user-facing changes?

yes

cht42 · 2026-01-20T13:49:40Z

For reference, the Spark random functions are implemented here: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/randomExpressions.scala

Jefffrey

Haven't fully reviewed the PR yet, but this reminds me of #17686.

Is this something we'll need to be concerned about, specifically how the seed is treated across record batches?

Jefffrey · 2026-01-21T04:02:39Z

datafusion/spark/src/function/math/random.rs

+            ColumnarValue::Scalar(ScalarValue::Int64(None)) => 0,
+            _ => {
+                return exec_err!(
+                    "`{}` function expects an Int64 seed argument",


Suggested change

"`{}` function expects an Int64 seed argument",

"`{}` function expects a constant Int64 seed argument",

cht42 · 2026-01-21T05:49:14Z

> Haven't fully reviewed the PR yet, but this reminds me of
> support seed for random function #17686
>
> Is this something we'll need to be concerned about? In how seed is treated across record batches

Oh yeah, that will be an issue...

I'm curious whether the RecordBatch concept in DataFusion is a direct equivalent of a partition in Spark. That is, can we expect the same determinism across record batches as across partitions in Spark? If not, we could keep some internal state in the UDF to avoid reusing the same seed across batches (for example, an AtomicU64 incremented on every invocation).
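The AtomicU64 idea floated above could look roughly like the following. This is only a sketch under stated assumptions, not the PR's code: `BatchSeed` and `next_seed` are hypothetical names, and combining the user seed with the counter by wrapping addition is one arbitrary choice among several.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical helper (not part of the PR): hands out a different
/// seed on every invocation so that consecutive record batches do not
/// restart the generator from the same state.
pub struct BatchSeed {
    /// Seed supplied by the user in the SQL call.
    base: u64,
    /// Number of invocations seen so far.
    invocations: AtomicU64,
}

impl BatchSeed {
    pub fn new(base: u64) -> Self {
        Self {
            base,
            invocations: AtomicU64::new(0),
        }
    }

    /// Returns the seed to use for the next batch.
    /// Assumption: mixing by wrapping addition is good enough for a sketch.
    pub fn next_seed(&self) -> u64 {
        let n = self.invocations.fetch_add(1, Ordering::Relaxed);
        self.base.wrapping_add(n)
    }
}
```

Note that the seed each batch receives depends on the order in which batches reach the UDF, so a counter like this avoids repeated values but does not by itself guarantee the same output as Spark's per-partition seeding.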

martin-g · 2026-01-21T07:10:30Z

datafusion/spark/src/function/math/random.rs

+    }
+
+    fn invoke_with_args(&self, args: ScalarFunctionArgs) -> Result<ColumnarValue> {
+        let [seed] = take_function_args(self.name(), args.args)?;


What happens if the UDF is called without any arguments?
The seed is optional: https://github.com/apache/datafusion/pull/19908/changes#diff-ba2db3d7d9cf4ab01c5a4186821e481786accc221e203b3a31450f8ca6d5c473R59

randr impl below is checking the args length: https://github.com/apache/datafusion/pull/19908/changes#diff-ba2db3d7d9cf4ab01c5a4186821e481786accc221e203b3a31450f8ca6d5c473R191-R194

If no seed is provided, we default to the DataFusion random implementation through the simplify call.
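To make the optional-seed handling discussed here concrete, the sketch below dispatches on the number of arguments. It is an illustration only: `seed_from_args` is a hypothetical function, the arguments are modeled as plain `Option<i64>` values rather than DataFusion's `ColumnarValue`, and the only behavior taken from the diff is that a NULL seed maps to 0.

```rust
/// Hypothetical helper (not part of the PR): resolves the optional seed.
/// Each argument is modeled as `Option<i64>`, where `None` stands for SQL NULL.
pub fn seed_from_args(args: &[Option<i64>]) -> Result<Option<i64>, String> {
    match args {
        // No argument: there is no seed, so the caller would fall back
        // to the unseeded random implementation.
        [] => Ok(None),
        // NULL seed: treated as 0, mirroring the `Int64(None) => 0` arm in the diff.
        [None] => Ok(Some(0)),
        // Explicit seed.
        [Some(seed)] => Ok(Some(*seed)),
        // Anything else is a usage error.
        _ => Err(format!(
            "expected at most one seed argument, got {}",
            args.len()
        )),
    }
}
```

Handling the zero-argument case explicitly is what would let the function work even when it is invoked without a prior simplification step.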

martin-g · 2026-01-21T07:23:18Z

datafusion/spark/src/function/math/random.rs

+            }
+            _ => {
+                return exec_err!(
+                    "`{}` function expects a positive Int32 length argument",


Spark (v4.0.1) behaves a little differently:

Length > 0 (OK):

spark-sql (default)> SELECT randstr(3, 0) AS result;
ceV
Time taken: 0.913 seconds, Fetched 1 row(s)

Length = 0 (OK, empty string):

spark-sql (default)> SELECT randstr(0, 0) AS result;
Time taken: 0.043 seconds, Fetched 1 row(s)

Length < 0 (not OK):

spark-sql (default)> SELECT randstr(-1, 0) AS result;
[INVALID_PARAMETER_VALUE.LENGTH] The value of parameter(s) `length` in `randstr` is invalid: Expects `length` greater than or equal to 0, but got -1. SQLSTATE: 22023
org.apache.spark.SparkRuntimeException: [INVALID_PARAMETER_VALUE.LENGTH] The value of parameter(s) `length` in `randstr` is invalid: Expects `length` greater than or equal to 0, but got -1. SQLSTATE: 22023

Updated to accept length = 0.
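The Spark behavior quoted above (zero is accepted and yields an empty string, negative values are rejected) can be captured in a small check. This is a minimal sketch: `validate_randstr_length` is a hypothetical name, and the error text is modeled loosely on Spark's message rather than copied from the PR.

```rust
/// Hypothetical validator (not part of the PR): accepts any length >= 0
/// and rejects negative lengths, matching the Spark 4.0.1 output above.
pub fn validate_randstr_length(length: i32) -> Result<usize, String> {
    if length < 0 {
        return Err(format!(
            "`randstr` expects `length` greater than or equal to 0, but got {}",
            length
        ));
    }
    // A length of 0 is valid and corresponds to an empty string.
    Ok(length as usize)
}
```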

martin-g · 2026-01-21T07:29:39Z

datafusion/spark/src/function/math/random.rs

+    }
+
+    fn invoke_with_args(&self, _args: ScalarFunctionArgs) -> Result<ColumnarValue> {
+        internal_err!("`invoke_with_args` is not implemented for {}", self.name())


Is it safe to always depend on simplify()?
I think other UDFs implement both invoke_with_args() and simplify(). That way the function still works if PhysicalExprSimplifier is not used (disabled optimizer) or if the UDF is called directly.

Yes, I think we can consider it safe; there are a number of places where only simplify is implemented: https://github.com/search?q=repo%3Aapache%2Fdatafusion+ExprSimplifyResult+language%3ARust+path%3A%2F%5Edatafusion%5C%2Ffunctions%5C%2Fsrc%5C%2F%2F&type=code

martin-g · 2026-01-21T07:43:46Z

datafusion/spark/src/function/math/random.rs

+
+        Ok(ExprSimplifyResult::Simplified(
+            min.clone()
+                .add((max.sub(min)).mul(rand_expr))


Could min be a volatile Expr, e.g. random()?
In that case the two usages above would evaluate to two different values.

min should be a literal; I will update the logic to verify that.
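The concern is that the rewrite `min + (max - min) * rand` mentions `min` twice, so a volatile `min` would be evaluated twice with different results. The toy model below illustrates the literal check proposed in the reply. It is a sketch only: the `Expr` enum and `rewrite_uniform` are hypothetical stand-ins and do not use DataFusion's real expression types.

```rust
/// Toy expression tree (an assumption for illustration, not DataFusion's `Expr`).
#[derive(Clone, Debug, PartialEq)]
pub enum Expr {
    Literal(f64),
    /// A volatile call such as random(): each evaluation yields a new value.
    Random,
    Add(Box<Expr>, Box<Expr>),
    Sub(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

/// Rewrites uniform(min, max) into `min + (max - min) * random()`.
/// Because `min` appears twice in the result, it must be a literal;
/// otherwise the two copies could evaluate to different values.
pub fn rewrite_uniform(min: Expr, max: Expr) -> Result<Expr, String> {
    if !matches!(min, Expr::Literal(_)) {
        return Err("`uniform` expects a constant `min` argument".to_string());
    }
    let range = Expr::Sub(Box::new(max), Box::new(min.clone()));
    let scaled = Expr::Mul(Box::new(range), Box::new(Expr::Random));
    Ok(Expr::Add(Box::new(min), Box::new(scaled)))
}
```

Rejecting a non-literal `min` up front keeps the rewrite safe without having to introduce a shared sub-expression.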

martin-g · 2026-01-21T09:20:20Z

datafusion/sqllogictest/test_files/spark/math/random.slt

+SELECT random(0::integer);
+----
+0.324575268031407
+


Suggested change

query B

SELECT random() > 0;

----

true

to test random() without a seed

Update: Actually random() is used below, so it is tested. I guess it works due to the simplification but it would fail to read the seed when executed directly, i.e. without simplification.

Jefffrey · 2026-01-21T15:19:33Z

> I'm curious if the RecordBatch concept in Datafusion is a direct equivalent of a partition in Spark ? what i mean is can we expect the same determinism in record batches as partitions in spark ?

Someone from Comet or Sail might need to chip in; they are likely more familiar with how concepts map between DataFusion and Spark.

> If not, then we can use some internal state in the UDF to avoid the same seed across batches (AtomicU64 we would increment on every invocation ?)

This could be a good stop-gap solution in the meantime 🤔

cht42 added 5 commits January 20, 2026 16:30

feat: add rand_distr dependency to Cargo.toml files

ee74e2a

feat(spark): Add random functions

6631390

test(spark): test random functions

51efae4

feat(spark): add uniform function

c833133

fix(spark): update error messages for random functions to specify arg…

fc6f9a6

…ument types

github-actions bot added the core (Core DataFusion crate), sqllogictest (SQL Logic Tests (.slt)), and spark labels Jan 20, 2026

fix(spark): update SparkRandom instantiation to use crate::function::…

7536ade

…math::random

Jefffrey reviewed Jan 21, 2026

martin-g reviewed Jan 21, 2026

cht42 added 2 commits January 21, 2026 14:07

fix

a28defa

fix(spark): update error messages to specify constant arguments for r…

b5870f3

…andom functions