Skip to content

Conversation

@kazantsev-maksim
Copy link
Contributor

@kazantsev-maksim kazantsev-maksim commented Jan 25, 2026

Which issue does this PR close?

N/A

Rationale for this change

What changes are included in this PR?

  1. Added native Comet support for Spark's elt function
  2. Uses the SparkElt implementation from the datafusion-spark crate

How are these changes tested?

  1. Added unit test in CometStringExpressionSuite.scala
  2. Added new benchmark test in CometStringExpressionBenchmark.scala

Microbenchmark result:

elt

object CometElt extends CometScalarFunction[Elt]("elt") {

override def getSupportLevel(expr: Elt): SupportLevel = {
if (expr.failOnError) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

object CometElt extends CometScalarFunction[Elt]("elt") {

override def getSupportLevel(expr: Elt): SupportLevel = {
if (expr.failOnError) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR @kazantsev-maksim .Minor nitpick : We generally add ANSI mode not supported to make it clear to the user. I can work in enabling support for ANSI mode once this PR is merged as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, fixed


override def getSupportLevel(expr: Elt): SupportLevel = {
if (expr.failOnError) {
return Unsupported(Some("failOnError=true is not supported"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

sql(s"SELECT elt(cast(null as int), ${schema.fieldNames.mkString(",")}) FROM t1"))
checkSparkAnswerMaybeThrows(sql("SELECT elt(1) FROM t1")) match {
case (Some(spark), Some(comet)) =>
assert(spark.getMessage.contains("WRONG_NUM_ARGS.WITHOUT_SUGGESTION"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor : perhaps a good idea to capture the error msg "WRONG_NUM_ARGS.WITHOUT_SUGGESTION" in a string ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, fixed

StringExprConfig("trim", "select trim(c1) from parquetV1Table"),
StringExprConfig("upper", "select upper(c1) from parquetV1Table"))
StringExprConfig("upper", "select upper(c1) from parquetV1Table"),
StringExprConfig("elt", "select elt(2, c1, c1) from parquetV1Table"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you for adding this in the benchmark setup

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor

@coderfender coderfender left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. Left minor comments

override def getSupportLevel(expr: Elt): SupportLevel = {
if (expr.failOnError) {
return Unsupported(Some("failOnError=true is not supported"))
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also might want to check if the inputs are a number and an array to make sure we terminate early in case the user provides malformed input

Copy link
Contributor

@coderfender coderfender Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Relevant examples : CometLength , CometRPad , CometLPad

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first argument can be a numeric string - '2', it seems we can't cover all cases here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can do a selective support for now (say Int, String) inputs for now and file an issue for further enhancements down the line. WDYT @kazantsev-maksim ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added it. @coderfender Can you please see?

@codecov-commenter
Copy link

codecov-commenter commented Jan 27, 2026

Codecov Report

❌ Patch coverage is 26.31579% with 14 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.91%. Comparing base (f09f8af) to head (8e3f482).
⚠️ Report is 907 commits behind head on main.

Files with missing lines Patch % Lines
...rc/main/scala/org/apache/comet/serde/strings.scala 17.64% 10 Missing and 4 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3269      +/-   ##
============================================
+ Coverage     56.12%   59.91%   +3.78%     
- Complexity      976     1462     +486     
============================================
  Files           119      175      +56     
  Lines         11743    16185    +4442     
  Branches       2251     2687     +436     
============================================
+ Hits           6591     9697    +3106     
- Misses         4012     5136    +1124     
- Partials       1140     1352     +212     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.


private val WRONG_NUM_ARGS_WITHOUT_SUGGESTION_EXCEPTION_MSG =
"[WRONG_NUM_ARGS.WITHOUT_SUGGESTION] The `elt` requires > 1 parameters but the actual number is 1."

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor : given that this message is only applicable for your expression test , may be we could move it inside the test itself ?

.mkString(", ")
return Unsupported(
Some(
s"Parameters 2 and onwards require the string type, but contains: $unsupportedTypes"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor : we could use StringType.toSQL to get SQL test for StringType to make it easier for user's to parse error messages

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor : might be a good idea to not do distinct here ..
Some of the options I could think of are

  1. Exhaustive error message : elt(1 , "apple", "banana", 10 , true , "StringType") will produce Parameters 2 and onwards require the string type, but contains: StringType, IntegerType , BooleanType and could keep the user guessing . However, if we remove distinct and preserve order , it might be Parameters 2 and onwards require the string type, but datatypes supplied are StringType, IntegerType , BooleanType , StringType

  2. Break at first non string type and return unsupported to reduce redundancy

But we could do that in a follow up PR by raising an issue on github too

Copy link
Contributor

@coderfender coderfender left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left minor comments

.mkString(", ")
return Unsupported(
Some(
s"Parameters 2 and onwards require the string type, but contains: $unsupportedTypes"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor : might be a good idea to not do distinct here ..
Some of the options I could think of are

  1. Exhaustive error message : elt(1 , "apple", "banana", 10 , true , "StringType") will produce Parameters 2 and onwards require the string type, but contains: StringType, IntegerType , BooleanType and could keep the user guessing . However, if we remove distinct and preserve order , it might be Parameters 2 and onwards require the string type, but datatypes supplied are StringType, IntegerType , BooleanType , StringType

  2. Break at first non string type and return unsupported to reduce redundancy

But we could do that in a follow up PR by raising an issue on github too

assert(comet.getMessage.contains(WRONG_NUM_ARGS_WITHOUT_SUGGESTION_EXCEPTION_MSG))
case (spark, comet) =>
fail(
s"Expected Spark and Comet to throw exception, but got\nSpark: $spark\nComet: $comet")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be great if you can add more tests to cover other code paths

  1. Non string data types from pos 2
  2. ANSI mode should fall back successfully to Spark
  3. Index out of range and make sure we return NULL
  4. Verify that first arg being Int is being enforced properly (perhaps pass a string / long type input as a negative test)

"idx",
lit(Random.shuffle(indexes ++ edgeCasesIndexes).headOption.getOrElse(-1)))
.createOrReplaceTempView("t1")
checkSparkAnswerAndOperator(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would also be worth adding a literal test with constant folding disabled, something like:

                                                                                                                                                                                 
  withSQLConf(SQLConf.OPTIMIZER_EXCLUDED_RULES.key ->                                                                                                                             
      "org.apache.spark.sql.catalyst.optimizer.ConstantFolding") {                                                                                                                
    checkSparkAnswerAndOperator("SELECT elt(2, 'a', 'b', 'c')")          

It is totally fine to fall back to Spark if all inputs are literal.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants