[Bug]: AutoML time series: XGBoost prediction fails with “feature names should match” (date column) #1480

@carolinsmilie-cmd

Description

Describe the bug

Hi everyone,
I’m trying AutoML for time series forecasting for the first time and I’m stuck on an error. Any help would be greatly appreciated!

I trained two models for two different target variables and now want to predict the next three months for each.
Because I couldn’t train on daily data, I aggregated dates to the first day of each month.
Both the training and prediction tables come from the same file, and their column names match.
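For context, the monthly aggregation was done roughly like this (a minimal pandas sketch with placeholder column names, not my exact code):

```python
import pandas as pd

# Toy daily data; "value" stands in for the real target columns.
daily = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03"]),
    "value": [10.0, 20.0, 30.0],
})

# Resample to month-start ("MS") so every row's date becomes
# the first day of its month.
monthly = (
    daily.set_index("date")
         .resample("MS")
         .sum()
         .reset_index()
)
```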
The model with the Extra Trees learner runs predictions without issues.
However, the XGBoost model fails with:

ValueError: The feature names should match those that were passed during fit.
Feature names unseen at fit time: - date

What’s confusing:

- The MLflow model signature shows `date` as an expected input.
- I tried both date-only and datetime formats for the column, but the error persists.

Any ideas on how to resolve this feature-name mismatch for XGBoost (especially around the date column) would be amazing.
Thank you!

Steps to reproduce

df = spark.read.format("delta").load(
    "abfss://@onelake.dfs.fabric.microsoft.com//Tables/"
)

model = MLFlowTransformer(
    inputCols=["date", "feature_1", "feature_2", ...],
    outputCol="target_prediction",
    modelName="",
    modelVersion=
)
df = model.transform(df)

df.write.format("delta").mode("overwrite").save(
    "abfss://@onelake.dfs.fabric.microsoft.com//Tables/"
)

Error Message:

File "/home/trusted-service-user/cluster-env/trident_env/lib/python3.11/site-packages/mlflow/pyfunc/__init__.py", line 716, in predict
return self._predict_fn(data, params=params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/trusted-service-user/cluster-env/trident_env/lib/python3.11/site-packages/mlflow/sklearn/__init__.py", line 543, in predict
return self.sklearn_model.predict(data)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/trusted-service-user/cluster-env/trident_env/lib/python3.11/site-packages/sklearn/pipeline.py", line 600, in predict
Xt = transform.transform(Xt)
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/trusted-service-user/cluster-env/trident_env/lib/python3.11/site-packages/sklearn/utils/_set_output.py", line 313, in wrapped
data_to_wrap = f(self, X, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/trusted-service-user/cluster-env/trident_env/lib/python3.11/site-packages/flaml/fabric/autofe.py", line 444, in transform
return self._transform(X)
^^^^^^^^^^^^^^^^^^
File "/home/trusted-service-user/cluster-env/trident_env/lib/python3.11/site-packages/flaml/fabric/autofe.py", line 403, in _transform
raw_res = self.pipeline.transform(X)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/trusted-service-user/cluster-env/trident_env/lib/python3.11/site-packages/sklearn/pipeline.py", line 903, in transform
Xt = transform.transform(Xt, **routed_params[name].transform)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/trusted-service-user/cluster-env/trident_env/lib/python3.11/site-packages/sklearn/utils/_set_output.py", line 313, in wrapped
data_to_wrap = f(self, X, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/trusted-service-user/cluster-env/trident_env/lib/python3.11/site-packages/sklearn/decomposition/_base.py", line 143, in transform
X = self._validate_data(
^^^^^^^^^^^^^^^^^^^^
File "/home/trusted-service-user/cluster-env/trident_env/lib/python3.11/site-packages/sklearn/base.py", line 608, in _validate_data
self._check_feature_names(X, reset=reset)
File "/home/trusted-service-user/cluster-env/trident_env/lib/python3.11/site-packages/sklearn/base.py", line 535, in _check_feature_names
raise ValueError(message)
ValueError: The feature names should match those that were passed during fit.
Feature names unseen at fit time:
- date

    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:572)
    at org.apache.spark.sql.execution.python.PythonArrowOutput$$anon$1.read(PythonArrowOutput.scala:118)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:525)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
    at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.writeWithIterator(FileFormatDataWriter.scala:121)
    at org.apache.spark.sql.delta.files.DeltaFileFormatWriter$.$anonfun$executeTask$3(DeltaFileFormatWriter.scala:603)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1397)
    at org.apache.spark.sql.delta.files.DeltaFileFormatWriter$.executeTask(DeltaFileFormatWriter.scala:611)
    ... 12 more

Model Used

No response

Expected Behavior

No response

Screenshots and logs

No response

Additional Information

No response

Metadata

Assignees

No one assigned

Labels

bug (Something isn't working), need more info (Can't address without more information)
