27 changes: 21 additions & 6 deletions flaml/automl.py
@@ -1026,12 +1026,27 @@ def _validate_ts_data(
assert (
dataframe[[dataframe.columns[0]]].duplicated() is None
), "Duplicate timestamp values with different values for other columns."
ts_series = pd.to_datetime(dataframe[dataframe.columns[0]])
inferred_freq = pd.infer_freq(ts_series)
if inferred_freq is None:
logger.warning(
"Missing timestamps detected. To avoid error with estimators, set estimator list to ['prophet']. "
)
if self._state.task == TS_FORECASTPANEL:
# check for each time series independently
group_ids = self._state.fit_kwargs.get("group_ids")
unique_ids = dataframe[group_ids].value_counts().reset_index()[group_ids]
for _, row in unique_ids.iterrows():
df = dataframe.copy()
for id in group_ids:
ts = df.loc[df[id] == row[id]]
ts_series = pd.to_datetime(ts[ts.columns[0]])
inferred_freq = pd.infer_freq(ts_series)
if inferred_freq is None:
logger.warning(
"Missing timestamps detected. To avoid error with estimators, set estimator list to ['prophet']. "
)
Contributor:
There are multiple problems:

  1. Does prophet handle panel data? Does TFT handle missing timestamps? If TFT can handle missing timestamps, this check is not necessary. If prophet can't handle panel data, this message shouldn't suggest adding prophet.
  2. The for loop is not efficient; there should be a functional way using groupby().
  3. Do you intend to infer the frequency for each series? The current code only does it for the last one.
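
The groupby-based, per-series check the reviewer suggests could look like the sketch below (the function and column names are illustrative, not part of FLAML's API). `pd.infer_freq` returns `None` when it cannot find a regular frequency, which is exactly the gap condition being tested for:

```python
import pandas as pd

def find_groups_with_gaps(df, time_col, group_ids):
    # Infer the frequency of every panel series in one groupby pass.
    # pd.infer_freq returns None for an irregular series, i.e. one with gaps.
    freqs = df.groupby(group_ids)[time_col].apply(
        lambda s: pd.infer_freq(pd.to_datetime(s))
    )
    # Return the ids of the series whose frequency could not be inferred.
    return freqs[freqs.isna()].index.tolist()
```

This also addresses the reviewer's third point: every series is checked, not just the last one iterated over.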

Collaborator (Author):
  1. Yea, the logger message is incorrect. We need there to be no missing timestamps in order to create the "time_idx" column. See the add_time_idx_col(X) function in data.py.

Also looked into TFT's handling of missing data a bit more. By default, allow_missing_timesteps for the TimeSeriesDataset object is False, so currently it does not handle missing timesteps. It would be reasonable to turn it on. Another thing to consider: by default, TimeSeriesDataset uses a forward-fill strategy (filling based on previous data) to handle missing data, but it also allows a constant_fill_strategy, where the user supplies a "dictionary of column names with constants to fill in missing values if there are gaps in the sequence".
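
For reference, the forward-fill behaviour described above can be sketched in plain pandas (the function, the column names, and the daily frequency are illustrative assumptions, not TFT code): reindex the series onto a complete date range, then fill each gap from the previous row.

```python
import pandas as pd

def forward_fill_gaps(ts, time_col, freq="D"):
    # Move the timestamp column into the index, build the complete index
    # at the given frequency, and forward-fill the rows that were missing.
    ts = ts.set_index(pd.to_datetime(ts[time_col])).drop(columns=[time_col])
    full_idx = pd.date_range(ts.index.min(), ts.index.max(), freq=freq)
    return ts.reindex(full_idx).ffill()
```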

Perhaps with this, we should allow for missing timesteps and find a different solution to creating a "time_idx" column. @EgorKraevTransferwise @markharley what do you think? We had a conversation about this before in our first meeting and Egor suggested to just assume no missing time steps for simplicity of code.

  2. I did try to use groupby at first but could not find a way. Will try again.
  3. Yea, the indentation was wrong... 1037 to 1042 should be indented.

Collaborator (Author):
To clarify, TimeSeriesDataset only handles missing timestamps; it does not handle NA values.

Collaborator (Author):
@EgorKraevTransferwise and @markharley Our current solution to this issue will be to either require no missing timestamps or require the user to provide a freq argument.
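
Under that resolution, the "time_idx" column can be derived directly once a frequency is known. A minimal sketch (the helper name is hypothetical, loosely modelled on add_time_idx_col in data.py, and it assumes a fixed-length frequency unit such as "D"):

```python
import pandas as pd

def add_time_idx(df, time_col, freq_unit="D"):
    # Integer offset of each timestamp from the series start, in units of
    # the given fixed-length frequency.
    ts = pd.to_datetime(df[time_col])
    step = pd.Timedelta(1, unit=freq_unit)
    out = df.copy()
    out["time_idx"] = ((ts - ts.min()) // step).astype(int)
    return out
```

Note that with a gap in the timestamps the resulting time_idx is non-contiguous, which is precisely why the validation above either rejects gaps or needs an explicit freq to reason about them.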

else:
ts_series = pd.to_datetime(dataframe[dataframe.columns[0]])
inferred_freq = pd.infer_freq(ts_series)
if inferred_freq is None:
logger.warning(
"Missing timestamps detected. To avoid error with estimators, set estimator list to ['prophet']. "
)
if y_train_all is not None:
return dataframe.iloc[:, :-1], dataframe.iloc[:, -1]
return dataframe