Correct formatting of Multi-Features Time Series dataset

Hello Everyone,
I am working on a time series forecasting task with a .csv file containing on the order of millions of rows. I am trying to use the TimeSeriesForecasting notebook to establish a baseline, and I am following the datasets notebook to put my dataframe into the appropriate format for a Hugging Face time series dataset.

The following is approximately the structure of my dataframe:

  • index (COHORT_MONTH): the index of the dataframe is the month of the year, so “freq=M”
  • MSISDN (item_id): int64 ====> unique phone number identifier
  • TOT_DATA_REV (target): float32 ====> the target variable: real dynamic feature
  • TOT_OUTG_REV: float32 ====> independent variable: real dynamic feature
  • TOT_VOICE_REV: float32 ====> independent variable: real dynamic feature
  • NB_OUTG_OFFNET_SMS: int16 ====> independent variable: dynamic feature
  • NB_OUTG_INTER_SMS: int16 ====> independent variable: dynamic feature

The real dataframe has 31 dynamic features in total.

Since all of the time series start at the same month and some have missing time steps, I used the following code from the GluonTS pandas DataFrame tutorial:

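import pandas as pd
from gluonts.dataset.pandas import PandasDataset

# Latest timestamp across all series, used as the common end date
max_end = max(df.groupby("MSISDN").apply(lambda _df: _df.index[-1]))
dfs_dict = {}
for item_id, gdf in df.groupby("MSISDN"):
    # Monthly index ("M" = month end; newer pandas versions spell this "ME")
    new_index = pd.date_range(gdf.index[0], end=max_end, freq="M")
    # Reindexing inserts NaN rows for the missing months
    dfs_dict[item_id] = gdf.reindex(new_index).drop("MSISDN", axis=1)

ds = PandasDataset(dfs_dict, target="TOT_DATA_REV")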
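
To make the reindexing step concrete, here is a minimal sketch on a toy monthly series. The values, the single column, and the end date are made up for illustration, and it assumes month-end timestamps (as in the dates mentioned later in the thread). It shows that reindex inserts NaN rows for the months that are missing.

```python
import pandas as pd

# Toy series for one hypothetical MSISDN: February is missing.
gdf = pd.DataFrame(
    {"TOT_DATA_REV": [10.0, 12.5, 11.0]},
    index=pd.to_datetime(["2022-01-31", "2022-03-31", "2022-04-30"]),
)

# Full month-end index from the first observation to an assumed common end date.
max_end = pd.Timestamp("2022-05-31")
new_index = pd.date_range(gdf.index[0], end=max_end, freq=pd.offsets.MonthEnd())

# Months absent from the original index (February and May here) become NaN rows.
filled = gdf.reindex(new_index)
print(filled)
```

Passing an offset object such as pd.offsets.MonthEnd() instead of a frequency string is just one way to stay independent of the alias spelling, which differs between pandas versions.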

My confusion comes from this code snippet in the Hugging Face tutorial:

from datasets import Dataset, Features, Value, Sequence

features  = Features(
    {    
        "start": Value("timestamp[s]"),
        "target": Sequence(Value("float32")),
        "feat_static_cat": Sequence(Value("uint64")),
        # "feat_static_real":  Sequence(Value("float32")),
        # "feat_dynamic_real": Sequence(Sequence(Value("uint64"))),
        # "feat_dynamic_cat": Sequence(Sequence(Value("uint64"))),
        "item_id": Value("string"),
    }
)

From my understanding, my case would only be concerned with start, target, and feat_dynamic_real.

Now, my questions:
1- Do I understand correctly that I have to create a dictionary with the keys start: min(COHORT_MONTH), target: TOT_DATA_REV, feat_dynamic_real: [the 31 real dynamic features], and item_id: MSISDN?
I tried doing the above and it is excruciatingly slow: about 10,000 MSISDNs per 40 minutes, when I have about 5 million.

2- Could just creating a Feature for each of my features work?

Yup, that does the job and would fit the notebook better; otherwise you’d end up with a Sequence(Sequence(Value("float32"))) (31 * window_length floating-point values).
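
For reference, the nested layout mentioned in this reply can be sketched as follows. This is only an illustration under assumptions: to_record is a hypothetical helper, the toy values are invented, and only two of the 31 feature columns are shown. It follows the GluonTS convention that feat_dynamic_real has shape (num_features, num_time_steps).

```python
import numpy as np
import pandas as pd

def to_record(item_id, gdf, target_col, feature_cols):
    """Turn one series (a DataFrame indexed by time) into a GluonTS-style dict."""
    return {
        "start": gdf.index[0],
        "target": gdf[target_col].to_numpy(dtype=np.float32),
        # Transpose so that each inner sequence is one feature over time:
        # shape (num_features, num_time_steps).
        "feat_dynamic_real": gdf[feature_cols].to_numpy(dtype=np.float32).T,
        "item_id": str(item_id),
    }

# Toy data; column names follow the post, values are invented.
gdf = pd.DataFrame(
    {
        "TOT_DATA_REV": [1.0, 2.0, 3.0],
        "TOT_OUTG_REV": [0.5, 0.6, 0.7],
        "NB_OUTG_INTER_SMS": [4, 0, 2],
    },
    index=pd.date_range("2022-01-31", periods=3, freq=pd.offsets.MonthEnd()),
)
record = to_record(123456, gdf, "TOT_DATA_REV", ["TOT_OUTG_REV", "NB_OUTG_INTER_SMS"])
print(record["feat_dynamic_real"].shape)  # (2, 3)
```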

Trying to create a separate feature for each column kept giving me errors. It seems I didn’t understand how Dataset.from_list(dict_list, features=features) actually works.

I ended up using:

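from datasets import Dataset, Features, Value, Sequence

features = Features(
    {
        "start": Value("timestamp[s]"),
        "target": Sequence(Value("float32")),
        "item_id": Value("string"),
        # One inner sequence per dynamic feature; float32 matches the revenue columns
        "feat_dynamic_real": Sequence(Sequence(Value("float32"))),
    }
)

# dict_list: one dict per MSISDN with the keys declared above
dataset = Dataset.from_list(dict_list, features=features)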
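
Regarding the slow construction of dict_list mentioned earlier, one possible approach is to sort the dataframe once and slice NumPy arrays at the item boundaries instead of building a small DataFrame per MSISDN. The sketch below is an assumption-laden illustration, not a measured fix: iter_records is a hypothetical helper, the toy data is invented, and it expects a non-empty dataframe whose missing time steps have already been filled.

```python
import numpy as np
import pandas as pd

def iter_records(df, id_col, target_col, feature_cols):
    """Yield one dict per series by slicing arrays at item boundaries."""
    # A stable sort keeps the original time order within each series.
    df = df.sort_values(id_col, kind="stable")
    ids = df[id_col].to_numpy()
    starts = df.index.to_numpy()
    target = df[target_col].to_numpy(dtype=np.float32)
    feats = df[feature_cols].to_numpy(dtype=np.float32)
    # Positions where a new item_id begins, plus the end of the array.
    bounds = np.flatnonzero(np.r_[True, ids[1:] != ids[:-1], True])
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        yield {
            "start": pd.Timestamp(starts[lo]),
            "target": target[lo:hi],
            "feat_dynamic_real": feats[lo:hi].T,  # (num_features, num_time_steps)
            "item_id": str(ids[lo]),
        }

# Toy data with two interleaved series; values are invented.
df = pd.DataFrame(
    {
        "MSISDN": [1, 2, 1, 2],
        "TOT_DATA_REV": [1.0, 5.0, 2.0, 6.0],
        "TOT_OUTG_REV": [0.1, 0.5, 0.2, 0.6],
    },
    index=pd.to_datetime(["2022-01-31", "2022-01-31", "2022-02-28", "2022-02-28"]),
)
dict_list = list(iter_records(df, "MSISDN", "TOT_DATA_REV", ["TOT_OUTG_REV"]))
print(len(dict_list))
```

The resulting list could be passed to Dataset.from_list as above, or the generator could be handed to Dataset.from_generator to avoid holding every record in memory at once. Whether this removes the bottleneck depends on where the time is actually going, so it is worth profiling first.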

A follow-up question
Is there a way to create train and test sets from the resulting Hugging Face dataset without storing the data twice, like:
train_set: dataset(2022-01-31 to 2022-12-31)
test_set: dataset(2022-01-31 to 2023-04-30)

You can use dataset.select(...) and pass the indices that you’d use for train or test

Actually, in the Time Series Transformer tutorial I am following, the test and train sets contain the same samples; the test samples just have more time steps than the train samples. This question might be more appropriate for the Transformers category, I think.
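
The relationship described here, where each train sample is the matching test sample with its last steps removed, can be sketched in plain Python. This is a hypothetical helper rather than the tutorial's own code; the name make_train_record and the toy record are assumptions, and prediction_length must be greater than zero.

```python
def make_train_record(test_record, prediction_length):
    """Return a copy of a record with the last `prediction_length` steps removed."""
    train = dict(test_record)
    train["target"] = test_record["target"][:-prediction_length]
    if "feat_dynamic_real" in test_record:
        # Truncate every feature row so it stays aligned with the target.
        train["feat_dynamic_real"] = [
            feature[:-prediction_length]
            for feature in test_record["feat_dynamic_real"]
        ]
    return train

# Toy record; values are invented.
test_record = {
    "start": "2022-01-31",
    "target": [1.0, 2.0, 3.0, 4.0, 5.0],
    "feat_dynamic_real": [[0.1, 0.2, 0.3, 0.4, 0.5]],
    "item_id": "123",
}
train_record = make_train_record(test_record, prediction_length=2)
print(train_record["target"])  # [1.0, 2.0, 3.0]
```

Applying such a function through dataset.map would still write a second, shorter copy to the cache. An on-the-fly transform such as with_transform could in principle apply the truncation lazily and keep only the full-length data on disk, though it receives batches rather than single records, so the helper would need adapting.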