Correct formatting of Multi-Features Time Series dataset

admarcosai · July 17, 2023, 11:04am

Hello Everyone,
I am working on a time series forecasting task with a .csv file containing on the order of millions of rows. I am trying to use the TimeSeriesForecasting notebook to establish a baseline, and the following datasets notebook to put my dataframe in the appropriate format for a Hugging Face time series dataset.

The following is approximately the structure of my dataframe:

index (COHORT_MONTH): the index of the dataframe is the month of the year, so “freq=M”
MSISDN (item_id): int64 ====> unique phone number identifier
TOT_DATA_REV (target): float32 ====> the target variable: real dynamic feature
TOT_OUTG_REV: float32 ====> independent variable: real dynamic feature
TOT_VOICE_REV: float32 ====> independent variable: real dynamic feature
NB_OUTG_OFFNET_SMS: int16 ====> independent variable: dynamic feature
NB_OUTG_INTER_SMS: int16 ====> independent variable: dynamic feature

In reality the dataframe has 31 dynamic features in total.

Since all of the time series start at the same month and some have missing time steps, I used the following code from GluonTS pandasdataframe Tutorial

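import pandas as pd
from gluonts.dataset.pandas import PandasDataset

# latest timestamp across all series, so every series ends on the same month
max_end = max(df.groupby("MSISDN").apply(lambda _df: _df.index[-1]))
dfs_dict = {}
for item_id, gdf in df.groupby("MSISDN"):
    # monthly data (freq=M above): reindex on a monthly grid; missing months become NaN rows
    new_index = pd.date_range(gdf.index[0], end=max_end, freq="M")
    dfs_dict[item_id] = gdf.reindex(new_index).drop("MSISDN", axis=1)

# the target column in this dataframe is TOT_DATA_REV
ds = PandasDataset(dfs_dict, target="TOT_DATA_REV")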

My confusion comes from this code snippet in the Hugging Face tutorial:

from datasets import Dataset, Features, Value, Sequence

features  = Features(
    {    
        "start": Value("timestamp[s]"),
        "target": Sequence(Value("float32")),
        "feat_static_cat": Sequence(Value("uint64")),
        # "feat_static_real":  Sequence(Value("float32")),
        # "feat_dynamic_real": Sequence(Sequence(Value("uint64"))),
        # "feat_dynamic_cat": Sequence(Sequence(Value("uint64"))),
        "item_id": Value("string"),
    }
)

From my understanding my case would only be concerned with start, target, feat_dynamic_real.

NOW HERE ARE MY QUESTIONS
1- do I understand correctly that I have to create a dictionary with the keys: start: min(COHORT_MONTH), target: TOT_DATA_REV, feat_dynamic_real: [the 31 real dynamic features], item_id: MSISDN?
I tried doing the above and it is excruciatingly slow: about 10,000 MSISDNs per 40 minutes, when I have about 5 million.

2- could just creating a Feature for each of my features work?

lhoestq · July 17, 2023, 1:35pm

2- could just creating a Feature for each of my features work?

Yup that does the job and would fit the notebook better, otherwise you’d end up with a Sequence(Sequence(Value("float32"))) (31 * window_length floating values)

admarcosai · July 18, 2023, 11:15am

Trying to create a separate feature for each column kept giving me errors. It seems I didn’t understand how Dataset.from_list(dict_list, features=features) actually works.

I ended up using:

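from datasets import Dataset, Features, Value, Sequence

features = Features(
    {
        "start": Value("timestamp[s]"),
        "target": Sequence(Value("float32")),
        "item_id": Value("string"),
        # one inner sequence per dynamic feature; float32 because the features are real-valued
        "feat_dynamic_real": Sequence(Sequence(Value("float32"))),
    }
)

# dict_list: one dict per MSISDN with the keys declared above
dataset = Dataset.from_list(dict_list, features=features)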
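
For reference, one possible way to build a list like dict_list from the dataframe is sketched below. The helper name frame_to_records and the toy dataframe are hypothetical, and the sketch assumes the index holds month-end timestamps. It makes no claim about speed on 5 million series; it only shows the record shape, with feat_dynamic_real laid out as one inner list per feature.

```python
import pandas as pd


def frame_to_records(df, item_col, target_col, feature_cols):
    """Turn a long dataframe (one row per item and month) into one record per item."""
    end = df.index.max()  # every series is padded up to the latest month
    records = []
    for item_id, group in df.groupby(item_col, sort=False):
        group = group.sort_index()
        # Month-end grid from the series start to the global end;
        # months missing from the data become NaN rows after reindexing.
        full_index = pd.date_range(group.index[0], end, freq=pd.offsets.MonthEnd())
        group = group.reindex(full_index)
        records.append(
            {
                "start": full_index[0].to_pydatetime(),
                "target": group[target_col].to_numpy(dtype="float32").tolist(),
                "item_id": str(item_id),
                # transpose so the shape is (num_features, num_time_steps)
                "feat_dynamic_real": group[feature_cols].to_numpy(dtype="float32").T.tolist(),
            }
        )
    return records


# Toy data: item 2 has no row for February.
toy = pd.DataFrame(
    {
        "MSISDN": [1, 1, 1, 2, 2],
        "TOT_DATA_REV": [1.0, 2.0, 3.0, 4.0, 6.0],
        "TOT_OUTG_REV": [0.5, 0.5, 0.5, 1.5, 1.5],
        "TOT_VOICE_REV": [0.0, 1.0, 2.0, 3.0, 5.0],
    },
    index=pd.to_datetime(
        ["2022-01-31", "2022-02-28", "2022-03-31", "2022-01-31", "2022-03-31"]
    ),
)
records = frame_to_records(toy, "MSISDN", "TOT_DATA_REV", ["TOT_OUTG_REV", "TOT_VOICE_REV"])
```

The resulting list can then be passed to Dataset.from_list together with a matching Features definition.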

A Follow-Up Question
Is there a way to create train and test sets from the resulting Hugging Face dataset without storing the data twice? For example:
train_set: dataset(2022-01-31 to 2022-12-31)
test_set: dataset(2022-01-31 to 2023-04-30)

lhoestq · July 18, 2023, 5:22pm

You can use dataset.select(...) and pass the indices that you’d use for train or test

admarcosai · July 19, 2023, 3:24am

Actually, in the Time Series Transformer tutorial I am following, the test and train sets contain the same samples; only the test samples have more time steps than the train samples. This question might be more appropriate for the Transformers category, I think.
