Hello Everyone,
I am working on a time series forecasting task with a .csv file containing on the order of millions of rows. I am using the TimeSeriesForecasting notebook to establish a baseline, and the datasets notebook to put my dataframe into the appropriate format for a Hugging Face time series dataset.
The following is approximately the structure of my dataframe:
- index (COHORT_MONTH): the index of the dataframe is the month of the year, so “freq=M”
- MSISDN (item_id): int64 ====> unique phone number identifier
- TOT_DATA_REV (target): float32 ====> the target variable: real dynamic feature
- TOT_OUTG_REV: float32 ====> independent variable: real dynamic feature
- TOT_VOICE_REV: float32 ====> independent variable: real dynamic feature
- NB_OUTG_OFFNET_SMS: int16 ====> independent variable: dynamic feature
- NB_OUTG_INTER_SMS: int16 ====> independent variable: dynamic feature
In reality the dataframe has 31 dynamic features in total.
Since all of the time series start at the same month and some have missing time steps, I used the following code from the GluonTS PandasDataset tutorial:
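import pandas as pd
from gluonts.dataset.pandas import PandasDataset

# Latest timestamp across all series.
max_end = max(df.groupby("MSISDN").apply(lambda _df: _df.index[-1]))
dfs_dict = {}
for item_id, gdf in df.groupby("MSISDN"):
    # Monthly data, so freq="M" (the tutorial uses "1D"; use "MS" if the index holds month-start dates).
    new_index = pd.date_range(gdf.index[0], end=max_end, freq="M")
    # Reindexing inserts NaN rows for missing months; the id column is dropped.
    dfs_dict[item_id] = gdf.reindex(new_index).drop("MSISDN", axis=1)
ds = PandasDataset(dfs_dict, target="TOT_DATA_REV")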
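To make the reindexing step concrete, here is a minimal sketch on a single toy series with one month missing. The values are made up, and it assumes the index holds month-start timestamps (hence freq="MS"); the real COHORT_MONTH index may be stored differently.

```python
import pandas as pd

# One toy series with February missing (values are made up).
gdf = pd.DataFrame(
    {"TOT_DATA_REV": [1.0, 3.0]},
    index=pd.to_datetime(["2022-01-01", "2022-03-01"]),
)

# Build the full monthly index up to a common end date, then reindex.
# Months absent from the original series become NaN rows.
full_index = pd.date_range(gdf.index[0], end="2022-04-01", freq="MS")
filled = gdf.reindex(full_index)
print(filled)
```

After reindexing, the series has four rows, with NaN in the target for February and April.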
My confusion comes from this code snippet in the hugging face tutorial:
from datasets import Dataset, Features, Value, Sequence

features = Features(
    {
        "start": Value("timestamp[s]"),
        "target": Sequence(Value("float32")),
        "feat_static_cat": Sequence(Value("uint64")),
        # "feat_static_real": Sequence(Value("float32")),
        # "feat_dynamic_real": Sequence(Sequence(Value("uint64"))),
        # "feat_dynamic_cat": Sequence(Sequence(Value("uint64"))),
        "item_id": Value("string"),
    }
)
From my understanding, my case would only involve start, target, and feat_dynamic_real.
Now, my questions:
1- Do I understand correctly that I have to create a dictionary with the keys start: min(COHORT_MONTH), target: TOT_DATA_REV, feat_dynamic_real: [all 31 real dynamic features], and item_id: MSISDN?
I tried doing the above and it is excruciatingly slow: about 10,000 MSISDNs per 40 minutes, when I have about 5 million.
2- Could I instead just create a Feature for each of my features?
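For concreteness, the record layout described in question 1 can be sketched on toy data as follows. This is only a minimal sketch under stated assumptions: the values are made up, only two of the 31 dynamic features are used, and the `to_records` helper and the `TARGET` / `DYNAMIC` names are hypothetical, not part of either tutorial. It uses one `groupby` pass and stores `feat_dynamic_real` with shape (num_features, num_time_steps).

```python
import pandas as pd

# Toy stand-in for the real dataframe: two subscribers, three months,
# and only two of the dynamic features (all values are made up).
df = pd.DataFrame(
    {
        "MSISDN": [111, 111, 111, 222, 222, 222],
        "TOT_DATA_REV": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "TOT_OUTG_REV": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
        "NB_OUTG_OFFNET_SMS": [1, 0, 2, 3, 1, 0],
    },
    index=pd.to_datetime(["2022-01-01", "2022-02-01", "2022-03-01"] * 2),
)
df.index.name = "COHORT_MONTH"

TARGET = "TOT_DATA_REV"                            # hypothetical constant
DYNAMIC = ["TOT_OUTG_REV", "NB_OUTG_OFFNET_SMS"]   # hypothetical subset


def to_records(frame):
    """Build one dict per MSISDN in a single groupby pass (hypothetical helper)."""
    records = []
    for item_id, gdf in frame.sort_index().groupby("MSISDN"):
        records.append(
            {
                "start": gdf.index[0],
                "target": gdf[TARGET].to_numpy(dtype="float32"),
                # Transpose so rows are features and columns are time steps.
                "feat_dynamic_real": gdf[DYNAMIC].to_numpy(dtype="float32").T,
                "item_id": str(item_id),
            }
        )
    return records


records = to_records(df)
print(records[0]["item_id"], records[0]["feat_dynamic_real"].shape)
```

Each record holds one series, so the toy frame above yields two records, each with a (2, 3) dynamic-feature array.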