I am trying to use a Time Series Transformer on a huge multivariate dataset of 101,049,768 rows.
The dataset is structured as follows:
- MONTH: ranges from '2021-12-01' to '2023-04-01', i.e. 17 unique dates, each the first day of a month in that interval
- MSISDN: 5,944,104 unique phone numbers, each repeated for every month
- TOT_OUTG_REV: the total revenue (a real number) generated by the MSISDN in that MONTH
- TOT_DATA_REV: the total data revenue (a real number) generated by the MSISDN in that MONTH; this is the target variable
- NB_OUTG_SMS: the number of SMS (an int) sent by the MSISDN
- COMMUNE: the city (a categorical variable) the MSISDN is located in
- OCCUPATION: the profession (a categorical variable) of the MSISDN's owner
- REG_DATE: the registration date of the MSISDN
Here is a sample of the code I wrote to format the dataset:
CODE:
pers_loc_rev_sample_hf = []
start_date = pers_loc_rev["COHORT_MONTH"].min()
for i, msisdn in enumerate(pers_loc_rev_apr23_msisdn[8470:]):
    # filter the full DataFrame once per MSISDN and reuse the slice
    sub = pers_loc_rev[pers_loc_rev["MSISDN"] == msisdn]
    hf_dict = {}
    hf_dict["MSISDN"] = msisdn
    hf_dict["START_DATE"] = sub["COHORT_MONTH"].min()
    hf_dict["TARGET"] = sub["TOT_DATA_REV"].values
    hf_dict["FEAT_STATIC_CAT"] = []
    hf_dict["FEAT_DYNAMIC_REAL"] = []
    for dyn_col in dynamic_real_cols:
        hf_dict["FEAT_DYNAMIC_REAL"].append(sub[dyn_col].values)
    for stat_col in static_cat_cols:
        hf_dict["FEAT_STATIC_CAT"].append(sub[stat_col].values)
    # append before checking the break so the current example is not lost
    pers_loc_rev_sample_hf.append(hf_dict)
    if i == 10000:
        break
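One direction I have been sketching is to sort and group the frame once instead of re-filtering all 101M rows for every MSISDN. This is only an illustrative rewrite with the same (assumed) column names; the tiny demo frame at the bottom just stands in for pers_loc_rev:

```python
import pandas as pd

def build_examples(df, dynamic_real_cols, static_cat_cols, limit=None):
    """Build one example dict per MSISDN, grouping the frame a single time."""
    examples = []
    # sort once so each series is chronological, then iterate groups
    grouped = df.sort_values("COHORT_MONTH").groupby("MSISDN", sort=False)
    for i, (msisdn, sub) in enumerate(grouped):
        if limit is not None and i >= limit:
            break
        examples.append({
            "MSISDN": str(msisdn),
            "START_DATE": sub["COHORT_MONTH"].min(),
            "TARGET": sub["TOT_DATA_REV"].tolist(),
            "FEAT_DYNAMIC_REAL": [sub[c].tolist() for c in dynamic_real_cols],
            # static columns repeat every month, so keep a single value each
            "FEAT_STATIC_CAT": [str(sub[c].iloc[0]) for c in static_cat_cols],
        })
    return examples

# toy frame: two subscribers over two months, standing in for pers_loc_rev
demo = pd.DataFrame({
    "MSISDN": ["a", "a", "b", "b"],
    "COHORT_MONTH": pd.to_datetime(
        ["2021-12-01", "2022-01-01", "2021-12-01", "2022-01-01"]),
    "TOT_DATA_REV": [1.0, 2.0, 3.0, 4.0],
    "TOT_OUTG_REV": [5.0, 6.0, 7.0, 8.0],
    "COMMUNE": ["X", "X", "Y", "Y"],
})
ex = build_examples(demo, ["TOT_OUTG_REV"], ["COMMUNE"])
```

Note that taking a single value per static column (rather than the whole repeated array) is what the Sequence(Value("string")) feature below actually expects.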
from datasets import Dataset, Features, Value, Sequence

features = Features(
    {
        "START_DATE": Value("timestamp[s]"),
        "TARGET": Sequence(Value("float32")),
        "FEAT_STATIC_CAT": Sequence(Value("string")),
        "FEAT_DYNAMIC_REAL": Sequence(Sequence(Value("float32"))),
        "MSISDN": Value("string"),
    }
)

dataset = Dataset.from_list(pers_loc_rev_sample_hf, features=features)
This code is extremely slow on a 101,049,768-row dataframe; does anyone have a recommendation on how to do this more efficiently?