Efficiently Format Big DataFrame for Ingestion into Time Series Transformer

I am trying to use a Time Series Transformer on a huge multivariate dataset of 101,049,768 rows.
The dataset is structured as follows:

  • MONTH: ranges from '2021-12-01' to '2023-04-01', with 17 unique dates, each being the first day of a month in that interval
  • MSISDN: includes 5,944,104 unique phone numbers, each repeated for every month
  • TOT_OUTG_REV: the total revenue (a real number) generated by the MSISDN in the particular MONTH
  • TOT_DATA_REV: the total data revenue (a real number) generated by the MSISDN in the particular MONTH; this is the target variable
  • NB_OUTG_SMS: the number of SMS (an int) sent by the MSISDN
  • COMMUNE: the city (a categorical variable) the MSISDN is located in
  • OCCUPATION: the profession (a categorical variable) of the MSISDN
  • REG_DATE: the registration date of the MSISDN
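To make the schema concrete, here is a toy frame with the same columns (all values invented for illustration; two MSISDNs over three months):

```python
import pandas as pd

# Toy frame matching the schema above: 2 MSISDNs x 3 months = 6 rows, 8 columns.
months = pd.to_datetime(["2021-12-01", "2022-01-01", "2022-02-01"])
df = pd.DataFrame({
    "MONTH": list(months) * 2,
    "MSISDN": ["22100000001"] * 3 + ["22100000002"] * 3,
    "TOT_OUTG_REV": [10.5, 12.0, 9.8, 3.2, 4.1, 5.0],
    "TOT_DATA_REV": [2.1, 2.5, 1.9, 0.5, 0.7, 0.9],   # target variable
    "NB_OUTG_SMS": [14, 9, 11, 3, 5, 2],
    "COMMUNE": ["CITY_A"] * 3 + ["CITY_B"] * 3,
    "OCCUPATION": ["teacher"] * 3 + ["driver"] * 3,
    "REG_DATE": [pd.Timestamp("2020-06-15")] * 3 + [pd.Timestamp("2021-03-02")] * 3,
})
print(df.shape)  # (6, 8)
```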

Here is a sample of the code I wrote to format the dataset:

pers_loc_rev_sample_hf = []
start_date = pers_loc_rev["COHORT_MONTH"].min()
for i, msisdn in enumerate(pers_loc_rev_apr23_msisdn[8470:]):
    # Filter the full dataframe once per MSISDN (this boolean mask is the slow part)
    msisdn_df = pers_loc_rev[pers_loc_rev["MSISDN"] == msisdn]
    hf_dict = {}
    hf_dict["MSISDN"] = msisdn
    hf_dict["START_DATE"] = msisdn_df["COHORT_MONTH"].min()
    hf_dict["TARGET"] = msisdn_df["TOT_DATA_REV"].values
    hf_dict["FEAT_STATIC_CAT"] = []
    hf_dict["FEAT_DYNAMIC_REAL"] = []
    for dyn_col in dynamic_real_cols:
        hf_dict["FEAT_DYNAMIC_REAL"].append(msisdn_df[dyn_col].values)
    for stat_col in static_cat_cols:
        hf_dict["FEAT_STATIC_CAT"].append(msisdn_df[stat_col].iloc[0])
    pers_loc_rev_sample_hf.append(hf_dict)
    if i == 10000:
        break
from datasets import Dataset, Features, Value, Sequence

features = Features({
    "START_DATE": Value("timestamp[s]"),
    "TARGET": Sequence(Value("float32")),
    "FEAT_STATIC_CAT": Sequence(Value("string")),
    "FEAT_DYNAMIC_REAL": Sequence(Sequence(Value("float32"))),
    "MSISDN": Value("string"),
})

dataset = Dataset.from_list(pers_loc_rev_sample_hf, features=features)

This code is extremely slow for a 101,049,768-row dataframe. Does anyone have a recommendation on how to do this efficiently?
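For comparison, here is a sketch of replacing the one-boolean-mask-per-MSISDN pattern with a single groupby pass (column names assumed from above; shown on a tiny invented frame):

```python
import pandas as pd

# Tiny invented frame standing in for pers_loc_rev (columns assumed from the question).
pers_loc_rev = pd.DataFrame({
    "COHORT_MONTH": pd.to_datetime(["2021-12-01", "2022-01-01"] * 2),
    "MSISDN": ["A", "A", "B", "B"],
    "TOT_DATA_REV": [1.0, 2.0, 3.0, 4.0],
    "NB_OUTG_SMS": [5, 6, 7, 8],
})

# One groupby pass over the whole frame instead of one boolean mask per MSISDN.
grouped = (
    pers_loc_rev.sort_values(["MSISDN", "COHORT_MONTH"])
    .groupby("MSISDN")
    .agg(
        START_DATE=("COHORT_MONTH", "min"),
        TARGET=("TOT_DATA_REV", list),       # per-MSISDN target series
        NB_OUTG_SMS=("NB_OUTG_SMS", list),   # one list per dynamic real column
    )
    .reset_index()
)
records = grouped.to_dict("records")  # list of dicts, ready for Dataset.from_list
print(records[0]["TARGET"])  # [1.0, 2.0]
```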