I am trying to use a Time Series Transformer on a huge multivariate dataset of 101,049,768 rows.
The dataset is structured as follows:
- MONTH: ranges from '2021-12-01' to '2023-04-01', i.e. 17 unique dates, each the first day of a month in that interval
- MSISDN: 5,944,104 unique phone numbers, each repeated for every month
- TOT_OUTG_REV: the total revenue (a real number) generated by the MSISDN in that MONTH
- TOT_DATA_REV: the total data revenue (a real number) generated by the MSISDN in that MONTH; this is the target variable
- NB_OUTG_SMS: the number of SMS (an int) sent by the MSISDN
- COMMUNE: the city (a categorical variable) the MSISDN is located in
- OCCUPATION: the profession (a categorical variable) of the MSISDN's owner
- REG_DATE: the registration date of the MSISDN
Here is a sample of the code I wrote to format the dataset:
CODE:
pers_loc_rev_sample_hf = []
start_date = pers_loc_rev["COHORT_MONTH"].min()
for i, msisdn in enumerate(pers_loc_rev_apr23_msisdn[8470:]):
    # filter the full DataFrame once per MSISDN and reuse the slice
    sub = pers_loc_rev[pers_loc_rev["MSISDN"] == msisdn]
    hf_dict = {}
    hf_dict["MSISDN"] = msisdn
    hf_dict["START_DATE"] = sub["COHORT_MONTH"].min()
    hf_dict["TARGET"] = sub["TOT_DATA_REV"].values
    hf_dict["FEAT_STATIC_CAT"] = []
    hf_dict["FEAT_DYNAMIC_REAL"] = []
    for dyn_col in dynamic_real_cols:
        hf_dict["FEAT_DYNAMIC_REAL"].append(sub[dyn_col].values)
    for stat_col in static_cat_cols:
        hf_dict["FEAT_STATIC_CAT"].append(sub[stat_col].values)
    # append before checking the break so the current example is not lost
    pers_loc_rev_sample_hf.append(hf_dict)
    if i == 10000:
        break
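One direction I have been sketching is to sort and group the frame once instead of re-filtering all 101M rows for every MSISDN. This is only an illustrative rewrite with the same (assumed) column names; the tiny demo frame at the bottom just stands in for pers_loc_rev:

```python
import pandas as pd

def build_examples(df, dynamic_real_cols, static_cat_cols, limit=None):
    """Build one example dict per MSISDN, grouping the frame a single time."""
    examples = []
    # sort once so each series is chronological, then iterate groups
    grouped = df.sort_values("COHORT_MONTH").groupby("MSISDN", sort=False)
    for i, (msisdn, sub) in enumerate(grouped):
        if limit is not None and i >= limit:
            break
        examples.append({
            "MSISDN": str(msisdn),
            "START_DATE": sub["COHORT_MONTH"].min(),
            "TARGET": sub["TOT_DATA_REV"].tolist(),
            "FEAT_DYNAMIC_REAL": [sub[c].tolist() for c in dynamic_real_cols],
            # static columns repeat every month, so keep a single value each
            "FEAT_STATIC_CAT": [str(sub[c].iloc[0]) for c in static_cat_cols],
        })
    return examples

# toy frame: two subscribers over two months, standing in for pers_loc_rev
demo = pd.DataFrame({
    "MSISDN": ["a", "a", "b", "b"],
    "COHORT_MONTH": pd.to_datetime(
        ["2021-12-01", "2022-01-01", "2021-12-01", "2022-01-01"]),
    "TOT_DATA_REV": [1.0, 2.0, 3.0, 4.0],
    "TOT_OUTG_REV": [5.0, 6.0, 7.0, 8.0],
    "COMMUNE": ["X", "X", "Y", "Y"],
})
ex = build_examples(demo, ["TOT_OUTG_REV"], ["COMMUNE"])
```

Note that taking a single value per static column (rather than the whole repeated array) is what the Sequence(Value("string")) feature below actually expects.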
from datasets import Dataset, Features, Value, Sequence

features = Features(
    {
        "START_DATE": Value("timestamp[s]"),
        "TARGET": Sequence(Value("float32")),
        "FEAT_STATIC_CAT": Sequence(Value("string")),
        "FEAT_DYNAMIC_REAL": Sequence(Sequence(Value("float32"))),
        "MSISDN": Value("string"),
    }
)

dataset = Dataset.from_list(pers_loc_rev_sample_hf, features=features)
This code is extremely slow on a 101,049,768-row dataframe; does anyone have a recommendation on how to do this more efficiently?