How can I split a Dataset using a timestamp feature?

Hello,

I have time series data in a CSV file, which I am loading with the following code.

from datasets import load_dataset

dataset = load_dataset("csv", data_files="mobile_4hr.csv")

I know that other datasets can be split randomly with ds.train_test_split(test_size=0.3).

But for time series data, how can I split into train and test at a specific date?

I see two solutions:

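  • cast the date column to Value("date64"), sort by that column, and split the dataset by position
    from datasets import load_dataset, Value
    from datetime import datetime
    # split="train" returns a Dataset rather than a DatasetDict
    ds = load_dataset(..., split="train")
    # strptime pattern for the date column; this value is only an example
    date_format = "%Y-%m-%d %H:%M:%S"
    features = ds.features.copy()
    features["date"] = Value("date64")
    ds = ds.map(lambda ex: {"date": datetime.strptime(ex["date"], date_format)}, features=features)
    # sort chronologically so the earliest rows end up in the train split
    ds = ds.sort("date")
    train_size = 0.7
    split_at = int(train_size * len(ds))
    train_ds = ds.select(range(split_at))
    test_ds = ds.select(range(split_at, len(ds)))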
    
  • iterate over the dataset rows and collect each row's index into train_ds_idx or test_ds_idx, based on the value of date, then run train_ds = ds.select(train_ds_idx) and test_ds = ds.select(test_ds_idx) to build the splits
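
The second solution could look roughly like the sketch below. It is only an illustration: the helper name split_indices_by_date, the date format, the cutoff date, and the toy date list are all assumptions, and it assumes the column is called "date" and holds strings. The helper is plain Python so it can be tried without loading anything.

```python
from datetime import datetime


def split_indices_by_date(dates, cutoff, date_format="%Y-%m-%d %H:%M:%S"):
    """Return (train_ds_idx, test_ds_idx) for an iterable of date strings.

    Rows dated strictly before `cutoff` go to train, the rest go to test.
    The default date_format is an assumption; adjust it to your column.
    """
    train_ds_idx, test_ds_idx = [], []
    for i, raw in enumerate(dates):
        when = datetime.strptime(raw, date_format)
        if when < cutoff:
            train_ds_idx.append(i)
        else:
            test_ds_idx.append(i)
    return train_ds_idx, test_ds_idx


# Toy stand-in for ds["date"]; the rows do not need to be sorted.
dates = [
    "2023-01-01 00:00:00",
    "2023-01-02 04:00:00",
    "2023-01-01 20:00:00",
    "2023-01-03 08:00:00",
]
cutoff = datetime(2023, 1, 2)  # hypothetical split date
train_ds_idx, test_ds_idx = split_indices_by_date(dates, cutoff)

# With a loaded Dataset `ds`, the same helper would be used like this:
# train_ds_idx, test_ds_idx = split_indices_by_date(ds["date"], cutoff)
# train_ds = ds.select(train_ds_idx)
# test_ds = ds.select(test_ds_idx)
```

Unlike the first solution, this one splits at an exact date rather than at a fixed fraction of the rows, and it does not require sorting first. Dataset.filter with a date comparison should give the same two splits if you prefer not to handle indices yourself.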