How can I split the Dataset using timestamp feature

sociopath00 · June 16, 2023, 6:41am

Hello,

I have time series data in a CSV file, which I am loading with the following code.

from datasets import load_dataset

dataset = load_dataset("csv", data_files="mobile_4hr.csv")

I know that other datasets can be split randomly with ds.train_test_split(test_size=0.3).

But for time series data, how can I split into train and test sets at a specific date?

mariosasko · June 16, 2023, 4:27pm

I see two solutions:

Cast the date column to Value("date64"), sort the dataset by that column, and split it by position:

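from datasets import load_dataset, Value
from datetime import datetime

# split="train" returns a Dataset instead of a DatasetDict, so .features and .select work directly
ds = load_dataset(..., split="train")
date_format = "%Y-%m-%d %H:%M:%S"  # assumption: adjust to match the strings in your "date" column
features = ds.features.copy()
features["date"] = Value("date64")
# parse the date strings and store them under the new feature type
ds = ds.map(lambda ex: {"date": datetime.strptime(ex["date"], date_format)}, features=features)
ds = ds.sort("date")
# the first 70% of the sorted rows become train, the remainder test
train_size = 0.7
split_idx = int(train_size * len(ds))
train_ds = ds.select(range(split_idx))
test_ds = ds.select(range(split_idx, len(ds)))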

Iterate over the dataset rows and collect each row's index into train_ds_idx or test_ds_idx based on its date value, then run train_ds = ds.select(train_ds_idx) and test_ds = ds.select(test_ds_idx) to build the splits.
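
The index-based approach could look roughly like the sketch below. It is only an illustration: the helper name split_indices_by_date, the cutoff date, the date format and the toy list standing in for the "date" column are all assumptions, not part of the datasets library.

```python
from datetime import datetime


def split_indices_by_date(date_strings, cutoff, date_format="%Y-%m-%d %H:%M:%S"):
    """Return (train_idx, test_idx): rows dated before `cutoff` go to train, the rest to test.

    Hypothetical helper; `date_format` is an assumed format and must match your data.
    """
    train_idx, test_idx = [], []
    for i, value in enumerate(date_strings):
        if datetime.strptime(value, date_format) < cutoff:
            train_idx.append(i)
        else:
            test_idx.append(i)
    return train_idx, test_idx


# Toy stand-in for the values of the "date" column (4-hour spacing is an assumption)
dates = [
    "2023-01-01 00:00:00",
    "2023-01-01 04:00:00",
    "2023-01-02 00:00:00",
    "2023-01-02 04:00:00",
]

# Everything before 2023-01-02 goes to train, everything from that date on goes to test
train_idx, test_idx = split_indices_by_date(dates, cutoff=datetime(2023, 1, 2))
print(train_idx, test_idx)
```

With a real dataset you would pass ds["date"] instead of the toy list and then build the splits with ds.select(train_idx) and ds.select(test_idx). Unlike the sorted 70/30 split, this does not require sorting or casting the column first, since each row is compared to the cutoff on its own.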
