GluonTS notebook for correctly formatting Time Series Datasets for the Hub

I'm going to try to keep this short and sweet. I have intraday price data for every equity in the S&P 500 down to M5 (5-minute bars) going back 2-3 years. The data is in CSV format. There are gaps in the time series where trading was closed.

I want to contribute to the hub and add all the data there for anyone to use.

I cannot for the life of me figure out how to properly format it into a dataset that makes sense. Yes, I have read all the GluonTS docs that I can find, and have been using the code posted here (notebooks/time_series_datasets.ipynb at main · huggingface/notebooks · GitHub) as a reference.
Been working on this problem for over 80 hours now and I’m embarrassed to say I haven’t figured it out yet.
I understand most of the requirements of the functions in the GluonTS notebook linked above and GluonTS PandasDatasets because I’ve tripped every error message possible while trying to hack this out.

I know it should be a simple thing to do and hope someone smarter than me can help:

This is the format of the data while it's in CSV.
I've gotten modified versions of the linked notebook to run without error, but I can't figure out a version that results in a dataset that makes sense for actually training a model.

The final dataset should have a start key entry for each row in the time series. I've been trying to get the o, h, l, c, v values into the target field, though this may be wrong, because the model should probably be predicting the next-day close delta and the other fields should just be input features to train on. I don't know for sure.
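
For reference, this is roughly the kind of per-series entry I think GluonTS expects (just a sketch with made-up file and column names, assuming close is the target and the other OHLCV columns are extra features):

```python
import pandas as pd

# hypothetical file and column names; adjust to the real CSV layout
df = pd.read_csv("AAPL_5min.csv", parse_dates=["timestamp"])

entry = {
    # one start timestamp per series (not per row), at the series frequency
    "start": pd.Period(df["timestamp"].iloc[0], freq="5T"),
    # the values the model should predict, e.g. the close prices
    "target": df["close"].to_numpy(),
    # remaining columns as dynamic features, shape (num_features, series_length)
    "feat_dynamic_real": df[["open", "high", "low", "volume"]].to_numpy().T,
    "item_id": "AAPL",
}
```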

This is the output format I'm trying to produce, for reference.

Any help would be greatly appreciated. In return I'll offer you your pick of price history data, any market, US or international, down to the 5m timeframe and up to 3 years back.

There's not a lot of point in training on intraday data more than five years back because of regime change, but daily data is available too, over a much larger time range.

Hi! You can reuse the code from monash_tsf.py · monash_tsf at main

It loads data from TSF files using the functions in utils.py, but you could implement a similar function that reads the data from CSV instead. Maybe @kashif can help explain the output format of the TSF parser?
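
For example, a rough sketch of a CSV reader that yields one series per file (the "timestamp" and "close" column names are placeholders for whatever your CSVs actually contain):

```python
import glob
import pandas as pd

def read_csv_series(pattern="data/*.csv", freq="5T", target_col="close"):
    """Yield one GluonTS-style entry per CSV file."""
    for path in glob.glob(pattern):
        df = pd.read_csv(path, parse_dates=["timestamp"]).sort_values("timestamp")
        yield {
            "start": pd.Period(df["timestamp"].iloc[0], freq=freq),
            "target": df[target_col].to_numpy(),
            "item_id": path,  # e.g. the ticker symbol
        }
```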

Yes sure let me have a look and get back to you!

I'm also interested in such a notebook, especially in how, once you split the data into train, validation, and test, you load that as a dataset to push to the Hub.

Hi,

We do have a notebook on that here: https://github.com/huggingface/notebooks/blob/main/examples/time_series_datasets.ipynb

It shows how to convert your custom data into a Hugging Face dataset.
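
Roughly speaking (this is only a sketch, not the exact code from the notebook, and the feature schema, column names, and repo id are placeholders), the conversion looks something like this:

```python
import glob
import pandas as pd
from datasets import Dataset, Features, Sequence, Value

# one dataset row per CSV file / ticker
features = Features({
    "start": Value("timestamp[s]"),
    "target": Sequence(Value("float32")),
    "item_id": Value("string"),
})

def gen():
    for path in glob.glob("data/*.csv"):
        df = pd.read_csv(path, parse_dates=["timestamp"]).sort_values("timestamp")
        yield {
            "start": df["timestamp"].iloc[0],
            "target": df["close"].tolist(),
            "item_id": path,
        }

ds = Dataset.from_generator(gen, features=features)
ds.push_to_hub("your-username/sp500-5min")  # hypothetical repo id
```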

Hi, thank you very much for your answer.
I'm familiar with that notebook; I used it to convert my custom time series dataset into a Hugging Face dataset. However, my question is: how do you split the resulting dataset into training, validation, and test, and save that back as a dataset object containing the three splits?

Right, so the way I typically do the splits is in the "back-testing" setting: the validation set has the same time series as the train split but with, say, "prediction_length" more values into the future, and the test set has the same time series as train/val but "prediction_length" more than the validation set. The test set can also contain more of these rolling windows, where each window is "prediction_length" further out, etc.
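
A rough sketch of that setup with the datasets library (the toy series, horizon, and repo id below are just placeholders):

```python
import numpy as np
from datasets import Dataset, DatasetDict

prediction_length = 24  # example forecast horizon

# toy entries standing in for the real series
full = [
    {"start": "2021-01-04 09:30:00", "target": np.random.rand(500).tolist(), "item_id": "AAPL"},
    {"start": "2021-01-04 09:30:00", "target": np.random.rand(500).tolist(), "item_id": "MSFT"},
]

def drop_last(entries, n):
    # same time series, with the final n values removed
    if n == 0:
        return entries
    return [{**e, "target": e["target"][:-n]} for e in entries]

splits = DatasetDict({
    "train": Dataset.from_list(drop_last(full, 2 * prediction_length)),   # shortest series
    "validation": Dataset.from_list(drop_last(full, prediction_length)),  # train + prediction_length
    "test": Dataset.from_list(drop_last(full, 0)),                        # validation + prediction_length
})
splits.push_to_hub("your-username/my-ts-dataset")  # hypothetical repo id
```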
