Datasets 2.1.0
Python 3.9.7
I have a dataset of 4 million time series examples, where each time series has length 800. They are stored on disk in individual files. I would like to create an HF Dataset object for this data. Here is what I do currently:
import pickle

import pandas as pd
from datasets import Dataset

file_counter = 0
dicts_list = []

with open(my_listfiles_path, 'r') as list_file:
    for data_file in list_file:
        full_data_path = '/'.join([my_listfiles_path, data_file.strip()])
        with open(full_data_path, 'rb') as f:
            data_dict = pickle.load(f)
        dicts_list.append(dict(data_dict))  # shallow copy of the loaded dict
        if file_counter % 1_000 == 0:
            orig_df = pd.DataFrame.from_records(dicts_list)
            if file_counter == 0:
                my_dataset = Dataset.from_pandas(orig_df)
            else:
                for row in orig_df.itertuples(index=False):
                    tmp_row_dict = {'time_data': row[1], 'label': row[0]}
                    # add_item returns a new Dataset rather than modifying in place
                    my_dataset = my_dataset.add_item(tmp_row_dict)
            del dicts_list
            del orig_df
            dicts_list = []
        file_counter += 1

# Flush whatever is left over after the last full chunk.
orig_df = pd.DataFrame.from_records(dicts_list)
for row in orig_df.itertuples(index=False):
    tmp_row_dict = {'time_data': row[1], 'label': row[0]}
    my_dataset = my_dataset.add_item(tmp_row_dict)
Periodically deleting the orig_df and dicts_list objects seems to keep the memory usage down. If I try to loop through all the files and store them in a single pandas DataFrame before converting it into a Dataset object, I run out of memory.
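In case it helps clarify the memory issue, here is a rough sketch of an alternative I have been wondering about: flushing each chunk to a Parquet file on disk instead of holding it in memory, then loading everything back with load_dataset, which memory-maps the Arrow data. The chunk size and output directory are just placeholders, and it reuses my path/column layout from above, so please treat it as a sketch rather than working code:

import os
import pickle

import pandas as pd
from datasets import load_dataset

CHUNK_SIZE = 10_000              # placeholder; would tune to available memory
parquet_dir = 'parquet_chunks'   # placeholder output directory
os.makedirs(parquet_dir, exist_ok=True)

def flush_chunk(records, chunk_idx):
    # Write one chunk of records to its own Parquet file so it can be freed.
    pd.DataFrame.from_records(records).to_parquet(
        f'{parquet_dir}/chunk_{chunk_idx:05d}.parquet')

dicts_list = []
chunk_counter = 0

with open(my_listfiles_path, 'r') as list_file:
    for file_counter, data_file in enumerate(list_file):
        full_data_path = '/'.join([my_listfiles_path, data_file.strip()])
        with open(full_data_path, 'rb') as f:
            dicts_list.append(pickle.load(f))
        if (file_counter + 1) % CHUNK_SIZE == 0:
            flush_chunk(dicts_list, chunk_counter)
            dicts_list = []
            chunk_counter += 1

if dicts_list:  # flush the final partial chunk
    flush_chunk(dicts_list, chunk_counter)

# load_dataset memory-maps the Parquet/Arrow data, so (as I understand it)
# the full 4M x 800 dataset should not have to fit in RAM at once.
my_dataset = load_dataset(
    'parquet', data_files=f'{parquet_dir}/*.parquet', split='train')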
My question is: is there a better, more succinct, best-practices way to do this? The files are custom time series data, so they can't be downloaded from anywhere, which makes me think that writing a loading script is the wrong approach (although I could be wrong).
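And in case a loading script really is the wrong fit, here is a minimal sketch of the generator-based alternative I am picturing. It assumes a datasets version that provides Dataset.from_generator (which I believe was added in a release newer than the 2.1.0 I have installed) and reuses the 'label'/'time_data' keys from my code above:

import pickle

from datasets import Dataset

def example_generator():
    # Yield one example dict at a time so that only a single pickle file
    # is ever held in memory.
    with open(my_listfiles_path, 'r') as list_file:
        for data_file in list_file:
            full_data_path = '/'.join([my_listfiles_path, data_file.strip()])
            with open(full_data_path, 'rb') as f:
                data_dict = pickle.load(f)
            yield {'label': data_dict['label'],        # assumed key names
                   'time_data': data_dict['time_data']}

# from_generator writes the examples into an Arrow cache file on disk
# as they are produced, so the whole dataset never sits in RAM at once.
my_dataset = Dataset.from_generator(example_generator)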
Any suggestions are welcomed and much appreciated. Thanks in advance for your help!