Creating a Dataset object from large pandas dataframe

Datasets 2.1.0
Python 3.9.7

I have a dataset of 4 million time series examples where each time series is of length 800. They are stored on disk in individual files. I would like to create a HF Datasets object for this dataset. Here is what I do currently:

import os
import pickle

import pandas as pd
from datasets import Dataset

file_counter = 0
dicts_list = []
with open(my_listfiles_path, 'r') as list_file:
    for data_file in list_file:
        # Data files live in the same directory as the list file
        full_data_path = os.path.join(os.path.dirname(my_listfiles_path), data_file.strip())
        with open(full_data_path, 'rb') as fh:
            data_dict = pickle.load(fh)
        dicts_list.append(data_dict)
        if file_counter % 1_000 == 0:
            orig_df = pd.DataFrame.from_records(dicts_list)

            if file_counter == 0:
                my_dataset = Dataset.from_pandas(orig_df)
            else:
                for row in orig_df.itertuples(index=False):
                    tmp_row_dict = {
                        'time_data': row[1],
                        'label': row[0]}
                    # add_item returns a new Dataset rather than mutating in place
                    my_dataset = my_dataset.add_item(tmp_row_dict)

            del dicts_list
            del orig_df
            dicts_list = []
        file_counter += 1

# Flush the remaining examples that never reached a 1,000-file boundary
orig_df = pd.DataFrame.from_records(dicts_list)
for row in orig_df.itertuples(index=False):
    tmp_row_dict = {
        'time_data': row[1],
        'label': row[0]}

    my_dataset = my_dataset.add_item(tmp_row_dict)

Periodically deleting the orig_df and dicts_list objects seems to keep memory usage down. If I instead try to loop through all the files and store them in a single pandas dataframe before converting it into a Datasets object, I run out of memory.

My question is: is there a better, more succinct, best-practices way to do this? The files are custom time series data, so there is nothing to download, which makes me think creating a loading script is the wrong approach (although I could be wrong).

Any suggestions are welcomed and much appreciated. Thanks in advance for your help!

Hi! add_item also keeps the items in memory, so the loading script approach is the right one. We plan to add Dataset.from_generator to the API soon to make loading a dataset from a simple generator more convenient.

@mariosasko Thank you for your response. Do you have a reference for creating a loading script that is similar to my use case outlined initially?

I’m still a bit confused about how to do this properly. Allow me to describe the structure of my data and how I am going about writing the loading script.

I have 4 million data files in pickle format. Each file represents a single example. Going through the dataset script documentation (Create a dataset loading script) and following an example template (datasets/new_dataset_script.py at main · huggingface/datasets · GitHub), it looks to me like instead of having 4 million individual files, one should probably have a single file that contains all 4 million examples. I have done just that: looped through all the individual files and written the examples into a single csv file.

I say this because the _generate_examples method is a generator, and it is not clear to me how to yield examples from millions of individual files. The aggregated single file can be thought of conceptually as a pandas dataframe.

What I’m confused about is how the HF dataset creation happens. The structure of my data directory is as follows:

project
|---proto_data
    |   my_big_data.csv
    |   proto_data.py

my_big_data.csv is the single large data file mentioned above. proto_data.py is the dataset loading script, shown below:

import csv

import datasets
from datasets import DownloadManager
from typing import List

_DESCRIPTION = 'my dataset loading script'
_URLS = {
    'csv_file': 'path/to/project/proto_data/my_big_data.csv'
}

class ProtoData(datasets.GeneratorBasedBuilder):

    def _info(self):
        features = datasets.Features(
            {
                'from': datasets.Value('string')
            }
        )

        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            features=features
        )

    def _split_generators(self, dl_manager: DownloadManager) -> List[datasets.SplitGenerator]:
        urls_to_use = _URLS
        downloaded_files = dl_manager.download_and_extract(urls_to_use)
        
        return [
            datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={'filepath': downloaded_files['csv_file']})
        ]

    def _generate_examples(self, filepath):
        with open(filepath, encoding='utf-8') as f:
            # DictReader parses the header row and yields one dict per CSV row
            reader = csv.DictReader(f)
            for key, row in enumerate(reader):
                # The yielded keys must match the features declared in _info
                yield key, {'from': row['from']}

At the beginning of the dataset loading script documentation, it states

Any dataset script, for example my_dataset.py, can be placed in a folder or a repository named my_dataset and be loaded

For me, this would look like:

from datasets import load_dataset
load_dataset("path/to/project/proto_data")

While at the end in the Run the tests section, the documentation states

If both tests pass, your dataset was generated correctly!

Naively, I am guessing that running the tests generates the entire dataset in "path/to/project/proto_data", and that load_dataset("path/to/project/proto_data") loads the dataset when you are ready to start using it for training or fine-tuning. Is this correct? Also, is it standard practice to rerun the tests each time you change the dataset loading script (e.g., if I add another element to the yielded dictionary, like data['feature_A'])?

Thank you in advance for all of your help, I greatly appreciate it!