Is there a suggested way of debugging dataset generators?

Hi,

The Datasets Hub is very helpful in providing a lot of existing datasets. However, sometimes I need to use a dataset with a different format, which is determined by the _generate_examples() method. Is there a way to set a breakpoint in the IDE and debug this method directly? This seems complicated to me, since the dataset scripts are copied to a dynamic path (containing hash codes) before being loaded, so breakpoints set in the original file never trigger a pause at all.

Thanks in advance,
Shao


Right now you can only set breakpoints in the copy of the script that is located in the HF datasets modules cache (by default at ~/.cache/huggingface/modules/datasets_modules), because this is where the script is imported from.
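To find that copy quickly, you can search the modules cache from Python. This is a minimal stdlib sketch; the helper name find_cached_scripts and the default cache path are assumptions (override cache_root if your HF_MODULES_CACHE points elsewhere):

```python
from pathlib import Path

def find_cached_scripts(dataset_name, cache_root=None):
    """Locate copies of a dataset script inside the HF modules cache.

    Scripts are copied to <cache>/datasets/<name>/<hash>/<name>.py,
    so we glob recursively for <name>.py under the datasets/ subtree.
    """
    root = Path(cache_root) if cache_root else (
        Path.home() / ".cache/huggingface/modules/datasets_modules"
    )
    return sorted(root.glob(f"datasets/**/{dataset_name}.py"))
```

Open the returned path(s) in your IDE and set the breakpoints there, not in the original file.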


Another approach would be to clone the dataset repository and work on (and debug) your files locally:

  1. Clone
    git clone https://huggingface.co/datasets/<dataset_namespace>/<dataset_name> /path/to/some/local/directory
    
  2. Edit your local file at /path/to/some/local/directory/<dataset_name>
  3. Test loading from your local file
    ds = load_dataset("/path/to/some/local/directory/<dataset_name>",...)
    

To debug an existing script with minimal changes, and to make it easily debuggable for everyone thereafter, you can simply work around the abstraction in datasets.GeneratorBasedBuilder.

:handshake: Tip: if you add this simple edit to your own dataset loading scripts, they become directly debuggable for everyone without affecting the normal operation of load_dataset().

Steps:

  1. download the original dataset script you want to debug/edit
  2. add print(filepaths) to _generate_examples() to print the cached file paths (under .cache/huggingface/) needed to locate the data in debug mode
  3. call the original script once to download the data and print the data file names
from datasets import load_dataset
# this will download the data once, cache it and print the location of the cached files
load_dataset('the_loading_script_name', 'the-config')
  4. add a call_generate_examples() method to the datasets.GeneratorBasedBuilder subclass in the script
  5. create an instance of the NewDataset(datasets.GeneratorBasedBuilder) class and call call_generate_examples(file_paths)

Minimal code example:

# some_dataset_reader.py
# in the dataset class (that inherits from datasets.GeneratorBasedBuilder),
# call the concrete implementation of _generate_examples
class NewDataset(datasets.GeneratorBasedBuilder):

    # STEP 2: print the file paths in the cache, needed for step 5
    def _generate_examples(self, filepaths):
        print("Cached PATHS -- copy into STEP 5:", filepaths)
        ...  # rest unchanged, or a temporary exit() for step 3

    # STEP 4: add this method to call _generate_examples(); calling the method
    # on the base class directly would hit the abstract (empty) _generate_examples()
    def call_generate_examples(self, filepaths):
        # initialize the generator and drain it,
        # passing in the paths from the cache (see STEP 2)
        return [_ for _ in self._generate_examples(filepaths)]

# STEP 5: add a main block, which is ignored during load_dataset()
# start the debugger here / in this .py file
if __name__ == '__main__':
    print("DEBUGGING")
    ds = NewDataset(config_name='the_config.name')  # params as required by the original dataset; optional parameters are ignored
    # paste in what print("Cached PATHS -- copy into STEP 5:", filepaths) printed on the terminal
    # mind the nesting ([], [[]]) to match the original implementation
    examples = ds.call_generate_examples([['/.../.cache/huggingface/datasets/downloads/somelonghash', '/.../.cache/huggingface/datasets/downloads/extracted/...']])
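Why the wrapper method is needed can be seen in isolation: _generate_examples() is declared abstract on the base class, so it must be invoked through an instance of the concrete subclass, and the generator must be drained for breakpoints inside it to fire. A self-contained toy version of the same pattern (no datasets dependency; the class and method names here are illustrative stand-ins):

```python
import abc

class GeneratorBasedBuilderToy(abc.ABC):
    """Stand-in for datasets.GeneratorBasedBuilder."""

    @abc.abstractmethod
    def _generate_examples(self, filepaths):
        """Abstract on the base class: yields (key, example) pairs."""

class NewDatasetToy(GeneratorBasedBuilderToy):
    def _generate_examples(self, filepaths):
        # concrete implementation: set breakpoints inside this body
        for i, path in enumerate(filepaths):
            yield i, {"file": path}

    def call_generate_examples(self, filepaths):
        # drain the generator so the breakpoints actually fire
        return [pair for pair in self._generate_examples(filepaths)]
```

Running NewDatasetToy().call_generate_examples(["a.txt", "b.txt"]) steps through the concrete generator body exactly the way the debug main block above does for the real builder.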