Is there a suggested way of debugging dataset generators?

Hi,

The Datasets Hub is very helpful in providing a lot of existing datasets. However, sometimes I need to use a dataset with a different format, which is determined by the _generate_examples() method. Is there a way to set a breakpoint in the IDE and debug this method directly? This seems complicated to me, since the dataset scripts are copied to a dynamic path (containing hash codes) before being loaded, so breakpoints set in the original file never trigger a pause at all.

Thanks in advance,
Shao


Right now you can only set breakpoints in the copy of the script that is located in the HF datasets modules cache (by default at ~/.cache/huggingface/modules/datasets_modules), because this is where the script is imported from.
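To find that copy quickly, you can search the modules cache from Python. This is a minimal stdlib sketch; the helper name find_cached_scripts and the default cache path are assumptions (override cache_root if your HF_MODULES_CACHE points elsewhere):

```python
from pathlib import Path

def find_cached_scripts(dataset_name, cache_root=None):
    """Locate copies of a dataset script inside the HF modules cache.

    Scripts are copied to <cache>/datasets/<name>/<hash>/<name>.py,
    so we glob recursively for <name>.py under the datasets/ subtree.
    """
    root = Path(cache_root) if cache_root else (
        Path.home() / ".cache/huggingface/modules/datasets_modules"
    )
    return sorted(root.glob(f"datasets/**/{dataset_name}.py"))
```

Open the returned path(s) in your IDE and set the breakpoints there, not in the original file.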


Another approach would be to clone the dataset repository and work on (and debug) your files locally:

  1. Clone
    git clone https://huggingface.co/datasets/<dataset_namespace>/<dataset_name> /path/to/some/local/directory
    
  2. Edit your local file at /path/to/some/local/directory/<dataset_name>
  3. Test loading from your local file
    ds = load_dataset("/path/to/some/local/directory/<dataset_name>",...)
    

To debug an existing script with minimal changes, and to make it easily debuggable for everyone thereafter, you can simply work around the abstraction in datasets.GeneratorBasedBuilder.

:handshake: Tip: if you add this simple edit to your own dataset loading scripts, they become directly debuggable for everyone without affecting the normal operation of load_dataset().

Steps:

  1. download the original dataset script you want to debug/edit
  2. add print(filepaths) to _generate_examples() to print the cached file paths (under .cache/huggingface/) needed to locate the data in debug mode
  3. call the original script once to download the data and print the data file names
from datasets import load_dataset
# this will download the data once, cache it and print the location of the cached files
load_dataset('the_loading_script_name', 'the-config')
  4. add a call_generate_examples() method to the datasets.GeneratorBasedBuilder subclass in the script
  5. create an instance of the NewDataset(datasets.GeneratorBasedBuilder) class and call call_generate_examples(file_paths)

Minimal code example:

# some_dataset_reader.py
# in the dataset class (that inherits from datasets.GeneratorBasedBuilder),
# call the concrete implementation of _generate_examples
class NewDataset(datasets.GeneratorBasedBuilder):

    # STEP 2: print the file paths in the cache, needed for step 5
    def _generate_examples(self, filepaths):
        print("Cached PATHS -- copy into STEP 5:", filepaths)
        ...  # rest unchanged, or a temporary exit() for step 3

    # STEP 4: add this method to call _generate_examples(); calling the method
    # on the base class directly would hit the abstract (empty) _generate_examples()
    def call_generate_examples(self, filepaths):
        # initialize the generator and drain it,
        # passing in the paths from the cache (see STEP 2)
        return [_ for _ in self._generate_examples(filepaths)]

# STEP 5: add a main block, which is ignored during load_dataset()
# start the debugger here / in this .py file
if __name__ == '__main__':
    print("DEBUGGING")
    ds = NewDataset(config_name='the_config.name')  # params as required by the original dataset; optional parameters are ignored
    # paste in what print("Cached PATHS -- copy into STEP 5:", filepaths) printed on the terminal
    # mind the nesting ([], [[]]) to match the original implementation
    examples = ds.call_generate_examples([['/.../.cache/huggingface/datasets/downloads/somelonghash', '/.../.cache/huggingface/datasets/downloads/extracted/...']])
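Why the wrapper method is needed can be seen in isolation: _generate_examples() is declared abstract on the base class, so it must be invoked through an instance of the concrete subclass, and the generator must be drained for breakpoints inside it to fire. A self-contained toy version of the same pattern (no datasets dependency; the class and method names here are illustrative stand-ins):

```python
import abc

class GeneratorBasedBuilderToy(abc.ABC):
    """Stand-in for datasets.GeneratorBasedBuilder."""

    @abc.abstractmethod
    def _generate_examples(self, filepaths):
        """Abstract on the base class: yields (key, example) pairs."""

class NewDatasetToy(GeneratorBasedBuilderToy):
    def _generate_examples(self, filepaths):
        # concrete implementation: set breakpoints inside this body
        for i, path in enumerate(filepaths):
            yield i, {"file": path}

    def call_generate_examples(self, filepaths):
        # drain the generator so the breakpoints actually fire
        return [pair for pair in self._generate_examples(filepaths)]
```

Running NewDatasetToy().call_generate_examples(["a.txt", "b.txt"]) steps through the concrete generator body exactly the way the debug main block above does for the real builder.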