Is there a suggested way of debugging dataset generators?


The datasets hub is very helpful because it provides a lot of existing datasets. However, sometimes I need a dataset in a different format, and the format is decided by the _generate_examples() method. Is there a way of adding a breakpoint in the IDE and directly debugging this method? It seems complicated to me, since the dataset scripts are copied to a dynamic path (with hash codes) before being loaded, so breakpoints added to the original file never trigger a pause.

Right now you can only set breakpoints in the copy of the script that is located in the HF datasets modules cache (by default at ~/.cache/huggingface/modules/datasets_modules), because this is where the script is imported from.
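
If it helps, here is a small sketch for locating that cached copy so you can open it in the IDE. The helper name find_cached_scripts is made up for illustration, and it assumes the cached script keeps the name <dataset_name>.py somewhere below the modules cache; adjust the pattern if your cache is laid out differently.

```python
from pathlib import Path


def find_cached_scripts(dataset_name, modules_cache=None):
    """Return cached copies of a dataset script, newest first (hypothetical helper)."""
    if modules_cache is None:
        # Default location quoted above; it differs if the cache was relocated.
        modules_cache = Path.home() / ".cache" / "huggingface" / "modules" / "datasets_modules"
    root = Path(modules_cache)
    if not root.is_dir():
        return []
    # Assumption: the copy is named <dataset_name>.py. The hashed directory name
    # is not known in advance, so search recursively.
    matches = [p for p in root.rglob(f"{dataset_name}.py") if p.is_file()]
    # Newest first, so the most recently loaded copy is at index 0.
    return sorted(matches, key=lambda p: p.stat().st_mtime, reverse=True)


if __name__ == "__main__":
    for path in find_cached_scripts("my_dataset"):  # "my_dataset" is a placeholder
        print(path)
```

Set the breakpoint in the first path it prints, then run load_dataset() under the debugger as usual.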


Another approach would be to clone the dataset repository and work on (and debug) your files locally:

  1. Clone
    git clone https://huggingface.co/datasets/<dataset_namespace>/<dataset_name> /path/to/some/local/directory
  2. Edit your local file at /path/to/some/local/directory/<dataset_name>
  3. Test loading from your local file
    ds = load_dataset("/path/to/some/local/directory/<dataset_name>",...)
To debug an existing script with minimal changes, and to make it easily debuggable for everyone thereafter, you can work around the abstraction in datasets.GeneratorBasedBuilder.

Tip: if you add this simple edit to your own dataset loading scripts, they become directly debuggable for everyone without affecting the normal operation of load_dataset().


  1. Download the original dataset script you want to debug or edit.
  2. Add print(filepaths) to _generate_examples() to print the cached paths (.cache/huggingface/path_to_file_hash) needed to locate the data in debug mode.
  3. Call the original script once to download the data and print the data file names:
from datasets import load_dataset
# this will download the data once, cache it, and print the location of the cached files
load_dataset('the_loading_script_name', 'the-config')
  4. Add a call_generate_examples() method to the class in the script that inherits from datasets.GeneratorBasedBuilder.
  5. Create an instance of the NewDataset(datasets.GeneratorBasedBuilder) class and call call_generate_examples(file_paths) on it.

Minimal code example:

# In the dataset class (which inherits from datasets.GeneratorBasedBuilder),
# call the concrete implementation of _generate_examples()
import datasets


class NewDataset(datasets.GeneratorBasedBuilder):

    # STEP 2: print the cached file paths; they are needed in STEPs 3 to 5
    def _generate_examples(self, filepaths):
        print("Cached PATHS -- copy into STEP 5:", filepaths)
        ...  # rest unchanged, or a temporary exit() for STEP 3

    # STEP 4: add this method to call _generate_examples(), because calling it
    # directly points to an abstract (empty) _generate_examples()
    def call_generate_examples(self, filepaths):  # add this block to NewDataset(datasets.GeneratorBasedBuilder)
        # Iterate over the generator so that its body actually runs;
        # pass in the paths from the cache (see STEP 2)
        for _ in self._generate_examples(filepaths):
            pass


# STEP 5: add a main block, which is ignored during load_dataset()
# Start the debugger here, in this .py file
if __name__ == '__main__':
    ds = NewDataset(config_name='')  # params as required by the original dataset, but optional parameters get ignored
    # Paste in here what print("Cached PATHS -- copy into STEP 5:", filepaths) printed on the terminal
    # Mind [] vs. [[]] to match the original implementation
    ds.call_generate_examples([['/.../.cache/huggingface/datasets/downloads/somelonghash', '/.../.cache/huggingface/datasets/downloads/extracted/...']])
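
One optional refinement, sketched with a toy class so it runs without the datasets library: for large files you may not want to run the whole generator in a debug session. ToyBuilder and the limit parameter are hypothetical and only illustrate the pattern; in a real script the method would live on your own builder class.

```python
from itertools import islice


class ToyBuilder:
    """Hypothetical stand-in for a datasets.GeneratorBasedBuilder subclass."""

    def _generate_examples(self, filepaths):
        # A real script would open and parse each file; here each path
        # simply becomes one (key, example) pair.
        for key, path in enumerate(filepaths):
            yield key, {"path": path}

    def call_generate_examples(self, filepaths, limit=None):
        # Generators are lazy: nothing inside _generate_examples() runs until
        # it is iterated. islice stops after `limit` examples (None means all).
        return list(islice(self._generate_examples(filepaths), limit))


if __name__ == "__main__":
    builder = ToyBuilder()
    first = builder.call_generate_examples(["a.json", "b.json", "c.json"], limit=2)
    print(first)  # [(0, {'path': 'a.json'}), (1, {'path': 'b.json'})]
```

Returning the examples instead of discarding them also lets you inspect them in the debugger's variable view after the call.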