Is there a suggested way of debugging dataset generators?


The datasets hub is very helpful because it provides a lot of existing datasets. However, sometimes I need a dataset in a different format, and the format is determined by the _generate_examples() method. Is there a way to set a breakpoint in the IDE and debug this method directly? It seems complicated to me because the dataset scripts are copied to a dynamic path (with hash codes) before being loaded, so breakpoints set in the original file never trigger.

Thanks in advance.


Right now you can only set breakpoints in the copy of the script located in the HF datasets modules cache (by default ~/.cache/huggingface/modules/datasets_modules), because that is where the script is imported from.
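To find which copy to open in the IDE, something like the following might help. It is only a sketch: it assumes the cached copy keeps the script's file name somewhere under the default directory mentioned above, and `find_cached_scripts` and the name "squad" are made up for illustration, not part of the datasets library.

```python
from pathlib import Path

# Default location quoted above; adjust it if your cache lives elsewhere.
DEFAULT_MODULES_DIR = (
    Path.home() / ".cache" / "huggingface" / "modules" / "datasets_modules"
)


def find_cached_scripts(script_name, modules_dir=DEFAULT_MODULES_DIR):
    """Return every cached copy of `<script_name>.py`, newest first.

    Hypothetical helper: it searches the whole tree rather than relying
    on the exact (hash-based) directory layout.
    """
    modules_dir = Path(modules_dir)
    if not modules_dir.is_dir():
        return []
    matches = modules_dir.rglob(f"{script_name}.py")
    # Several hashed copies may exist; the most recently modified one is
    # usually the one that was just loaded.
    return sorted(matches, key=lambda p: p.stat().st_mtime, reverse=True)


# "squad" is only an example name; use the name of your dataset script.
for path in find_cached_scripts("squad"):
    print(path)
```

Open the first path printed and set the breakpoint there. Note that editing the original script may produce a new hashed copy, so the search may need to be repeated.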


Another approach would be to clone the dataset repository and work on (and debug) your files locally:

  1. Clone
    git clone https://huggingface.co/datasets/<dataset_namespace>/<dataset_name> /path/to/some/local/directory
  2. Edit your local file at /path/to/some/local/directory/<dataset_name>
  3. Test loading from your local file
    ds = load_dataset("/path/to/some/local/directory/<dataset_name>", ...)
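While stepping through the local script, it can be convenient to pull only the first few examples so the debug session stays short. A rough sketch of that idea in plain Python follows. It does not use the datasets library at all: `peek_examples` and `toy_generate_examples` are hypothetical stand-ins, and the only assumption carried over is that `_generate_examples` yields (key, example) pairs.

```python
from itertools import islice


def peek_examples(generate_examples, n=3, **gen_kwargs):
    """Return the first n (key, example) pairs from a generator function."""
    return list(islice(generate_examples(**gen_kwargs), n))


# Hypothetical stand-in for a dataset script's _generate_examples method.
def toy_generate_examples(lines):
    for idx, line in enumerate(lines):
        # Set an IDE breakpoint (or call breakpoint()) on the next line
        # to inspect each example as it is built.
        yield idx, {"text": line.strip()}


first = peek_examples(toy_generate_examples, n=2, lines=["a\n", "b\n", "c\n"])
print(first)
```

Because the generator is consumed lazily, only the requested number of examples is built, so a breakpoint inside the loop is hit just that many times.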