Hi,
The datasets hub is very helpful in providing a lot of existing datasets. However, sometimes I need to use a dataset with a different format, which is determined by the _generate_examples() method. Is there a way to set a breakpoint in the IDE and debug this method directly? It seems complicated to me, since the dataset scripts are copied to a dynamic path (containing hash codes) before being loaded, so setting breakpoints in the original file won't trigger any pauses at all.
Thanks in advance,
Shao
Right now you can only set breakpoints in the copy of the script located in the HF datasets modules cache (by default at ~/.cache/huggingface/modules/datasets_modules), because this is where the script is imported from.
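If you don't want to dig through the cache by hand, a short stdlib-only sketch can list the cached copies of a script. This is not part of the datasets API; the function name and the assumption that scripts sit under datasets/<name>/<hash>/<name>.py in the default cache root are illustrative.

```python
from pathlib import Path


def find_cached_scripts(dataset_name, cache_root=None):
    """Return cached copies of `dataset_name`'s loading script (assumed layout)."""
    root = cache_root or Path.home() / ".cache" / "huggingface" / "modules" / "datasets_modules"
    # assumption: scripts are stored as datasets/<name>/<hash>/<name>.py
    return sorted(root.glob(f"datasets/**/{dataset_name}.py"))


if __name__ == "__main__":
    for path in find_cached_scripts("squad"):
        print(path)  # open this file in your IDE and set breakpoints there
```

If the cache root or dataset is missing, the function simply returns an empty list, so it is safe to run speculatively.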
Another approach would be to clone the dataset repository and work on (and debug) your files locally:
- Clone
git clone https://huggingface.co/datasets/<dataset_namespace>/<dataset_name> /path/to/some/local/directory
- Edit your local file at /path/to/some/local/directory/<dataset_name>
- Test loading from your local file
ds = load_dataset("/path/to/some/local/directory/<dataset_name>", ...)
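The reason breakpoints work after cloning is that Python now imports the exact file you edit. The stdlib-only demo below illustrates this with a stand-in "loading script" written to a temporary directory (the file name and generator are placeholders, not the datasets API): the imported module's __file__ points at the local, editable copy.

```python
import importlib.util
import tempfile
from pathlib import Path

# stand-in for a dataset loading script you cloned and can edit
script = """
def _generate_examples(filepaths):
    for idx, path in enumerate(filepaths):  # set a breakpoint here
        yield idx, {"file": path}
"""

with tempfile.TemporaryDirectory() as tmp:
    local_file = Path(tmp) / "my_dataset.py"
    local_file.write_text(script)
    # import the script from its local path, as load_dataset would
    spec = importlib.util.spec_from_file_location("my_dataset", local_file)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    print(module.__file__)  # the local, editable copy
    print(list(module._generate_examples(["a.json"])))
```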
To debug an existing script with minimal changes, and to make it easily debuggable for everyone thereafter, you can simply work around the abstraction in datasets.GeneratorBasedBuilder.
Tip: if you add this simple edit to your own dataset loading scripts, they become directly debuggable for everyone without affecting the normal operation of load_dataset().
Steps:
- Download the original dataset script you want to debug/edit.
- Add print(filepaths) to _generate_examples() to print the .cache/huggingface/path_to_file_hash paths needed to locate the data for debug mode.
- Call the original script once to download the data and print the data file names:
from datasets import load_dataset
# this will download the data once, cache it and print the location of the cached files
load_dataset('the_loading_script_name', 'the-config')
- Add a call_generate_examples() method to the datasets.GeneratorBasedBuilder subclass in the script.
- Create an instance of the NewDataset(datasets.GeneratorBasedBuilder) class and call call_generate_examples(file_paths).
Minimal code example:
# some_dataset_reader.py
# in the dataset class (that inherits from datasets.GeneratorBasedBuilder) call the concrete implementation of _generate_examples
class NewDataset(datasets.GeneratorBasedBuilder):

    # for STEP 2: print the filepaths in cache for steps 3, 4, 5
    def _generate_examples(self, filepaths):
        print("Cached PATHS -- copy into STEP 5:", filepaths)
        ...  # rest unchanged, or a temporary exit() for step 3

    # for STEP 4: add this method to call `_generate_examples()`, because
    # calling it directly on the base class hits an abstract (empty) `_generate_examples()`
    def call_generate_examples(self, filepaths):
        # initialize the generator and
        # pass in the paths from the cache (see STEP 2)
        [_ for _ in self._generate_examples(filepaths)]

# for STEP 5, add a main block, which is ignored during `load_dataset()`
# start the debugger here / in this .py
if __name__ == '__main__':
    print("DEBUGGING")
    ds = NewDataset(config_name='the_config.name')  # params as required by the original dataset; optional parameters are ignored
    # paste in here what print("Cached PATHS -- copy into STEP 5:", filepaths) printed to the terminal
    # mind [], [[]] to match the original implementation
    ds.call_generate_examples([['/.../.cache/huggingface/datasets/downloads/somelonghash', '/.../.cache/huggingface/datasets/downloads/extracted/...']])
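The pattern above can be tried in isolation without datasets installed. This stdlib-only toy (ToyBuilder and its trivial generator are illustrative, not the real API) shows why a small driver method is enough: driving the generator makes breakpoints inside _generate_examples fire.

```python
class ToyBuilder:
    """Stand-in for a GeneratorBasedBuilder subclass."""

    def _generate_examples(self, filepaths):
        # stand-in for the real parsing logic; set a breakpoint in this body
        for idx, path in enumerate(filepaths):
            yield idx, {"file": path}

    def call_generate_examples(self, filepaths):
        # drives the generator to exhaustion so the debugger steps through it
        return [example for _, example in self._generate_examples(filepaths)]


if __name__ == "__main__":
    builder = ToyBuilder()
    print(builder.call_generate_examples(["a.json", "b.json"]))
    # → [{'file': 'a.json'}, {'file': 'b.json'}]
```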