Data files not working with custom loading script and dataset

Hi! I have a small test dataset of reinforcement learning agent episodes at Howuhh/nle_hf_dataset with a simple structure:

  1. a separate dir with metadata, one JSON file per episode
  2. a separate dir with data, one HDF5 file per episode

I wrote a custom loading script, which loads the data according to the config name, "metadata" or "data". I want to be able to load only some parts of the data based on metadata filtering, so I need data_files to work, like this:

load_dataset("Howuhh/nle_hf_dataset", "metadata", data_files=["metadata/2.json"])

I tested it locally with the path to the script instead of the dataset name, and it works and loads only the specified parts. However, when I try this with the dataset name, it fails with this error:

--> 293     raise FileNotFoundError(error_msg)
    294 return sorted(out)

FileNotFoundError: Unable to find '' at /Users/a.p.nikulin/All/nle_hf_dataset/https:/

Why did it search in such a strange path?
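One way a path of this shape can arise is when a URL is treated as a relative path and joined onto a local base directory, after which path normalization collapses the `//` in `https://` into a single `/`. The sketch below only illustrates that effect with the standard library; the URL is an assumed example, and I have not confirmed that this is exactly what the library does internally.

```python
import posixpath

# Local base directory taken from the error message above.
base_path = "/Users/a.p.nikulin/All/nle_hf_dataset"

# Assumed example URL (the actual value is not shown in the traceback).
url = "https://huggingface.co/datasets/Howuhh/nle_hf_dataset/resolve/main/metadata/2.json"

# The URL does not start with "/", so join() treats it as a relative path,
# and normpath() then collapses the double slash after "https:".
joined = posixpath.normpath(posixpath.join(base_path, url))
print(joined)
```

The result starts with the base directory followed by `https:/`, which matches the prefix seen in the FileNotFoundError.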

Interestingly, it works for the allenai/c4 dataset from the documentation example:

load_dataset("allenai/c4", name="en", data_files=["en/c4-train.00000-of-01024.json.gz"])

From the debugger I see that at some point during execution data_files is transformed into absolute paths with URLs.

However, this is done only for datasets without a loading script, in dataset_module_factory and HubDatasetModuleFactoryWithoutScript. Later, data_files is popped from the builder kwargs with the correct formatting here:

# datasets/
    dataset_module = dataset_module_factory(

    # Get dataset builder class from the processing script
    builder_cls = import_main_class(dataset_module.module_path)
    builder_kwargs = dataset_module.builder_kwargs
    data_files = builder_kwargs.pop("data_files", data_files)  # <-- HERE: it stays relative for datasets with a loading script!
    config_name = builder_kwargs.pop("config_name", name)
    hash = builder_kwargs.pop("hash")

How, then, should relative paths be handled for datasets with a custom loading script?

Hi! It looks like data_files is only implemented for datasets without loading scripts. Can you open an issue on GitHub about this?

Actually, they kinda work, but only when you pass an absolute URL, not a relative path (and manually filter in the loading script based on data_files). So the real issue is the path formatting: the base path should be prepended to relative paths for datasets with custom scripts too.
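For anyone who needs the workaround in the meantime, here is a minimal sketch of it, under a few assumptions: files are served under the usual `https://huggingface.co/datasets/<repo>/resolve/<revision>/<path>` pattern, the revision is `main`, and `to_absolute_urls` and `select_requested` are hypothetical helper names of mine, not part of the datasets library. The episode file names other than `metadata/2.json` are made up for illustration.

```python
from urllib.parse import quote

# Assumed URL pattern for files in a Hub dataset repository.
HUB_URL = "https://huggingface.co/datasets/{repo_id}/resolve/{revision}/{path}"


def to_absolute_urls(repo_id, data_files, revision="main"):
    """Turn repo-relative paths into absolute Hub URLs (hypothetical helper)."""
    urls = []
    for path in data_files:
        if path.startswith(("http://", "https://")):
            urls.append(path)  # already absolute, leave untouched
        else:
            urls.append(
                HUB_URL.format(
                    repo_id=repo_id,
                    revision=quote(revision, safe=""),
                    path=quote(path.lstrip("/")),
                )
            )
    return urls


def select_requested(all_paths, requested_urls, repo_id, revision="main"):
    """Keep only the repo-relative paths whose absolute URL was requested."""
    requested = set(requested_urls)
    absolute = to_absolute_urls(repo_id, all_paths, revision)
    return [path for path, url in zip(all_paths, absolute) if url in requested]


# Caller side: build absolute URLs before passing them as data_files.
urls = to_absolute_urls("Howuhh/nle_hf_dataset", ["metadata/2.json"])

# Loading-script side: filter the known files by what was requested.
# The file list here is invented for the example.
known = ["metadata/1.json", "metadata/2.json", "metadata/3.json"]
selected = select_requested(known, urls, "Howuhh/nle_hf_dataset")
```

The resulting `urls` list would then be passed as `data_files=urls` to load_dataset, and the loading script would download only the entries in `selected`.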