Interestingly, it works for the allenai/c4 dataset from the documentation example:
load_dataset("allenai/c4", name="en", data_files=["en/c4-train.00000-of-01024.json.gz"])
From the debugger I can see that at some point during execution data_files is transformed into an absolute URL:
https://huggingface.co/datasets/allenai/c4/resolve/607bd4c8450a42878aa9ddc051a65a055450ef87/en/c4-train.00000-of-01024.json.gz
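To illustrate the transformation, here is a minimal sketch that rebuilds the URL above from the repository id, the commit hash, and the relative path. The helper name `to_hub_url` and the template constant are hypothetical; they only mirror the URL shape seen in the debugger and are not the library's internal code.

```python
from urllib.parse import quote

# Assumption: URL layout inferred from the resolved URL observed in the debugger.
HUB_URL_TEMPLATE = "https://huggingface.co/datasets/{repo_id}/resolve/{revision}/{path}"


def to_hub_url(repo_id: str, revision: str, relative_path: str) -> str:
    """Hypothetical helper: turn a repo-relative path into an absolute Hub URL."""
    return HUB_URL_TEMPLATE.format(
        repo_id=repo_id,
        revision=quote(revision, safe=""),    # escape the revision as one path segment
        path=quote(relative_path, safe="/"),  # keep "/" as the directory separator
    )


url = to_hub_url(
    "allenai/c4",
    "607bd4c8450a42878aa9ddc051a65a055450ef87",
    "en/c4-train.00000-of-01024.json.gz",
)
print(url)
```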
However, this is done only for datasets without a loading script, in dataset_module_factory and HubDatasetModuleFactoryWithoutScript. Later, data_files is popped from the builder kwargs, already correctly resolved, here:
# datasets/load.py
dataset_module = dataset_module_factory(
    path,
    revision=revision,
    download_config=download_config,
    download_mode=download_mode,
    data_dir=data_dir,
    data_files=data_files,
)
# Get dataset builder class from the processing script
builder_cls = import_main_class(dataset_module.module_path)
builder_kwargs = dataset_module.builder_kwargs
data_files = builder_kwargs.pop("data_files", data_files)  # <-- HERE: it stays relative for datasets with a loading script!
config_name = builder_kwargs.pop("config_name", name)
hash = builder_kwargs.pop("hash")
How, then, should relative paths be handled for datasets with a custom loading script?
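One possible workaround, sketched below, is to make the paths absolute on the caller's side before they reach the loading script. This is only a sketch under assumptions: `absolutize` is a hypothetical helper, the default revision `"main"` and the repository id `user/my_dataset` are placeholders, and whether the script accepts URLs in data_files depends on how that script consumes them.

```python
from typing import Dict, List, Union

DataFiles = Union[str, List[str], Dict[str, Union[str, List[str]]]]


def absolutize(data_files: DataFiles, repo_id: str, revision: str = "main") -> DataFiles:
    """Hypothetical helper: prefix repo-relative paths with the Hub resolve URL."""
    base = f"https://huggingface.co/datasets/{repo_id}/resolve/{revision}/"

    def fix(path: str) -> str:
        # Leave anything that already carries a scheme (https://, s3://, ...) untouched.
        return path if "://" in path else base + path.lstrip("/")

    if isinstance(data_files, str):
        return fix(data_files)
    if isinstance(data_files, dict):
        # Split name -> path(s); recurse so each value can be a str or a list.
        return {split: absolutize(files, repo_id, revision) for split, files in data_files.items()}
    return [fix(path) for path in data_files]


resolved = absolutize({"train": ["data/train-0.json.gz"]}, "user/my_dataset")
print(resolved)
# Intended use (not executed here; repository id is a placeholder):
# load_dataset("user/my_dataset", data_files=resolved)
```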