map/filter require a dataset to be constructed, so I wouldn’t call using them inside a dataset script (before building the dataset) “tweaking”. We can consider making it easier to specify (and share) post-processing methods if we get more requests like this.
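For example, a rough sketch of doing the tweaking as post-processing after the dataset is built (the script name and column names are placeholders):

```python
from datasets import load_dataset

# Build the dataset first, then post-process it with map/filter.
ds = load_dataset("my_script.py", split="train")
ds = ds.filter(lambda ex: ex["text"] != "")            # drop empty rows
ds = ds.map(lambda ex: {"n_chars": len(ex["text"])})   # add a derived column
```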
I could use glob for this, which would work locally. Or I could use a function to list the files in the repo, which would work on the hub.
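A rough sketch of both options (the repo id and file patterns are placeholders, not from the original post):

```python
import glob
from huggingface_hub import HfApi

# Locally: glob the files on disk.
local_files = glob.glob("data/*.parquet")

# On the Hub: list the repo files and filter them.
repo_files = HfApi().list_repo_files("user/my-dataset", repo_type="dataset")
parquet_files = [f for f in repo_files if f.endswith(".parquet")]
```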
Globbing will become consistent once we start using huggingface_hub.HfFileSystem in datasets (very soon).
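If I understand correctly, that would make something like this hedged sketch work the same way for Hub repos as for local paths (the repo id is a placeholder):

```python
from huggingface_hub import HfFileSystem

fs = HfFileSystem()
# Glob Parquet files inside a dataset repo on the Hub.
files = fs.glob("datasets/user/my-dataset/**.parquet")
```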
I tried to inherit from the Parquet builder, but it was like swimming upstream.
It really doesn’t seem like it should be this hard to go from a dataset without a loading script to one with a loading script…
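For reference, a loading script over Parquet shards doesn’t have to inherit from the Parquet builder; this is only a hedged sketch under assumed file names and class name, not the approach from my attempt:

```python
import pyarrow.parquet as pq
import datasets


class MyDataset(datasets.ArrowBasedBuilder):
    def _info(self):
        # Features are inferred from the Parquet schema if not given.
        return datasets.DatasetInfo()

    def _split_generators(self, dl_manager):
        files = dl_manager.download(["data/train-00000-of-00001.parquet"])
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN, gen_kwargs={"files": files}
            )
        ]

    def _generate_tables(self, files):
        # Yield (key, pyarrow.Table) pairs, one per shard.
        for i, f in enumerate(files):
            yield i, pq.read_table(f)
```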
Making this simple is the goal of https://github.com/huggingface/datasets/pull/5331.
Another solution is to make a new dataset without a loading script, and load that dataset from a dataset with a loading script.
The only issue is that this caches the dataset twice (should be OK for smaller datasets).
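A hedged sketch of that workaround (the repo id and split are placeholders): the script-backed dataset simply re-yields the no-script dataset, which is why it ends up cached twice.

```python
import datasets


class Wrapped(datasets.GeneratorBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo()

    def _split_generators(self, dl_manager):
        # First cache: the dataset without a loading script.
        raw = datasets.load_dataset("user/raw-dataset", split="train")
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN, gen_kwargs={"raw": raw}
            )
        ]

    def _generate_examples(self, raw):
        # Second cache: the examples re-written by this builder,
        # with any tweaking applied here.
        for i, example in enumerate(raw):
            yield i, example
```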