How to tweak a dataset without a loading script?

map/filter require an already-constructed dataset, so I wouldn’t call using them inside a dataset script (before the dataset is built) “tweaking”. We can consider making it easier to specify (and share) post-processing methods if we get more requests like this.

I could use glob for this, which would work locally. Or I could use a function to list the files in the repo, which would work on the hub.
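A rough sketch of the two options, not a definitive implementation. The helper names, the `*.parquet` pattern, and the assumption that the data files are Parquet are all illustrative. The only library call used is `huggingface_hub.list_repo_files`, imported lazily so the local path works without the package installed.

```python
import fnmatch
import glob
import os


def list_local_files(data_dir, pattern="*.parquet"):
    """List matching files under a local directory (recursively) with glob."""
    return sorted(glob.glob(os.path.join(data_dir, "**", pattern), recursive=True))


def filter_repo_files(filenames, pattern="*.parquet"):
    """Keep only the repo-relative paths that match the pattern."""
    # fnmatch's "*" also matches "/", so files in subfolders are included.
    return sorted(f for f in filenames if fnmatch.fnmatch(f, pattern))


def list_hub_files(repo_id, pattern="*.parquet"):
    """List matching files in a dataset repo on the Hub (needs network access)."""
    # Lazy import: only required when listing files on the Hub.
    from huggingface_hub import list_repo_files

    return filter_repo_files(list_repo_files(repo_id, repo_type="dataset"), pattern)
```

Locally this would be called as `list_local_files("path/to/data")`, and on the Hub as `list_hub_files("user/dataset_name")` (both arguments are placeholders). Note that the first returns filesystem paths and the second returns paths relative to the repo root.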

Globbing will become consistent once we start using huggingface_hub.HfFileSystem in datasets (very soon).

I tried to inherit from the Parquet builder, but it was like swimming upstream.

It really doesn’t seem like it should be this hard to go from a dataset without a loading script to one with a loading script…

Making this simple is the goal of https://github.com/huggingface/datasets/pull/5331.

Another solution is to make a new dataset without a loading script, and load that dataset from a dataset with a loading script.

The only issue is that this caches the dataset twice (should be OK for smaller datasets).