map/filter require a dataset to be constructed, so I wouldn’t call using them inside a dataset script (before building the dataset) “tweaking”. We can consider making it easier to specify (and share) post-processing methods if we get more requests like this.
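For example, a rough sketch of doing the tweaking as post-processing after the dataset is built (the script name and column names are placeholders):

```python
from datasets import load_dataset

# Build the dataset first, then post-process it with map/filter.
ds = load_dataset("my_script.py", split="train")
ds = ds.filter(lambda ex: ex["text"] != "")            # drop empty rows
ds = ds.map(lambda ex: {"n_chars": len(ex["text"])})   # add a derived column
```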
I could use glob for this, which would work locally. Or I could use a function to list the files in the repo, which would work on the hub.
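A rough sketch of both options (the repo id and file patterns are placeholders, not from the original post):

```python
import glob
from huggingface_hub import HfApi

# Locally: glob the files on disk.
local_files = glob.glob("data/*.parquet")

# On the Hub: list the repo files and filter them.
repo_files = HfApi().list_repo_files("user/my-dataset", repo_type="dataset")
parquet_files = [f for f in repo_files if f.endswith(".parquet")]
```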
Globbing will become consistent once we start using huggingface_hub.HfFileSystem in datasets (very soon).
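If I understand correctly, that would make something like this hedged sketch work the same way for Hub repos as for local paths (the repo id is a placeholder):

```python
from huggingface_hub import HfFileSystem

fs = HfFileSystem()
# Glob Parquet files inside a dataset repo on the Hub.
files = fs.glob("datasets/user/my-dataset/**.parquet")
```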
I tried to inherit from the Parquet builder, but it was like swimming upstream.
It really doesn’t seem like it should be this hard to go from a dataset without a loading script to one with a loading script…
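For reference, a loading script over Parquet shards doesn’t have to inherit from the Parquet builder; this is only a hedged sketch under assumed file names and class name, not the approach from my attempt:

```python
import pyarrow.parquet as pq
import datasets


class MyDataset(datasets.ArrowBasedBuilder):
    def _info(self):
        # Features are inferred from the Parquet schema if not given.
        return datasets.DatasetInfo()

    def _split_generators(self, dl_manager):
        files = dl_manager.download(["data/train-00000-of-00001.parquet"])
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN, gen_kwargs={"files": files}
            )
        ]

    def _generate_tables(self, files):
        # Yield (key, pyarrow.Table) pairs, one per shard.
        for i, f in enumerate(files):
            yield i, pq.read_table(f)
```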
Making this simple is the goal of https://github.com/huggingface/datasets/pull/5331.
Another solution is to make a new dataset without a loading script, and load that dataset from a dataset with a loading script.
The only issue is that this caches the dataset twice (should be OK for smaller datasets).
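A hedged sketch of that workaround (the repo id and split are placeholders): the script-backed dataset simply re-yields the no-script dataset, which is why it ends up cached twice.

```python
import datasets


class Wrapped(datasets.GeneratorBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo()

    def _split_generators(self, dl_manager):
        # First cache: the dataset without a loading script.
        raw = datasets.load_dataset("user/raw-dataset", split="train")
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN, gen_kwargs={"raw": raw}
            )
        ]

    def _generate_examples(self, raw):
        # Second cache: the examples re-written by this builder,
        # with any tweaking applied here.
        for i, example in enumerate(raw):
            yield i, example
```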