Hi!
I'm trying to modify a dataset on HF that contains subsets, essentially processing it and re-uploading it, e.g. this one: Salesforce/lotsa_data · Datasets at Hugging Face
What I'm currently doing is downloading the parquet files for each subset, processing each subset independently, storing the results in an S3 folder, and then using the CLI to upload the folder. This does not result in a dataset with subsets...
What's the fastest way to reproduce the structure of the original dataset after processing?
The latter is in Japanese, so it may be a little difficult to read. Running it through Google Translate should help.
I think that in the past, writing a loading script was the best approach. Now, it may be smarter to use a custom DatasetBuilder.
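For reference, here is a minimal sketch of another route that might be simpler than either of those. As far as I know, the Hub can also pick up subsets from a `configs` list in the YAML header of the repo's README.md, with one folder of parquet files per subset. The subset names, the folder layout, and the `build_configs_yaml` helper below are hypothetical, and the sketch assumes each subset has a single `train` split.

```python
def build_configs_yaml(subset_files):
    """Build a README.md YAML header declaring one config per subset.

    subset_files maps a subset name to a glob pattern (relative to the
    repository root) that matches that subset's parquet files.
    """
    lines = ["---", "configs:"]
    for name, pattern in sorted(subset_files.items()):
        lines.append(f"- config_name: {name}")
        lines.append("  data_files:")
        lines.append("  - split: train")  # assumption: one train split per subset
        lines.append(f'    path: "{pattern}"')
    lines.append("---")
    return "\n".join(lines) + "\n"


# Hypothetical subset names and folder layout, for illustration only.
subsets = {
    "subset_a": "subset_a/*.parquet",
    "subset_b": "subset_b/*.parquet",
}

readme_header = build_configs_yaml(subsets)
print(readme_header)
```

If you save that header at the top of README.md in the folder you upload, next to the `subset_a/` and `subset_b/` directories, each folder should show up as its own subset. I have not tested this against that particular dataset, so treat it as a starting point.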