I am curious whether we can download a dataset and save it back to Parquet with the same file structure as the original dataset repository on the HF Hub.
For example, fineweb-edu's sample/10BT subset has 13 parquet files:
# download the dataset
from datasets import load_dataset

fw = load_dataset(
    "HuggingFaceFW/fineweb-edu",
    name="sample-10BT",
    split="train",
    num_proc=10,
    cache_dir=cache_dir,
)
fw.to_parquet(parquet_dir)
This loads the files as Arrow even though they are originally Parquet, and to_parquet() then writes everything back out as a single Parquet file.
So my questions are:
How can I download/load and save them with the same file structure as in the repository?
How can I create multiple parquet files with to_parquet() instead of a single big parquet file?
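For reference, this is roughly what I am hoping to end up with — a sketch that manually shards the loaded dataset and writes one parquet file per shard (the shard count of 13 and the file-naming pattern are just my assumptions to mimic the original layout); I am wondering if there is a built-in way to do this instead:

import os

num_shards = 13  # assumed, to mirror the number of files in the original repo
os.makedirs(parquet_dir, exist_ok=True)
for idx in range(num_shards):
    # take a contiguous slice of the dataset and write it as its own parquet file
    shard = fw.shard(num_shards=num_shards, index=idx)
    shard.to_parquet(os.path.join(parquet_dir, f"{idx:03d}_00000.parquet"))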