Load Dataset and Save as Parquet

I am curious whether we can save a dataset directly as Parquet, keeping the same file layout as the original dataset on the Hugging Face Hub.

For example, fineweb-edu/sample/10BT consists of 13 Parquet files, but:

# download the dataset
# (repo id and config name inferred from the fineweb-edu sample/10BT mentioned above)
from datasets import load_dataset

fw = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT", split="train")

loads these files as Arrow even though they are originally Parquet, and when saving with to_parquet() it converts the Arrow data into a single Parquet file.

So my questions are:
How can I download/load and save them with the same structure as in the repository?
How can I create multiple Parquet files instead of a single big Parquet file with to_parquet()?

Hi! You can download a repository (and specify a subset of files) using the huggingface-cli 🙂
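For example, a sketch along these lines should fetch only the sample/10BT Parquet files while preserving the repo layout (the repo id HuggingFaceFW/fineweb-edu and the sample/10BT path are taken from the question; the output directory name is arbitrary):

```shell
# Download only the sample/10BT Parquet files, keeping the repository structure
huggingface-cli download HuggingFaceFW/fineweb-edu \
  --repo-type dataset \
  --include "sample/10BT/*.parquet" \
  --local-dir fineweb-edu
```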

Also, right now the datasets lib doesn't support writing to multiple Parquet files, though it would be amazing to contribute this feature!

Maybe the API could look something like this?

fw.to_parquet("path/to/parquet/", max_shard_size="500MB")