I am curious whether we can download a dataset and save it back to Parquet with the same file structure as the original dataset repository on the HF Hub.
For example, fineweb-edu's sample/10BT subset has 13 parquet files:
# download the dataset
from datasets import load_dataset

fw = load_dataset(
    "HuggingFaceFW/fineweb-edu",
    name="sample-10BT",
    split="train",
    num_proc=10,
    cache_dir=cache_dir,
)
fw.to_parquet(parquet_dir)
This loads the files as Arrow even though they are originally Parquet, and to_parquet() then writes everything back out as a single Parquet file.
So my questions are:
How can I download/load and save them with the same file structure as in the repository?
How can I create multiple parquet files with to_parquet() instead of a single big parquet file?
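For reference, this is roughly what I am hoping to end up with — a sketch that manually shards the loaded dataset and writes one parquet file per shard (the shard count of 13 and the file-naming pattern are just my assumptions to mimic the original layout); I am wondering if there is a built-in way to do this instead:

import os

num_shards = 13  # assumed, to mirror the number of files in the original repo
os.makedirs(parquet_dir, exist_ok=True)
for idx in range(num_shards):
    # take a contiguous slice of the dataset and write it as its own parquet file
    shard = fw.shard(num_shards=num_shards, index=idx)
    shard.to_parquet(os.path.join(parquet_dir, f"{idx:03d}_00000.parquet"))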