How to prepare and upload a large CSV dataset using Git LFS or push_to_hub

I’m planning to upload around 50GB of CSV files to my Hugging Face dataset, and I wonder what’s the proper way to push them?
Should we use push_to_hub or Git LFS? And what’s the proper way to process the CSV files before uploading?

Hi! You’ll probably get better performance (faster uploads) by using Git LFS. push_to_hub stores the data in the compressed Parquet format, which can save a lot of bandwidth, but it doesn’t (currently) use a git-based workflow, which results in slower upload speeds in most situations.
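
To make the push_to_hub option concrete, here is a minimal sketch; the data_files glob, the repo id, and the max_shard_size value are placeholders you would adapt to your own setup:

```python
from datasets import load_dataset

# Read the local CSV files into a single Dataset.
# "data/*.csv" and "username/my-csv-dataset" are placeholders.
dataset = load_dataset("csv", data_files="data/*.csv", split="train")

# push_to_hub converts the data to compressed Parquet shards and uploads them;
# max_shard_size controls how large each uploaded shard is.
dataset.push_to_hub("username/my-csv-dataset", max_shard_size="500MB")
```

Note that load_dataset first converts the CSVs into an Arrow cache on disk, so you'll want roughly the dataset's size in free space before pushing. The Git LFS route instead uses the standard git CLI against the dataset repo (git lfs install, then add, commit, and push the CSV files, making sure they are tracked by LFS), which keeps the files on the Hub exactly as they are locally.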