Can we upload datasets with a total size like the pile?

Severine · December 10, 2021, 10:02am

Hi!
We want to create a very large text dataset inspired by the Pile. To distribute it, we would like to upload it to Hugging Face Datasets. Are there any size limits for uploaded datasets?

Thank you for your responses!

mariosasko · December 10, 2021, 12:38pm

Hi!

That’s so cool! No, we don’t have any limitations in terms of size. As the dataset you are working on is pretty big, would you be interested in collaborating with us? This way it will be easier for us to help you. Also, could you tell us a bit more about your project? Feel free to let me know here or via e-mail (mario@huggingface.co).

cc @thomwolf

Severine · December 10, 2021, 11:00pm

Hi!

Thank you for your quick and positive reply! We would love to collaborate on this. The end goal of the project is to create a Pile-sized corpus for Nordic languages such as Swedish, Finnish, Danish,… We already have some data, but the idea is to set up a framework that lets a community easily add new LM data.

The current plan is to upload the individual datasets as separate (sub)datasets on the HF Hub, which we can then combine into one big dataset with a general script or a data formatting pipeline. If you are interested in helping us, it would be great if you joined the discussion in our Discord channel (#the-nordic-pile): AI Nordics, or sent me an email at severine.verlinden@ai.se!
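To make the "combine with a general script" idea concrete, here is a rough sketch of what such a script could look like. It is only an illustration under some assumptions, not our actual pipeline: every sub-dataset is assumed to have a `text` column and a `train` split, and the names `tag_source`, `combine_subsets` and `load_subsets_from_hub` are hypothetical helpers made up for this example. The only real library call is `load_dataset(..., streaming=True)` from 🤗 Datasets, which reads the data lazily instead of downloading everything first.

```python
from itertools import chain
from typing import Dict, Iterable, Iterator, List


def tag_source(name: str, records: Iterable[dict]) -> Iterator[dict]:
    """Yield each record with an added 'source' field naming its sub-dataset."""
    for record in records:
        # Assumption: every sub-dataset exposes its content in a 'text' column.
        yield {"text": record["text"], "source": name}


def combine_subsets(subsets: Dict[str, Iterable[dict]]) -> Iterator[dict]:
    """Lazily chain all sub-datasets into one stream, in insertion order."""
    return chain.from_iterable(
        tag_source(name, records) for name, records in subsets.items()
    )


def load_subsets_from_hub(repo_ids: List[str]) -> Dict[str, Iterable[dict]]:
    """Open each Hub repository as a streaming dataset.

    Assumption: each repository has a 'train' split. The import lives here
    so the rest of the sketch runs without the `datasets` library installed.
    """
    from datasets import load_dataset

    return {
        repo_id: load_dataset(repo_id, split="train", streaming=True)
        for repo_id in repo_ids
    }


# Small in-memory stand-ins for sub-datasets, so the sketch runs offline.
demo_subsets = {
    "swedish_demo": [{"text": "Hej världen"}, {"text": "God morgon"}],
    "danish_demo": [{"text": "Hej verden"}],
}
combined = list(combine_subsets(demo_subsets))
for row in combined:
    print(row["source"], "->", row["text"])
```

With real data, the dictionary passed to `combine_subsets` would come from `load_subsets_from_hub([...])` with the actual repository IDs. Keeping a `source` field on every record makes it possible to filter or re-weight individual sub-datasets later on.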
