Can we upload datasets with a total size like the Pile?

Hi!
We want to create a very large text dataset inspired by the Pile. To distribute it, we would like to upload it to Hugging Face Datasets. Are there any size limitations for uploading datasets?

Thank you for your responses!

Hi!

That’s so cool! No, we don’t have any limitations in terms of size. As the dataset you are working on is pretty big, would you be interested in collaborating with us? This way it will be easier for us to help you. Also, could you tell us a bit more about your project? Feel free to let me know here or via e-mail (mario@huggingface.co).

cc @thomwolf

Hi!

Thank you for your quick and positive reply! We would love to collaborate on this. The end goal of the project is to create a Pile-sized corpus for Nordic languages such as Swedish, Finnish, Danish, … We already have some data, but the idea is to set up a framework that allows a community to easily add new LM data.

The current plan is to upload the individual datasets as separate (sub)datasets on the HF Hub, which we can then combine into one big dataset with a general script or a data formatting pipeline. If you are interested in helping us, it would be great if you joined the discussion on our Discord server, AI Nordics (channel #the-nordic-pile), or sent me an email at severine.verlinden@ai.se!
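To make the "general script" idea a bit more concrete, here is a minimal sketch of how the combining step could look. It uses only the Python standard library and in-memory lists as stand-ins for the (sub)datasets; the names `tag_source`, `combine_subdatasets`, `swedish_wiki`, and `danish_news` and the `text`/`source` fields are made up for illustration and are not an agreed format. In practice, the `datasets` library's `concatenate_datasets` or `interleave_datasets` would probably do this job on datasets loaded from the Hub.

```python
from itertools import chain
from typing import Dict, Iterable, Iterator


def tag_source(records: Iterable[dict], source: str) -> Iterator[dict]:
    """Yield each record with the name of the sub-dataset it came from."""
    for record in records:
        yield {"text": record["text"], "source": source}


def combine_subdatasets(subdatasets: Dict[str, Iterable[dict]]) -> Iterator[dict]:
    """Lazily chain all sub-datasets into one stream of tagged records."""
    return chain.from_iterable(
        tag_source(records, name) for name, records in subdatasets.items()
    )


# Hypothetical sub-datasets; real ones would be loaded from the Hub.
subdatasets = {
    "swedish_wiki": [{"text": "Hej världen"}],
    "danish_news": [{"text": "Hej verden"}, {"text": "God morgen"}],
}

combined = list(combine_subdatasets(subdatasets))
for record in combined:
    print(record["source"], "|", record["text"])
```

Keeping a `source` field on every record means the combined corpus can still be filtered or re-weighted per sub-dataset later, and because the records are chained lazily, nothing has to fit in memory at once.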
