Thank you for your quick and positive reply! We would love to collaborate on this. The end goal of the project is to create a Pile-sized corpus for the Nordic languages, such as Swedish, Finnish, Danish, … We already have some data, but the idea is to set up a framework that lets a community easily add new LM data.
The current plan is to upload the individual datasets as separate (sub)datasets on the HF hub, which we can then combine into one big dataset with a general script or a data formatting pipeline. If you are interested in helping us, it would be great if you joined our Discord channel (#the-nordic-pile) on AI Nordics and took part in the discussion, or sent me an email at email@example.com!
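To make the "general script" idea a bit more concrete, here is a minimal sketch of how separate (sub)datasets could be mapped onto one shared record format and merged. Everything in it is an assumption for illustration: the `text`/`lang`/`source` schema, the `normalize` and `combine` helpers, and the toy in-memory subsets are hypothetical stand-ins, not an agreed design. In practice the records would come from the (sub)datasets on the HF hub rather than from Python lists.

```python
from typing import Dict, Iterator, List

# Hypothetical shared schema: every record ends up with exactly these keys.
def normalize(record: Dict[str, str], source: str, lang: str,
              text_key: str = "text") -> Dict[str, str]:
    """Map one raw record onto the assumed common schema."""
    return {"text": record[text_key].strip(), "lang": lang, "source": source}

def combine(subsets: List[dict]) -> Iterator[Dict[str, str]]:
    """Yield normalized records from every subset, skipping empty texts."""
    for spec in subsets:
        text_key = spec.get("text_key", "text")
        for record in spec["records"]:
            row = normalize(record, spec["source"], spec["lang"], text_key)
            if row["text"]:
                yield row

# Toy stand-ins for (sub)datasets; names and contents are made up.
subsets = [
    {"source": "sv_example", "lang": "sv",
     "records": [{"text": "Hej världen "}]},
    {"source": "da_example", "lang": "da", "text_key": "content",
     "records": [{"content": "Hej verden"}, {"content": "   "}]},
]

combined = list(combine(subsets))
for row in combined:
    print(row)
```

Keeping a per-subset description (source name, language, and which field holds the text) separate from the merging logic is one way to let contributors add a new dataset without touching the script itself.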