Pretrain GPT-2 from scratch for German on the novel GC4 dataset
To the best of our knowledge, there is no GPT-J model trained on a German corpus.
2. Language
The model will be trained in German.
3. Model
GPT-J
4. Datasets
The German colossal, cleaned Common Crawl corpus: GC4 Corpus — German NLP Group documentation (see the loading sketch below).
Another potential source of data is the corpus used by the Digitale Bibliothek/Münchener Digitalisierungszentrum to pretrain their German GPT-2 variant.
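A minimal sketch of how the corpus could be pulled in with 🤗 datasets in streaming mode, so nothing has to be downloaded in full up front. The Hub identifier `german-nlp-group/german_common_crawl` is an assumption and would need to be checked against wherever GC4 actually ends up being hosted:

```python
from datasets import load_dataset

# Hub identifier is an assumption; substitute whatever name GC4 is published under.
gc4 = load_dataset(
    "german-nlp-group/german_common_crawl",
    split="train",
    streaming=True,  # iterate over the corpus without downloading it in full
)

# Peek at a single document to inspect the schema.
first_doc = next(iter(gc4))
print(first_doc.keys())
```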
5. Training scripts
6. Challenges
The sheer amount of data makes both preprocessing and training challenging.
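One way this could be kept tractable is to tokenize lazily on the stream instead of materializing a tokenized copy of the whole corpus. A rough sketch, where the dataset ID, the `raw_content` column name, and the placeholder `gpt2` tokenizer are all assumptions:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Placeholder tokenizer; a German BPE tokenizer would be trained for the real run.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Dataset ID and the "raw_content" text column are assumptions.
raw = load_dataset(
    "german-nlp-group/german_common_crawl", split="train", streaming=True
)

def tokenize(batch):
    return tokenizer(batch["raw_content"], truncation=True, max_length=1024)

# map() on a streaming dataset is lazy: documents are tokenized only as
# the training loop consumes them, so no giant preprocessed copy is needed.
tokenized = raw.map(tokenize, batched=True)

for example in tokenized.take(2):
    print(len(example["input_ids"]))
```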
7. Desired project outcome
A competitive German language model, and a demo based on the HF SvelteKit inference demos
8. Team members
Christopher Akiki (myself) and Alina Mailach (@mailach)
This sounds like a more or less finished team already - cool! Let's note it down on Monday.
I like the idea and think it’d be cool to make this project happen!
GPT-J and GPT-2 are very similar, so it might actually be easiest to just use the FlaxGPT2 implementation. If you think you require model parallelism, however, the official GPT-J repo could be the way to go.
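For reference, starting from the FlaxGPT2 implementation with random weights could look roughly like this; the GPT-2-small hyperparameters below are purely illustrative placeholders:

```python
import jax.numpy as jnp
from transformers import GPT2Config, FlaxGPT2LMHeadModel

# GPT-2-small sized config, purely illustrative; vocab_size would come
# from whatever German tokenizer gets trained on the corpus.
config = GPT2Config(
    vocab_size=50257,
    n_positions=1024,
    n_embd=768,
    n_layer=12,
    n_head=12,
)

# Building the Flax model from a config (instead of from_pretrained)
# yields randomly initialized weights, i.e. a from-scratch starting point.
model = FlaxGPT2LMHeadModel(config, seed=42, dtype=jnp.float32)

# Token embedding table: (vocab_size, n_embd)
print(model.params["transformer"]["wte"]["embedding"].shape)
```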
Hi @stefan-it ! That would be fantastic! You’re also more than welcome to join our efforts if you don’t already have a project to work on.
We could potentially mix both training sets for better diversity. Common Crawl does sometimes tend to be noisy, and GC4 includes a quality field of sorts, if I'm not mistaken.
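If both sources end up on the Hub, the mixing could be sketched with `interleave_datasets`. The second dataset ID, the quality column name, and the mixing probabilities below are all assumptions:

```python
from datasets import load_dataset, interleave_datasets

# Both IDs are assumptions; the second one is purely hypothetical.
gc4 = load_dataset(
    "german-nlp-group/german_common_crawl", split="train", streaming=True
)
dbmdz_corpus = load_dataset(
    "dbmdz/german-gpt2-corpus", split="train", streaming=True  # hypothetical ID
)

# If GC4 really ships a per-document quality signal, noisier documents
# could be dropped before mixing (the column name here is a guess).
gc4_filtered = gc4.filter(lambda ex: not ex.get("quality_warnings"))

# Mixing ratios are arbitrary placeholders.
mixed = interleave_datasets(
    [gc4_filtered, dbmdz_corpus], probabilities=[0.7, 0.3], seed=42
)
```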
Added everybody to the official sheet here. I also added you, @stefan-it; not sure if you've decided to join the effort, but feel free to leave a comment in the Google sheet and I can remove you if necessary.