PreTrain GPT-2 from scratch for German on novel GC4 dataset
To the best of our knowledge, there is no GPT-J model trained on a German corpus.
The model will be trained in German.
The German colossal, cleaned Common Crawl corpus. GC4 Corpus — German NLP Group documentation
Another potential data source is the corpus used by the Digitale Bibliothek/Münchener Digitalisierungszentrum to pretrain their German GPT-2 variant.
5. Training scripts
The sheer amount of data makes both training and preprocessing challenging.
7. Desired project outcome
A competitive German language model, and a demo based on the HF SvelteKit inference demos.
8. Team members
Christopher Akiki (myself) and Alina Mailach (@mailach)
More information on German models trained by Digitale Bibliothek/Münchener Digitalisierungszentrum,
together with relevant papers, can be found here.
This sounds like a more or less finished team already, cool! Let’s note it down on Monday.
I like the idea and think it’d be cool to make this project happen!
GPT-J and GPT-2 are very similar, so it might actually be easiest to just use the FlaxGPT2 implementation. If you think you require model parallelism, however, the official GPT-J repo could be the way to go.
In any case, the run_clm_flax.py script (transformers/run_clm_flax.py at master · huggingface/transformers · GitHub) should be helpful here.
Also for the massive amount of data, maybe dataset streaming could help? Load a Dataset in Streaming mode — datasets 1.8.0 documentation
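To make the streaming idea a bit more concrete, here is a minimal sketch in plain Python. It assumes the corpus is exposed as an iterable of records with a `text` field (the field name and the `fake_corpus` stand-in are hypothetical, not taken from GC4). With the `datasets` library, the iterable would instead come from passing `streaming=True` to `load_dataset`, so that nothing has to be downloaded or held in memory up front.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def batched_texts(records: Iterable[Dict[str, str]], batch_size: int) -> Iterator[List[str]]:
    """Yield lists of at most `batch_size` texts without materializing `records`."""
    iterator = iter(records)
    while True:
        # islice pulls only the next `batch_size` records from the stream.
        batch = [record["text"] for record in islice(iterator, batch_size)]
        if not batch:
            return
        yield batch


def fake_corpus(n: int) -> Iterator[Dict[str, str]]:
    """Hypothetical stand-in for a streamed dataset: one document at a time."""
    for i in range(n):
        yield {"text": f"Dokument {i}"}


if __name__ == "__main__":
    for batch in batched_texts(fake_corpus(5), batch_size=2):
        print(batch)
```

The point of the design is that tokenization or any other preprocessing can consume one batch at a time, so memory use depends on the batch size rather than on the size of the corpus.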
Hi guys, I can provide the data that we’ve used for the GPT-2 variant, if needed
(It is “only” 16GB, so it’s ok for a kind of baseline, I think)
Hi @stefan-it ! That would be fantastic! You’re also more than welcome to join our efforts if you don’t already have a project to work on.
We could potentially mix both training sets for better diversity. The Common Crawl does sometimes tend to be noisy, and GC4 includes a quality field of sorts, if I’m not mistaken.
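As a rough sketch of what mixing and quality filtering could look like, assuming each record is a dict and the quality score lives under a `quality` key (the key name, the threshold, and the mixing probability are all assumptions for illustration, not the actual GC4 schema):

```python
import random
from typing import Any, Dict, Iterable, Iterator


def keep_high_quality(records: Iterable[Dict[str, Any]], min_quality: float) -> Iterator[Dict[str, Any]]:
    """Lazily drop records whose (hypothetical) `quality` score is below the threshold."""
    for record in records:
        if record.get("quality", 0.0) >= min_quality:
            yield record


def mix(first: Iterable[Any], second: Iterable[Any], p_first: float, seed: int = 0) -> Iterator[Any]:
    """Randomly interleave two streams, drawing from `first` with probability `p_first`.

    Stops as soon as the stream chosen for the next draw is exhausted.
    """
    rng = random.Random(seed)
    it_first, it_second = iter(first), iter(second)
    while True:
        source = it_first if rng.random() < p_first else it_second
        try:
            yield next(source)
        except StopIteration:
            return


if __name__ == "__main__":
    crawl = [{"text": "a", "quality": 0.9}, {"text": "b", "quality": 0.2}]
    curated = [{"text": "x"}, {"text": "y"}]
    for record in mix(keep_high_quality(crawl, 0.5), curated, p_first=0.5):
        print(record["text"])
```

Both functions are generators, so they compose with a streamed corpus without loading either training set fully. The `datasets` library also has an `interleave_datasets` function for a similar purpose, which may be the more convenient route in practice.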
Thank you for the feedback Patrick!
I think we would rather work with the code you provided.
It also seems that the data is already part of the Hub, which makes ingestion/pre-processing considerably easier than we initially thought. german-nlp-group/german_common_crawl · Datasets at Hugging Face
Time to officially define this project!
Added everybody in the official sheet here. I also added you, @stefan-it. Not sure if you’ve decided to join the effort, but feel free to leave a comment in the Google sheet and I can remove you if necessary.
Hi @stefan-it ! It would be amazing if you could provide us with the data! Do you have it stored and can make it accessible?
Hi @mailach ,
I’ve just sent out a mail with the link incl. some thoughts about vocab generation
This is really helpful! Thank you so much!