PreTrain GPT-2 from scratch for German on novel GC4 dataset

To the best of our knowledge, there is no GPT-J model trained on a German corpus.

2. Language

The model will be trained in German.

3. Model

GPT-J

4. Datasets

GC4, the German colossal, cleaned Common Crawl corpus (see "GC4 Corpus" in the German NLP Group documentation).
Another potential source is the data that the Digitale Bibliothek/Münchener Digitalisierungszentrum used to pretrain their German GPT-2 variant.

5. Training scripts

6. Challenges

The ridiculous amount of data makes both training and preprocessing challenging.

7. Desired project outcome

A competitive German language model and a demo based on the HF SvelteKit inference demos.

8. Team members

Christopher Akiki (myself) and Alina Mailach (@mailach)

More information on the German models trained by the Digitale Bibliothek/Münchener Digitalisierungszentrum, together with the relevant papers, can be found here.

This sounds like a more or less finished team already - cool :slight_smile: Let’s note it down on Monday :slight_smile:
I like the idea and think it’d be cool to make this project happen!

GPT-J and GPT-2 are very similar, so it might actually be easiest to just use the FlaxGPT2 implementation :slight_smile: If you think you need model parallelism, however, the official GPT-J repo could be the way to go :slight_smile:

In any case, the run_clm_flax.py script (transformers/run_clm_flax.py at master · huggingface/transformers · GitHub) should be helpful here :slight_smile:

Also, for the massive amount of data, maybe dataset streaming could help? See "Load a Dataset in Streaming mode" in the datasets 1.8.0 documentation.
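
To make the streaming idea concrete, here is a minimal pure-Python sketch of what streaming buys you: records are read lazily, shuffled through a bounded buffer, and batched, so memory use does not grow with corpus size. This is not the datasets API. `fake_corpus`, `shuffle_buffer`, and `batches` are hypothetical names used only for illustration, and the record layout (`{"text": ...}`) is an assumption. In practice, the iterable returned by `load_dataset(..., streaming=True)` could presumably take the place of `fake_corpus()`.

```python
import itertools
import random
from typing import Dict, Iterable, Iterator, List


def fake_corpus() -> Iterator[Dict[str, str]]:
    """Hypothetical stand-in for a streamed corpus: yields records lazily, without end."""
    for i in itertools.count():
        yield {"text": f"Dokument {i}"}


def shuffle_buffer(
    stream: Iterable[Dict[str, str]], buffer_size: int, seed: int = 0
) -> Iterator[Dict[str, str]]:
    """Approximately shuffle a stream while holding at most buffer_size records."""
    rng = random.Random(seed)
    buffer: List[Dict[str, str]] = []
    for record in stream:
        if len(buffer) < buffer_size:
            buffer.append(record)  # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]  # emit a random buffered record ...
        buffer[idx] = record  # ... and put the new one in its slot
    rng.shuffle(buffer)  # the stream ended: flush what is left
    yield from buffer


def batches(
    stream: Iterable[Dict[str, str]], batch_size: int
) -> Iterator[List[Dict[str, str]]]:
    """Group a stream into lists of up to batch_size records."""
    it = iter(stream)
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            return
        yield batch


# Only the records needed for three batches (plus the buffer) are ever produced.
first_batches = list(
    itertools.islice(batches(shuffle_buffer(fake_corpus(), 100), 8), 3)
)
```

A larger buffer gives a better approximation of a full shuffle at the cost of more memory; the buffer size of 100 here is arbitrary.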

Hi guys, I can provide the data that we’ve used for the GPT-2 variant, if needed :slight_smile:

(It is “only” 16 GB, so it’s OK as a kind of baseline, I think.)

Hi @stefan-it ! That would be fantastic! You’re also more than welcome to join our efforts if you don’t already have a project to work on.

We could potentially mix both training sets for better diversity. Common Crawl data does tend to be noisy at times, and GC4 includes a quality field of sorts, if I’m not mistaken.
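
A rough sketch of what that mixing could look like, assuming both corpora can be read as streams of dict records. Everything here is hypothetical: the `quality` key, the threshold, and the 75/25 mixing ratio are placeholders rather than properties of GC4 or of the other dataset.

```python
import itertools
import random
from typing import Any, Dict, Iterable, Iterator

Record = Dict[str, Any]


def filter_by_quality(
    records: Iterable[Record], min_score: float, key: str = "quality"
) -> Iterator[Record]:
    """Keep records whose (hypothetical) quality score is at least min_score."""
    for record in records:
        # Records without a score are treated as score 0.0 and dropped.
        if record.get(key, 0.0) >= min_score:
            yield record


def mix(
    primary: Iterable[Record],
    secondary: Iterable[Record],
    p_primary: float,
    seed: int = 0,
) -> Iterator[Record]:
    """Draw from primary with probability p_primary, otherwise from secondary.

    Stops as soon as the source chosen for the next draw is exhausted,
    so the remainder of the other source is left unread.
    """
    rng = random.Random(seed)
    a, b = iter(primary), iter(secondary)
    while True:
        source = a if rng.random() < p_primary else b
        try:
            yield next(source)
        except StopIteration:
            return


# Toy data standing in for the two corpora.
crawl = ({"text": "crawl", "quality": 0.9} for _ in range(10_000))
library = ({"text": "library"} for _ in range(10_000))

mixed = list(
    itertools.islice(mix(filter_by_quality(crawl, 0.5), library, 0.75), 2000)
)
```

Filtering is applied only to the crawl stream, since the second corpus is not assumed to carry a quality score.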

Thank you for the feedback Patrick!

I think we would rather work with the code you provided.

It also seems that the data is already on the Hub, which makes ingestion and preprocessing considerably easier than we initially thought: german-nlp-group/german_common_crawl · Datasets at Hugging Face

Time to officially define this project! :slight_smile:

I added everybody to the official sheet here. I also added you, @stefan-it. I’m not sure whether you’ve decided to join the effort, but feel free to leave a comment in the Google sheet and I can remove you if necessary. :slight_smile:

Thank you Suzana!

Hi @stefan-it ! It would be amazing if you could provide us with the data! Do you have it stored and can make it accessible?

Hi @mailach ,

I’ve just sent out a mail with the link, including some thoughts about vocab generation :hugs:

This is really helpful! Thank you so much!
