PreTrain GPT-2 from scratch for German on novel GC4 dataset

christopher · June 24, 2021, 12:13pm


To the best of our knowledge, there is no GPT-J model trained on a German corpus.

2. Language

The model will be trained in German.

3. Model

GPT-J

4. Datasets

The German colossal, cleaned Common Crawl corpus (GC4), described in the German NLP Group documentation.
Another potential source is the data that the Digitale Bibliothek/Münchener Digitalisierungszentrum used to pretrain their German GPT-2 variant.

5. Training scripts

6. Challenges

The sheer amount of data makes both training and preprocessing challenging.

7. Desired project outcome

A competitive German language model and a demo based on the HF SvelteKit inference demos.

8. Team members

Christopher Akiki (myself) and Alina Mailach (@mailach)

patrickvonplaten · June 25, 2021, 5:39pm

This sounds like a more or less finished team already, cool! Let’s note it down on Monday.
I like the idea and think it’d be cool to make this project happen!

GPT-J and GPT2 are very similar, so it might actually be easiest to just use the FlaxGPT2 implementation. If you think you require model parallelism, however, the official GPT-J repo could be the way to go.

In any case, the run_clm_flax.py script (transformers/run_clm_flax.py at master · huggingface/transformers · GitHub) should be helpful here.

Also, for the massive amount of data, maybe dataset streaming could help? See “Load a Dataset in Streaming mode” in the datasets 1.8.0 documentation.
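
A minimal sketch of what streaming could look like with the `datasets` library, assuming a version that supports `streaming=True` (1.8.0 or later). The helper names `stream_texts` and `batched` and the `text` field name are assumptions for illustration; the dataset you pass in may also need a configuration name, so check its dataset card first.

```python
from itertools import islice


def stream_texts(dataset_name, split="train", text_field="text"):
    """Lazily yield raw text from a Hub dataset without downloading it fully.

    `text_field` is an assumed column name; check the dataset card.
    """
    # Imported lazily so the batching helper below works without `datasets`.
    from datasets import load_dataset

    dataset = load_dataset(dataset_name, split=split, streaming=True)
    for example in dataset:
        yield example[text_field]


def batched(iterable, batch_size):
    """Group any iterable into lists of at most `batch_size` items."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Wrapping the stream as `batched(stream_texts(name), 1000)` would hand documents to a tokenizer 1,000 at a time, so memory use stays roughly constant regardless of corpus size.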

stefan-it · June 28, 2021, 7:37am

Hi guys, I can provide the data that we used for the GPT-2 variant, if needed.

(It is “only” 16GB, so it’s ok for a kind of baseline, I think)

christopher · June 28, 2021, 10:20am

Hi @stefan-it ! That would be fantastic! You’re also more than welcome to join our efforts if you don’t already have a project to work on.

We could potentially mix both training sets for better diversity. The Common Crawl does sometimes tend to be noisy, and GC4 includes a quality field of sorts, if I’m not mistaken.
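
A rough sketch of how such mixing and filtering might look on plain Python iterables. Everything here is hypothetical: the `quality` field name, the threshold, and the helpers `filter_by_quality` and `mix_streams` are illustrative assumptions, not the actual GC4 schema or an existing library API.

```python
import random


def filter_by_quality(examples, min_score, score_field="quality"):
    """Keep only examples whose (assumed) quality score is at least `min_score`."""
    for example in examples:
        if example.get(score_field, 0.0) >= min_score:
            yield example


def mix_streams(primary, secondary, primary_prob=0.5, seed=0):
    """Randomly interleave two iterables until both are exhausted.

    Order within each source is preserved; only the interleaving is random.
    """
    rng = random.Random(seed)
    iterators = [iter(primary), iter(secondary)]
    while iterators:
        if len(iterators) == 2:
            index = 0 if rng.random() < primary_prob else 1
        else:
            index = 0  # only one source left, drain it
        try:
            yield next(iterators[index])
        except StopIteration:
            del iterators[index]
```

Raising `primary_prob` would weight the mix towards the filtered Common Crawl data, while lowering it gives the smaller, cleaner corpus more influence early in training.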

christopher · June 28, 2021, 10:22am

Thank you for the feedback Patrick!

I think we would rather work with the code you provided.

It also seems that the data is already part of the Hub, which makes ingestion/pre-processing considerably easier than we initially thought: german-nlp-group/german_common_crawl · Datasets at Hugging Face

Suzana · June 29, 2021, 7:36am

Time to officially define this project!

I added everybody to the official sheet here. I also added you, @stefan-it; I’m not sure whether you’ve decided to join the effort, but feel free to leave a comment in the Google sheet and I can remove you if necessary.

christopher · June 29, 2021, 10:02am

Thank you Suzana!

stefan-it · July 2, 2021, 8:04pm

Hi @mailach ,

I’ve just sent out an email with the link, including some thoughts about vocab generation.
