Train GPT-2 from scratch in Italian

Hello, I’m looking for a guide or some help with training GPT-2 from scratch on a small corpus. In case it makes any difference, the corpus is in Italian, and I only know PyTorch.

About tokenizers: is the tokenizer somehow language-agnostic?

About the corpus: is it supposed to be formatted in a specific way? I’ve seen some corpus files formatted with a space before and after every punctuation symbol.
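To make clear which formatting I mean, here is a minimal sketch that reproduces it. The helper name and the set of punctuation symbols are my own assumptions for illustration. I don't know whether GPT-2 training actually needs this preprocessing.

```python
import re

# Hypothetical helper, only to illustrate the formatting I have seen.
# The punctuation set below is an assumption, not taken from any script.
_PUNCT = re.compile(r"([.,;:!?()\"])")


def space_out_punctuation(text: str) -> str:
    """Put a space before and after each punctuation symbol,
    then collapse repeated whitespace into single spaces."""
    spaced = _PUNCT.sub(r" \1 ", text)
    return re.sub(r"\s+", " ", spaced).strip()


print(space_out_punctuation("Ciao, come stai? Bene."))
# prints: Ciao , come stai ? Bene .
```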

Is this the recommended script?

Or this one?

I’ve found this guide, but it uses TensorFlow.