Hello, I’m looking for a guide or some help to train GPT-2 from scratch on a small corpus. If it makes any difference, the corpus is in Italian, and I only know PyTorch.
About tokenizers: is GPT-2’s tokenizer somehow language-agnostic, or does the language of the corpus matter?
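For context, my (possibly wrong) understanding is that GPT-2 uses byte-level BPE, which only looks at raw UTF-8 bytes, so nothing in it is specific to English. Here’s a toy, pure-Python sketch of just the pair-counting step of BPE — not the real GPT-2 tokenizer code, just to illustrate why accented Italian characters shouldn’t be a problem:

```python
from collections import Counter

def byte_pair_counts(corpus):
    """First step of byte-level BPE: count adjacent byte pairs in raw UTF-8.
    There is no language-specific logic here -- accented characters like
    'é' are just a sequence of bytes like any other."""
    pairs = Counter()
    for text in corpus:
        b = text.encode("utf-8")
        for left, right in zip(b, b[1:]):
            pairs[(left, right)] += 1
    return pairs

corpus = ["perché no", "perché sì"]
counts = byte_pair_counts(corpus)
# 'é' encodes to the two bytes 0xC3 0xA9, so that byte pair is counted
# once per occurrence of "perché"; a BPE trainer would eventually merge it.
```

Is this roughly the right mental model, i.e. training a byte-level BPE tokenizer directly on my Italian corpus should just work?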
About the corpus: is it supposed to be formatted in a specific way? I’ve seen some corpus files formatted with a space surrounding every punctuation symbol (before and after).
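To be concrete, my current plan (just a sketch of what I had in mind, and I may be wrong) is to leave the text exactly as written, with no space-padding around punctuation, and simply join documents with GPT-2’s `<|endoftext|>` separator:

```python
# My tentative plan: keep the raw text untouched (no spaces added around
# punctuation) and concatenate documents with GPT-2's end-of-text separator.
docs = [
    "Il gatto dorme, credo.",  # natural Italian text, left as-is
    "Che bella giornata!",
]
corpus = "<|endoftext|>".join(docs)
```

Is that enough, or is the pre-spaced punctuation style actually required for some tokenizers?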