Exhaustive list of changes across all touchpoints in the tokenization pipeline of LM training

Hi, I am writing code to train an LM from scratch on a custom dataset, following run_clm_no_trainer.py. I have a few questions that I could not find clear answers to in the docs. Please point me to references if I have missed them.

Each row in a batch should look like this:
BOS example_1 EOS BOS example_2 EOS … BOS example_n EOS PAD … PAD
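
To make the target layout concrete, here is a rough sketch in plain Python of what I would otherwise write by hand. The token IDs and the function name `pack_examples` are made up for illustration; this is not an existing library function. It packs already-tokenized examples greedily, and the optional `k` caps the number of examples per row.

```python
# Hypothetical special-token IDs, for illustration only.
BOS_ID, EOS_ID, PAD_ID = 0, 1, 2


def pack_examples(examples, context_len, k=None):
    """Greedily pack tokenized examples into rows of length context_len.

    Each example is wrapped as BOS ... EOS. A row is closed when the next
    example would not fit, or when it already holds k examples (if k is set).
    Examples that cannot fit in a single row are skipped in this sketch.
    """
    rows, current, count = [], [], 0
    for ex in examples:
        wrapped = [BOS_ID] + list(ex) + [EOS_ID]
        if len(wrapped) > context_len:
            continue  # too long for any row
        row_full = len(current) + len(wrapped) > context_len
        reached_k = k is not None and count == k
        if current and (row_full or reached_k):
            # Pad the finished row on the right and start a new one.
            rows.append(current + [PAD_ID] * (context_len - len(current)))
            current, count = [], 0
        current = current + wrapped
        count += 1
    if current:
        rows.append(current + [PAD_ID] * (context_len - len(current)))
    return rows


examples = [[10, 11], [12], [13, 14, 15]]
print(pack_examples(examples, context_len=8))       # as many examples as fit
print(pack_examples(examples, context_len=8, k=1))  # one example per row
```

The first call corresponds to use case (1) in Q1 below and the second to use case (2).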

  • Q1. Is there an efficient built-in method for grouping and padding examples as shown above? My use cases are: (1) pad so that the maximum number of examples fits in the context, and (2) pad so that k examples fit in the context (where k is a fixed natural number). run_clm_no_trainer.py has a custom function, group_texts, that does grouping without padding and could be modified. However, my use cases seem fairly standard, so I would expect a built-in method to exist.
  • Q2. [Design Choice] Is using a SEP token better than using BOS and EOS? What are the considerations here?
  • Q3. Which tokens do I need to register in the tokenizer and in the model? I have BOS, EOS, PAD and UNK (or SEP in place of BOS and EOS, depending on the answer to Q2). UNK goes into the tokenizer when instantiating the tokenizer’s model (WordLevel in my case). The BOS and EOS token IDs go in the model config (I am using GPT-2). What about PAD and SEP? Also, what changes are needed on my end if I want to use EOS as PAD?
  • Q4. HF won’t compute the loss for the predictions at special tokens, right? I read somewhere that the tokenizer sets the labels to -100 for special tokens, and that these positions are then ignored when computing the loss. Please confirm whether that is correct.
  • Q5. What should the token_type_ids be? Since I am using GPT-2, do I need to care about token_type_ids at all? If so, what does GPT-2 expect: the same token_type_id for all tokens, or a different token_type_id for each segment (i.e. “BOS example_k EOS” gets the k-th token_type_id) within a packed sequence, as in BERT?
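
For Q3 and Q4, the behaviour I am assuming can be sketched as follows. This is my current understanding and may well be wrong. The IDs and the helper name `make_labels` are hypothetical. The only fact I am relying on is that -100 is the default `ignore_index` of `torch.nn.CrossEntropyLoss`. The sketch masks by attention mask rather than by token ID, because if EOS doubles as PAD, masking every position equal to the PAD ID would also drop the real EOS tokens from the loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

# Hypothetical IDs; here EOS is reused as PAD, as asked about in Q3.
BOS_ID = 0
EOS_ID = 1
PAD_ID = EOS_ID


def make_labels(input_ids, attention_mask):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX.

    Positions with attention_mask == 0 are treated as padding. BOS and EOS
    inside the real content are left in the loss in this sketch; whether
    they should be is part of my question.
    """
    return [
        tok if keep == 1 else IGNORE_INDEX
        for tok, keep in zip(input_ids, attention_mask)
    ]


# One packed row: BOS 10 11 EOS PAD PAD, where PAD has the same ID as EOS.
input_ids = [BOS_ID, 10, 11, EOS_ID, PAD_ID, PAD_ID]
attention_mask = [1, 1, 1, 1, 0, 0]
print(make_labels(input_ids, attention_mask))
```

If the labels were instead built by comparing against the PAD ID, the EOS at position 3 would also be ignored in this setup, so the model would never be trained to emit EOS.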

I originally posted this on the Tokenizers forum here. I am reposting it because I think the question is generally important and the answers should help a lot of people. Please feel free to suggest a better venue for this discussion, if there is one. Thanks in advance; I would greatly appreciate any help on this topic.