Hi, I am writing code to train an LM from scratch on a custom dataset, following the `run_clm_no_trainer.py` script. Please help me with the following questions. I could not find clear answers in the docs, so please point me to references in case I have missed them.
An example in the batch should look like this:
BOS example_1 EOS BOS example_2 EOS … BOS example_n EOS PAD … PAD
- Q1. Is there an efficient method for grouping and padding data examples as shown above? My use cases: (1) pad so that the maximum number of data examples fits in the context, (2) pad so that exactly k data examples fit in the context (where k is a fixed natural number).
  `run_clm_no_trainer.py` has a custom method `group_texts` that does grouping (no padding) that can be modified. However, I think my use cases are pretty standard and some built-in method should exist.
- Q2. [Design Choice] Is using a single SEP token better than using BOS and EOS? What are the considerations here?
- Q3. What tokens do I need to update in the tokenizer and model? I have BOS, EOS, PAD and UNK (or SEP in place of BOS and EOS depending on the answer to Q2). UNK goes into the tokenizer while instantiating the tokenizer’s model (WordLevel in my case). BOS and EOS token IDs go in the model config (I am using GPT-2). What about PAD and SEP? Also, what changes are needed from my end if I want to use EOS as PAD?
- Q4. HF won’t compute the loss for the predictions at special tokens, right? I read somewhere that the tokenizer sets labels to `-100` for special tokens, which are then ignored while computing the loss. Please confirm if that is correct.
- Q5. What should the `token_type_ids` be? Since I am using GPT-2, should I care about `token_type_ids`? If yes, what does GPT-2 expect – same `token_type_ids` for all tokens or different `token_type_ids` for each segment (i.e. “BOS example_k EOS” gets k-th `token_type_id`) in a batch sentence like in BERT?
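To make Q1 and Q4 concrete, the layout I am after can be sketched in plain Python. This is only an illustration of the intended behavior, not a built-in from any library: the function names `pack_examples` and `mask_and_labels`, the parameter `context_len`, and the token ids for BOS, EOS and PAD are all hypothetical. With `k=None` it fills each row greedily (use case 1); with a fixed `k` it closes a row after `k` examples, or earlier if the next example would not fit (use case 2). The label masking assumes that only PAD positions should be excluded from the loss, using `-100`, which is the default `ignore_index` of PyTorch's `CrossEntropyLoss`.

```python
from typing import List, Optional, Tuple

# Hypothetical special-token ids, chosen only for this illustration.
BOS, EOS, PAD = 1, 2, 0


def pack_examples(
    examples: List[List[int]], context_len: int, k: Optional[int] = None
) -> List[List[int]]:
    """Greedily pack tokenized examples into rows of length context_len.

    Each example is wrapped as BOS ... EOS. A row is closed when the next
    example would not fit, or (if k is given) once it already holds k examples.
    Every closed row is right-padded with PAD up to context_len.
    """
    rows: List[List[int]] = []
    current: List[int] = []
    count = 0
    for ex in examples:
        wrapped = [BOS] + list(ex) + [EOS]
        if len(wrapped) > context_len:
            raise ValueError("example longer than context; truncate or split it first")
        does_not_fit = len(current) + len(wrapped) > context_len
        reached_k = k is not None and count == k
        if does_not_fit or reached_k:
            rows.append(current + [PAD] * (context_len - len(current)))
            current, count = [], 0
        current = current + wrapped
        count += 1
    if current:
        rows.append(current + [PAD] * (context_len - len(current)))
    return rows


def mask_and_labels(row: List[int]) -> Tuple[List[int], List[int]]:
    """Build an attention mask and labels for one packed row.

    PAD positions get attention 0 and label -100. This masks by token id,
    so it would also mask real EOS tokens if EOS were reused as PAD.
    """
    attention_mask = [0 if tok == PAD else 1 for tok in row]
    labels = [-100 if tok == PAD else tok for tok in row]
    return attention_mask, labels


if __name__ == "__main__":
    data = [[10, 11], [12], [13, 14, 15], [16]]
    for row in pack_examples(data, context_len=8):
        print(row, *mask_and_labels(row))
```

Note that masking by token id is exactly why reusing EOS as PAD (Q3) needs care: the mask would have to come from row lengths rather than from the ids.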
If the questions are more appropriate for the Transformers forum, please let me know. Thanks.