Seeking an end-to-end example of grouping, tokenization and padding to construct preprocessed data in HF

mdrpanwar · June 26, 2023, 12:19pm

Hi, I am writing code to train an LM from scratch on a custom dataset, following the run_clm_no_trainer.py script. Please help me with the following questions. I could not find clear answers in the docs, so please point me to references if I have missed them.

An example in the batch should look like this:
BOS example_1 EOS BOS example_2 EOS … BOS example_n EOS PAD … PAD

Q1. Is there an efficient method for grouping and padding data examples as shown above? My use cases: (1) pad such that the maximum number of data examples fits in the context, (2) pad such that exactly k data examples fit in the context (where k is a fixed natural number). run_clm_no_trainer.py has a custom method, group_texts, that does grouping without padding, and I could modify it. However, I think my use cases are pretty standard, so some built-in method should exist.
Q2. [Design Choice] Is using a SEP token better than using the BOS and EOS? What are the considerations here?
Q3. What tokens do I need to update in the tokenizer and model? I have BOS, EOS, PAD and UNK (or SEP in place of BOS and EOS depending on the answer to Q2). UNK goes into the tokenizer while instantiating the tokenizer’s model (WordLevel in my case). BOS and EOS token IDs go in the model config (I am using GPT-2). What about PAD and SEP? Also, what changes are needed from my end if I want to use EOS as PAD?
Q4. HF won’t compute the loss for the predictions at special tokens, right? I read somewhere that the tokenizer sets the labels to -100 for special tokens, and that these positions are ignored when computing the loss. Please confirm whether that is correct.
Q5. What should the token_type_ids be? Since I am using GPT-2, should I care about token_type_ids? If yes, what does GPT-2 expect – same token_type_ids for all tokens or different token_type_ids for each segment (i.e. “BOS example_k EOS” gets k-th token_type_id) in a batch sentence like in BERT?
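
For reference, the layout described in Q1 can be sketched in plain Python, without relying on any built-in HF helper. This is only a minimal sketch under stated assumptions: pack_examples and its parameters are hypothetical names (not part of Transformers or Tokenizers), the examples are assumed to be already tokenized into lists of IDs, and an example that does not fit in the context is rejected rather than split. Padding positions get the label -100, which is the default ignore_index of PyTorch's CrossEntropyLoss; BOS and EOS keep their own IDs as labels here, which is a design choice of the sketch and not a statement about what HF does.

```python
from typing import Dict, List, Optional

# Label value that PyTorch's CrossEntropyLoss skips by default (ignore_index=-100).
IGNORE_INDEX = -100


def pack_examples(
    examples: List[List[int]],
    block_size: int,
    bos_id: int,
    eos_id: int,
    pad_id: int,
    k: Optional[int] = None,
) -> List[Dict[str, List[int]]]:
    """Hypothetical helper: greedily pack 'BOS example EOS' segments into rows.

    With k=None a row takes as many examples as fit in block_size (use case 1).
    With an integer k a row takes at most k examples (use case 2).
    Every row is right-padded with pad_id up to block_size.
    """
    rows: List[Dict[str, List[int]]] = []
    current: List[int] = []  # token IDs collected for the row being built
    count = 0                # number of examples in the row being built

    def flush() -> None:
        nonlocal current, count
        if not current:
            return
        n_pad = block_size - len(current)
        rows.append({
            "input_ids": current + [pad_id] * n_pad,
            "attention_mask": [1] * len(current) + [0] * n_pad,
            # Padding positions are excluded from the loss; BOS/EOS are not.
            "labels": current + [IGNORE_INDEX] * n_pad,
        })
        current, count = [], 0

    for ex in examples:
        wrapped = [bos_id] + ex + [eos_id]
        if len(wrapped) > block_size:
            raise ValueError("example does not fit in block_size")
        row_is_full = len(current) + len(wrapped) > block_size
        if row_is_full or (k is not None and count == k):
            flush()
        current = current + wrapped
        count += 1
    flush()  # emit the last, possibly short, row
    return rows


# Toy IDs, assumed for illustration: PAD=0, BOS=1, EOS=2.
examples = [[10, 11], [12], [13, 14, 15]]
rows = pack_examples(examples, block_size=8, bos_id=1, eos_id=2, pad_id=0)
rows_k1 = pack_examples(examples, block_size=8, bos_id=1, eos_id=2, pad_id=0, k=1)
```

Rows built this way could be wrapped in a datasets.Dataset or a PyTorch Dataset and passed to the model as input_ids, attention_mask and labels. If EOS doubles as PAD, the explicit labels above still mask only the padding positions, because the mask is derived from position and not from the token ID.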

If the questions are more appropriate for the Transformers forum, please let me know. Thanks.
