Why does the falcon QLoRA tutorial code use eos_token as pad_token?

brando · August 21, 2023, 10:44pm

Small GPT2 code example

Yes I agree that pad is assigned to eos. Eos is still eos. But during fine-tuning now the weights wrt to eos are unchanged. This might be an issue since the probability of eos has not shifted to the fine-tuning regime. One possibility is that eos is outputed with less chance. Yes we can still halt production when we see eos but we’ve not shifted the probability to output eos according to our fine-tuning distribution – but all other tokens have changed distribution. I think this could be an issue because it’s not like the old probability of eos is conserved since all tokens probs have changed except eos + even if the old eos prob was conserved, it’s wrt wrong distribution (not the fine tuning one).

e.g.,

    if tokenizer.pad_token_id is None:
        tokenizer.pad_token = tokenizer.eos_token
...
    raw_text_batch='a'
    tokenize_batch={'input_ids': tensor([[   64, 50256, 50256, 50256, 50256]]), 'attention_mask': tensor([[1, 0, 0, 0, 0]])}

but it would have been better to have

    tokenize_batch={'input_ids': tensor([[   64, 50256, 50256, 50256, 50256]]), 'attention_mask': tensor([[1, 1, 0, 0, 0]])}

code

def test_eos_pad():
    from datasets import load_dataset
    import torch
    from transformers import GPT2Tokenizer, GPT2LMHeadModel

    raw_text_batch = 'a'

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    # print(f'{tokenizer.eos_token=}')
    # print(f'{tokenizer.eos_token_id=}')
    # print(f'{tokenizer.pad_token=}')
    # print(f'{tokenizer.pad_token_id=}')

    # print(f'{raw_text_batch=}')
    # tokenize_batch = tokenizer(raw_text_batch, padding="max_length", max_length=5, truncation=True, return_tensors="pt")
    # print(f'{tokenize_batch=}')

    if tokenizer.pad_token_id is None:
        tokenizer.pad_token = tokenizer.eos_token
    probe_network = GPT2LMHeadModel.from_pretrained("gpt2")
    device = torch.device(f"cuda:{0}" if torch.cuda.is_available() else "cpu")
    probe_network = probe_network.to(device)

    print(f'{tokenizer.eos_token=}')
    print(f'{tokenizer.eos_token_id=}')
    print(f'{tokenizer.pad_token=}')
    print(f'{tokenizer.pad_token_id=}')

    print(f'{raw_text_batch=}')
    tokenize_batch = tokenizer(raw_text_batch, padding="max_length", max_length=5, truncation=True, return_tensors="pt")
    print(f'{tokenize_batch=}')
    print('Done')

Topic		Replies	Views
Mistral trouble when fine-tuning : Don't set pad_token_id = eos_token_id 🤗Transformers	8	5962	August 28, 2024
GPT2 finetuned with eos token will never yield eos token during generation Beginners	6	3385	April 12, 2024
Transformers v3.0.0 is out! 🤗Transformers	0	1941	July 7, 2020
Labels in language modeling: which tokens to set to -100? Beginners	1	3481	November 30, 2020
Issue with finetuning a seq-to-seq model 🤗Transformers	30	3964	August 11, 2022

Why does the falcon QLoRA tutorial code use eos_token as pad_token?

Small GPT2 code example

Related topics