Why does the falcon QLoRA tutorial code use eos_token as pad_token?

Small GPT2 code example

Yes I agree that pad is assigned to eos. Eos is still eos. But during fine-tuning now the weights wrt to eos are unchanged. This might be an issue since the probability of eos has not shifted to the fine-tuning regime. One possibility is that eos is outputed with less chance. Yes we can still halt production when we see eos but we’ve not shifted the probability to output eos according to our fine-tuning distribution – but all other tokens have changed distribution. I think this could be an issue because it’s not like the old probability of eos is conserved since all tokens probs have changed except eos + even if the old eos prob was conserved, it’s wrt wrong distribution (not the fine tuning one).

e.g.,

    if tokenizer.pad_token_id is None:
        tokenizer.pad_token = tokenizer.eos_token
...
    raw_text_batch='a'
    tokenize_batch={'input_ids': tensor([[   64, 50256, 50256, 50256, 50256]]), 'attention_mask': tensor([[1, 0, 0, 0, 0]])}

but it would have been better to have

    tokenize_batch={'input_ids': tensor([[   64, 50256, 50256, 50256, 50256]]), 'attention_mask': tensor([[1, 1, 0, 0, 0]])}

code

def test_eos_pad():
    from datasets import load_dataset
    import torch
    from transformers import GPT2Tokenizer, GPT2LMHeadModel

    raw_text_batch = 'a'

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    # print(f'{tokenizer.eos_token=}')
    # print(f'{tokenizer.eos_token_id=}')
    # print(f'{tokenizer.pad_token=}')
    # print(f'{tokenizer.pad_token_id=}')

    # print(f'{raw_text_batch=}')
    # tokenize_batch = tokenizer(raw_text_batch, padding="max_length", max_length=5, truncation=True, return_tensors="pt")
    # print(f'{tokenize_batch=}')

    if tokenizer.pad_token_id is None:
        tokenizer.pad_token = tokenizer.eos_token
    probe_network = GPT2LMHeadModel.from_pretrained("gpt2")
    device = torch.device(f"cuda:{0}" if torch.cuda.is_available() else "cpu")
    probe_network = probe_network.to(device)

    print(f'{tokenizer.eos_token=}')
    print(f'{tokenizer.eos_token_id=}')
    print(f'{tokenizer.pad_token=}')
    print(f'{tokenizer.pad_token_id=}')

    print(f'{raw_text_batch=}')
    tokenize_batch = tokenizer(raw_text_batch, padding="max_length", max_length=5, truncation=True, return_tensors="pt")
    print(f'{tokenize_batch=}')
    print('Done')
1 Like