Finetuning GPT2 with a user-defined loss

What is your transformers version? You’ll need to upgrade to the latest release, where default_data_collator handles dicts. Also, when your dataset returns a dict, the collator receives a List[Dict]: the list contains examples from the dataset, and its length equals your batch size.
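
As a rough illustration of that contract (a minimal sketch, not code from this thread; the toy dataset and field names are made up):

import torch
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        # Each example is a dict of tensors, like your dataset returns.
        return {'input_ids': torch.tensor([idx, idx + 1]), 'labels': torch.tensor(idx)}

def inspect_collator(features):
    # `features` is a List[Dict]: one dict per example, len(features) == batch size.
    print(type(features), len(features), list(features[0].keys()))
    return {k: torch.stack([f[k] for f in features]) for k in features[0]}

loader = DataLoader(ToyDataset(), batch_size=4, collate_fn=inspect_collator)
batch = next(iter(loader))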

Hi @valhalla. I’m using version 2.11.0 which I got from pip install. Will installing from source get me the latest version?

Yes, you can install from source or do pip install -u transformers

The u needs to be uppercase.
pip install -U transformers

Yes, thanks!

Wonderful! Thanks, everyone, for your assistance through all of this. I feel we’re getting close to the end!

I updated to version 3.0.2 and ran the code, producing the following error:

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]
Iteration:   0%|          | 0/16 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/path/to/finetune_test.py", line 28, in <module>
    trainer.train()
  File "/path/to/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 492, in train
    for step, inputs in enumerate(epoch_iterator):
  File "/path/to/anaconda3/lib/python3.7/site-packages/tqdm/std.py", line 1107, in __iter__
    for obj in iterable:
  File "/path/to/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/path/to/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 385, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/path/to/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/path/to/anaconda3/lib/python3.7/site-packages/transformers/data/data_collator.py", line 61, in default_data_collator
    batch[k] = torch.stack([f[k] for f in features])
RuntimeError: stack expects each tensor to be equal size, but got [1, 149] at entry 0 and [1, 244] at entry 1

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]
Iteration:   0%|          | 0/16 [00:00<?, ?it/s]

I assume this has to do with padding and truncating in the GPT2Tokenizer (which I briefly posted about here).

In my DataSet object, I’m going to guess that I need to set the padding and truncation parameters in

input_ids = self.tokenizer.encode(abstract_text, return_tensors='pt')
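
Roughly, I imagine the call inside __getitem__ changing to something like this (untested sketch; 512 is an arbitrary length, and I suspect the tokenizer will also need a pad token set):

# Inside SDAbstractsDataset.__getitem__ (sketch, untested)
input_ids = self.tokenizer.encode(
    abstract_text,
    padding='max_length',   # pad every example to the same fixed length
    max_length=512,         # arbitrary length for this sketch
    truncation=True,
    return_tensors='pt',
)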

I’m running into the exact same issue! Given input_ids, I have a target sequence as my labels. I’ve defined my __getitem__ to return a dict containing {'input_ids':..., 'attention_mask':..., 'labels':...}. I’m encoding both my input_ids and labels the same way that @aclifton314 is. I’m not sure whether, in my collate function, I should be padding input_ids to the longest sequence in the List[Features], or something like that!

@npsri, I think you’ll want to include padding and truncation. Looking at the Overview section here, there is mention that it’s probably best to pad GPT-2 on the right of the sequence. Thankfully, this is pretty simple to do with the HF methods! Here’s how I set up my tokenizer:

from torch.utils.data import Dataset
from transformers import GPT2Tokenizer
import pandas as pd
import torch

class SDAbstractsDataset(Dataset):
    def __init__(self, csv_file):
        self.sd_abstracts_df = pd.read_csv(csv_file, encoding='ISO-8859-1')
        self.tokenizer = GPT2Tokenizer.from_pretrained('gpt2', padding_side='right')

    def __len__(self):
        return len(self.sd_abstracts_df)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        abstract_text = self.sd_abstracts_df.iloc[idx, 1]
        encoded_result = self.tokenizer(abstract_text, padding=True, truncation=True, return_tensors='pt', return_attention_mask=True)

        return {'input_ids': encoded_result['input_ids'],
                'past': None,
                'attention_mask': encoded_result['attention_mask'],
                'token_type_ids': None,
                'position_ids': None,
                'head_mask': None,
                'inputs_embeds': None,
                'labels': None,
                'use_cache': True}

You’ll see the instantiation of GPT2Tokenizer in __init__, where I added the right-side padding. Then, when I tokenize a sequence in __getitem__, I use the padding, truncation, and return_attention_mask parameters.

If you’ve been following this post, you’ll also see some changes to this Dataset subclass compared to the one I posted before. Just something to keep in mind as I continue experimenting.

Currently, however, this tokenizer is giving me the following error:

    "Asking to pad but the tokenizer does not have a padding token. "
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.

which makes sense, since:

training_tokenizer = sd_dataset.tokenizer
print('training_tokenizer.pad_token = {}'.format(training_tokenizer.pad_token))

[Out]: training_tokenizer.pad_token = None

So now I’m just looking through the documentation to see if there is any preference about what the padding token should be.
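
From the error message itself, the two usual options seem to be reusing the EOS token or adding a brand-new pad token (the latter also means resizing the model’s embeddings; this is just a sketch of those two options, with tokenizer and model standing for whatever objects are in scope):

# Option 1: reuse the end-of-text token as the pad token.
tokenizer.pad_token = tokenizer.eos_token

# Option 2: add a dedicated [PAD] token; the embedding matrix then needs
# to grow to match the enlarged vocabulary.
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.resize_token_embeddings(len(tokenizer))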

Hmmm, maybe I did something wrong with the padding and truncation in my DataSet object:

from torch.utils.data import Dataset
import pandas as pd
import torch
from transformers import GPT2Tokenizer

class SDAbstractsDataset(Dataset):
    def __init__(self, csv_file):
        self.sd_abstracts_df = pd.read_csv(csv_file, encoding='ISO-8859-1')
        self.tokenizer = GPT2Tokenizer.from_pretrained('gpt2', padding_side='right')
        self.tokenizer.pad_token = self.tokenizer.eos_token

    def __len__(self):
        return len(self.sd_abstracts_df)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        abstract_text = self.sd_abstracts_df.iloc[idx, 1]
        encoded_result = self.tokenizer(abstract_text, padding=True, truncation=True, return_tensors='pt', return_attention_mask=True)

        return {'input_ids': encoded_result['input_ids'],
                'past': None,
                'attention_mask': encoded_result['attention_mask'],
                'token_type_ids': None,
                'position_ids': None,
                'head_mask': None,
                'inputs_embeds': None,
                'labels': None,
                'use_cache': True}

because I’m still getting the same error:

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]
Iteration:   0%|          | 0/16 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/path/to/finetune_test.py", line 38, in <module>
    trainer.train()
  File "/path/to/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 492, in train
    for step, inputs in enumerate(epoch_iterator):
  File "/path/to/anaconda3/lib/python3.7/site-packages/tqdm/std.py", line 1107, in __iter__
    for obj in iterable:
  File "/path/to/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/path/to/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 385, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/path/to/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/path/to/anaconda3/lib/python3.7/site-packages/transformers/data/data_collator.py", line 61, in default_data_collator
    batch[k] = torch.stack([f[k] for f in features])
RuntimeError: stack expects each tensor to be equal size, but got [1, 149] at entry 0 and [1, 244] at entry 1

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]
Iteration:   0%|          | 0/16 [00:00<?, ?it/s]

Any thoughts?

Your tensors are not all padded to the same length: you either need to

  • decide on a size for all tensors and add padding='max_length', max_length=that_size in your call to the tokenizer
    or
  • use a data_collator that applies the tokenizer to the list of texts it receives: when you pass a list of texts with padding=True (as you did), the tokenizer automatically pads them all to the length of the longest one. The quick check below illustrates why padding=True inside __getitem__ had no effect.
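
For instance, a quick check (a minimal sketch, not code from this thread) shows the difference between tokenizing a single text and a list of texts with padding=True:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token

# A single text: "longest" padding has nothing to pad against,
# so the output length is just that text's own length.
single = tokenizer("a short abstract", padding=True)
print(len(single['input_ids']))

# A list of texts: every sequence is padded to the longest one in the list.
batch = tokenizer(["a short abstract", "a noticeably longer abstract goes right here"], padding=True)
print([len(ids) for ids in batch['input_ids']])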

Thanks for your reply @sgugger, but I am a little bit confused about how to use the DataCollator. As I understand it, the DataCollator will receive a list from the Dataset object. In my case, this list will consist of the dictionaries returned by the SDAbstractsDataset object I’ve shown above, with each dictionary consisting of the following key-value pairs:

return {'input_ids': encoded_result['input_ids'],
        'past': None,
        'attention_mask': encoded_result['attention_mask'],
        'token_type_ids': None,
        'position_ids': None,
        'head_mask': None,
        'inputs_embeds': None,
        'labels': None,
        'use_cache': True}

If I am understanding you correctly, I should have a data collator function that looks something like this:

def sd_data_collator(dataset_samples_list):
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2', padding_side='right')
    tokenizer.pad_token = tokenizer.eos_token
    orig_seqs = []
    for samp in dataset_samples_list:
        tmp_seq = tokenizer.decode(samp['input_ids'])
        orig_seqs.append(tmp_seq)

    encoded_results = tokenizer(orig_seqs, padding=True, truncation=True, return_tensors='pt', return_attention_mask=True)

    batch = {}
    batch['input_ids'] = torch.stack([result['input_ids'] for result in encoded_results])
    batch['past'] = torch.stack([None for result in encoded_results])
    batch['attention_mask'] = torch.stack([result['attention_mask'] for result in encoded_results])
    batch['position_ids'] = torch.stack([None for result in encoded_results])
    batch['head_mask'] = torch.stack([None for result in encoded_results])
    batch['inputs_embeds'] = torch.stack([None for result in encoded_results])
    batch['labels'] = torch.stack([None for result in encoded_results])
    batch['use_cache'] = torch.stack([True for result in encoded_results])
    return batch

I think he is suggesting you parameterise the tokenizer call.

block_size = 512  # desired length.
encoded_result = self.tokenizer(abstract_text, truncation=True, return_tensors='pt', return_attention_mask=True, padding='max_length', max_length=block_size)

No, for the data collator, your dataset would return simple texts (and labels), and the function would receive a list of texts that it can encode together. Then you can return in that function the proper dictionary.
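
In other words, a rough sketch of that idea (not my exact code; collate_texts is a placeholder name and the tokenizer setup just mirrors the earlier posts) could look like this:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2', padding_side='right')
tokenizer.pad_token = tokenizer.eos_token

def collate_texts(texts):
    # `texts` is the list of raw strings returned by the dataset's __getitem__.
    encoded = tokenizer(texts, padding=True, truncation=True,
                        return_tensors='pt', return_attention_mask=True)
    # Return only the tensors the model's forward actually needs.
    return {'input_ids': encoded['input_ids'],
            'attention_mask': encoded['attention_mask']}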

Thanks to everyone helping out and helping me get things set up right. I really appreciate the feedback and assistance!

@sgugger: Aahh, ok I see what you’re saying. I’ve modified the DataSet object to:

from torch.utils.data import Dataset
import pandas as pd
import torch

class SDAbstractsDataset(Dataset):
    def __init__(self, csv_file):
        self.sd_abstracts_df = pd.read_csv(csv_file, encoding='ISO-8859-1')

    def __len__(self):
        return len(self.sd_abstracts_df)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        abstract_text = self.sd_abstracts_df.iloc[idx, 1]
        return abstract_text

and the DataCollator function as:

def sd_data_collator(dataset_samples_list):
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2', padding_side='right')
    tokenizer.pad_token = tokenizer.eos_token

    encoded_results = tokenizer(dataset_samples_list, padding=True, truncation=True, return_tensors='pt', return_attention_mask=True)

    batch = {}
    batch['input_ids'] = torch.stack([result for result in encoded_results['input_ids']])
    batch['past'] = [None for result in encoded_results]
    batch['attention_mask'] = torch.stack([result for result in encoded_results['attention_mask']])
    batch['position_ids'] = [None for result in encoded_results]
    batch['head_mask'] = [None for result in encoded_results]
    batch['inputs_embeds'] = [None for result in encoded_results]
    batch['labels'] = [None for result in encoded_results]
    batch['use_cache'] = [True for result in encoded_results]
    return batch

and that gets passed to the Trainer object. However, I get the following error:

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]
Iteration:   0%|          | 0/16 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/path/to/finetune_test.py", line 55, in <module>
    trainer.train()
  File "/path/to/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 499, in train
    tr_loss += self._training_step(model, inputs, optimizer)
  File "/path/to/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 622, in _training_step
    outputs = model(**inputs)
  File "/path/to/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/path/to/ric-2020/text_gen_w_transformers/finetune_gpt2.py", line 46, in forward
    orig_input_str = self.tokenizer.decode(input_ids, skip_special_tokens=True)
  File "/path/to/anaconda3/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 688, in decode
    filtered_tokens = self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens)
  File "/path/to/anaconda3/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 666, in convert_ids_to_tokens
    index = int(index)
ValueError: only one element tensors can be converted to Python scalars

@swayso: I get the same error when adding in the max_length arguments to the previous tokenizer setup but with no DataCollator:

class SDAbstractsDataset(Dataset):
    def __init__(self, csv_file):
        self.sd_abstracts_df = pd.read_csv(csv_file, encoding='ISO-8859-1')
        self.tokenizer = GPT2Tokenizer.from_pretrained('gpt2', padding_side='right')
        self.tokenizer.pad_token = self.tokenizer.eos_token

    def __len__(self):
        return len(self.sd_abstracts_df)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        abstract_text = self.sd_abstracts_df.iloc[idx, 1]
        encoded_result = self.tokenizer(abstract_text, padding='max_length', max_length=512, truncation=True, return_tensors='pt', return_attention_mask=True)
        return {'input_ids': encoded_result['input_ids'],
                'past': None,
                'attention_mask': encoded_result['attention_mask'],
                'token_type_ids': None,
                'position_ids': None,
                'head_mask': None,
                'inputs_embeds': None,
                'labels': None,
                'use_cache': True}

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]
Iteration:   0%|          | 0/16 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/path/to/finetune_test.py", line 55, in <module>
    trainer.train()
  File "/path/to/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 499, in train
    tr_loss += self._training_step(model, inputs, optimizer)
  File "/path/to/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 622, in _training_step
    outputs = model(**inputs)
  File "/path/to/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/path/to/ric-2020/text_gen_w_transformers/finetune_gpt2.py", line 46, in forward
    orig_input_str = self.tokenizer.decode(input_ids, skip_special_tokens=True)
  File "/path/to/anaconda3/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 688, in decode
    filtered_tokens = self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens)
  File "/path/to/anaconda3/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 666, in convert_ids_to_tokens
    index = int(index)
ValueError: only one element tensors can be converted to Python scalars

The issue now is where you decode tensors using tokenizer.decode. Note that this method expects a 1-d tensor or a list of ints; I don’t think it works on a batch. I don’t see which part of your code uses it, so I can’t advise more.
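
For example (a small illustration, not from your code; tokenizer and input_ids stand for whatever is in scope), you would decode one sequence at a time, or use batch_decode on a 2-d batch:

# input_ids: LongTensor of shape (batch_size, seq_len)
texts = [tokenizer.decode(ids, skip_special_tokens=True) for ids in input_ids]

# Equivalent one-liner on recent tokenizer versions:
texts = tokenizer.batch_decode(input_ids, skip_special_tokens=True)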

Yes, there are two decode lines in my forward method:

def forward(
        self,
        input_ids=None,
        past=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        labels=None,
        use_cache=True,
):
    temperature = 0.92
    tmp_input_ids = input_ids
    max_gen_length = 30
    counter = 0
    orig_input_str = self.tokenizer.decode(input_ids, skip_special_tokens=True)
    strs_to_join = orig_input_str.split()
    while counter < max_gen_length:
        transformer_outputs = self.transformer(
            tmp_input_ids,
            past=past,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
        )
        
        hidden_states = transformer_outputs[0]
        lm_logits = self.lm_head(hidden_states) / temperature
    
        last_token = lm_logits[:, -1]
        last_token_softmax = torch.softmax(last_token, dim=-1).squeeze()
        next_token = torch.multinomial(last_token_softmax, num_samples=1)
        next_gen_token_str = self.tokenizer.decode(next_token,
                                                   clean_up_tokenization_spaces=True).strip()
        strs_to_join.append(next_gen_token_str)
        new_str_input = ' '.join(strs_to_join)
        tmp_input_ids = self.tokenizer.encode(new_str_input, return_tensors='pt')
        counter += 1

    loss = self.ngrams_model.sentence_loss(new_str_input)
    return (loss, lm_logits)

I’ll go through and make those fixes and give it another go. Thanks again, truly, for your help!

@aclifton314, just to confirm: gradients cannot flow through sampling methods. If you are sampling at any stage to get token ids, and then pass those ids into your n-gram model, your gradients will be zero from that point onwards.

Gradients cannot flow through the line:

next_token = torch.multinomial(last_token_softmax, num_samples=1)

so the loss function will not be optimized.
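
A tiny check (a sketch, not from the thread’s code) makes this concrete: the sampled ids carry no gradient information back to the logits:

import torch

logits = torch.randn(1, 10, requires_grad=True)
probs = torch.softmax(logits, dim=-1)

# multinomial returns integer token ids; they are not differentiable with
# respect to `probs`, so the autograd graph stops here.
next_token = torch.multinomial(probs, num_samples=1)
print(next_token.dtype, next_token.requires_grad)  # torch.int64 False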

@chrisdoyleIE, I’ve updated the method a bit since that last post so let me detail it here:

I generate a sequence of a given size based on the current state of the model:

full_generated_gpt2_ids = self.generate(input_ids=tmp_input_ids,
                                                max_length=max_length,
                                                is_finetuning_current_model=True,
                                                attention_mask=attention_mask,
                                                do_sample=True,
                                                top_k=50,
                                                top_p=0.95,
                                                pad_token_id=50256
                                                )

Currently, I require about 10 sentences, so I run a loop of checks that keeps generating more text from the current model until I have that many. Once I do, I run the following:

gen_samples_tensor = torch.stack(gen_samples)
decoded_gen_samples = self.tokenizer.batch_decode(gen_samples_tensor, skip_special_tokens=True)

Then I calculate the losses of the generated samples wrt my n-grams model, average them, and return the loss:

tmp_losses = [self.ngrams_model.sentence_loss(decoded_sample) for decoded_sample in decoded_gen_samples]
losses = torch.tensor(tmp_losses, requires_grad=True)
loss = losses.mean()
return (loss,)

I’m not quite sure that I fully understand your reply, but is it still the case that gradients won’t flow through the sampling line:

full_generated_gpt2_ids = self.generate(input_ids=tmp_input_ids,
                                                max_length=max_length,
                                                is_finetuning_current_model=True,
                                                attention_mask=attention_mask,
                                                do_sample=True,
                                                top_k=50,
                                                top_p=0.95,
                                                pad_token_id=50256
                                                )

@aclifton314 yup, it’s still the case that the gradients won’t flow through the sampling line. Check out this post.

@chrisdoyleIE, if I follow the post, it boils down to the sampling methods being non-differentiable. Is this correct? If so, I wonder about ReLU, since it is not differentiable at 0. Would that suggest it is sufficient for the functions involved in backpropagation to be differentiable in a neighborhood of the current weight values?

I glanced back over the GPT2 and “Attention is all you need” papers and it appears that the decoder used in both papers utilizes something like argmax to sample from the probability distribution to generate the next token for calculating the loss. Maybe I am missing something, but that would seem to indicate that they would run into the same issue raised here.

I’m sure I’m missing something, so feel free to walk us through it, as you have studied this material in various contexts.