Fine-tuning GPT-2 with a user-defined loss

Hi @aclifton314,
For the Trainer, the dataset is just a normal PyTorch Dataset object; you only need to take care of one thing.
The __getitem__ method of your dataset should return a dict whose keys match the argument names of your model's forward method. For example, for GPT-2 the arguments are input_ids, labels, and attention_mask, so your __getitem__ should return {"input_ids": [...], "labels": [...], "attention_mask": [...]}.
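For example, a minimal dataset along those lines could look like this (an untested sketch; MyDataset and the pre-tokenized examples list are just placeholders, and setting labels equal to input_ids is the usual setup for causal LM fine-tuning, since the model shifts them internally):

import torch
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, examples):
        # examples: list of fixed-length lists of token ids (hypothetical)
        self.examples = examples

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        ids = torch.tensor(self.examples[idx])
        return {"input_ids": ids,
                "labels": ids,
                "attention_mask": torch.ones_like(ids)}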

If for some reason you can’t return such a dict, then you’ll need to provide your own collate function, which takes the batch of examples and returns a dict with keys matching the forward arguments. You can pass that collate function to the data_collator argument of Trainer.
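For example, a collate function along those lines might look like this (untested sketch; my_collate_fn and the key names are placeholders, and it assumes each example already holds equal-length tensors):

import torch

def my_collate_fn(examples):
    # examples is the List[Dict] the DataLoader hands to the collator
    return {key: torch.stack([ex[key] for ex in examples])
            for key in ("input_ids", "attention_mask", "labels")}

# passed to the Trainer via:
# Trainer(model=model, args=training_args,
#         train_dataset=my_dataset, data_collator=my_collate_fn)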

@valhalla Thanks for your reply. Now I am getting the same error as the OP in this post. Here is my DataSet object:

from torch.utils.data import Dataset
import pandas as pd
from transformers import GPT2Tokenizer

class SDAbstractsDataset(Dataset):
    def __init__(self, csv_file):
        self.sd_abstracts_df = pd.read_csv(csv_file, encoding='ISO-8859-1')
        self.tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

    def __len__(self):
        return len(self.sd_abstracts_df)

    def __getitem__(self, idx):
        abstract_text = self.sd_abstracts_df.iloc[idx, 1]
        input_ids = self.tokenizer.encode(abstract_text, return_tensors='pt')
        return {'input_ids': input_ids, 'past': None,
                'attention_mask': None, 'token_type_ids': None,
                'position_ids': None, 'head_mask': None,
                'inputs_embeds': None, 'labels': None,
                'use_cache': True}

The inputs into my forward method are the same:

def forward(
            self,
            input_ids=None,
            past=None,
            attention_mask=None,
            token_type_ids=None,
            position_ids=None,
            head_mask=None,
            inputs_embeds=None,
            labels=None,
            use_cache=True,
    ):

And here is my training setup:

from text_gen_w_transformers.finetune_gpt2 import GPT2FinetunedWithNgrams
from text_gen_w_transformers.custom_dataset import SDAbstractsDataset
from transformers import GPT2Tokenizer, GPT2LMHeadModel, TrainingArguments, Trainer

sd_dataset = SDAbstractsDataset('/path/to/sd_samples_64.csv')

training_args = TrainingArguments(
    output_dir='/path/to/finetuned_gpt2',
    do_train=True,
    per_device_train_batch_size=4,
    learning_rate=1e-3,
    num_train_epochs=1
)

model = GPT2FinetunedWithNgrams.from_pretrained('gpt2')

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=sd_dataset
)

trainer.train()

When that training command runs, I get the following error:

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]
Iteration:   0%|          | 0/16 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/path/to/finetune_test.py", line 35, in <module>
    trainer.train()
  File "/path/to/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 464, in train
    for step, inputs in enumerate(epoch_iterator):
  File "/path/to/anaconda3/lib/python3.7/site-packages/tqdm/std.py", line 1107, in __iter__
    for obj in iterable:
  File "/path/to/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/path/to/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 385, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/path/to/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/path/to/anaconda3/lib/python3.7/site-packages/transformers/data/data_collator.py", line 72, in collate_batch
    for k, v in vars(first).items():
TypeError: vars() argument must have __dict__ attribute

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]
Iteration:   0%|          | 0/16 [00:00<?, ?it/s]

In contrast to that post, my DataSet object returns a dictionary. But if I understand the post correctly, since the Trainer I have instantiated uses (by default) default_data_collator, should I be returning a List[InputExample] from my SDAbstractsDataset? That seems to contradict the next part of the post, which discusses having the DataLoader return a dict with the same key-value pairs that forward expects.

What is your transformers version? You’ll need to upgrade to the latest version, where default_data_collator handles dicts. Also, when you return a dict from your dataset, the collator automatically receives a List[Dict]; the list contains examples from the dataset, and its length is equal to your batch size.
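Concretely, something like this (illustrative only; dataset here stands for your dataset object):

# with per_device_train_batch_size=4 the collator receives
features = [dataset[0], dataset[1], dataset[2], dataset[3]]  # List[Dict], len == batch size
# and default_data_collator then stacks each key, roughly:
# batch[k] = torch.stack([f[k] for f in features])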

Hi @valhalla. I’m using version 2.11.0 which I got from pip install. Will installing from source get me the latest version?

Yes, you can install from source or do pip install -u transformers

The u needs to be uppercase.
pip install -U transformers

Yes, thanks!

Wonderful! Thanks everyone for their assistance through all of this. I feel that we are getting close to the end!

I updated to version 3.0.2 and ran the code, producing the following error:

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]
Iteration:   0%|          | 0/16 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/path/to/finetune_test.py", line 28, in <module>
    trainer.train()
  File "/path/to/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 492, in train
    for step, inputs in enumerate(epoch_iterator):
  File "/path/to/anaconda3/lib/python3.7/site-packages/tqdm/std.py", line 1107, in __iter__
    for obj in iterable:
  File "/path/to/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/path/to/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 385, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/path/to/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/path/to/anaconda3/lib/python3.7/site-packages/transformers/data/data_collator.py", line 61, in default_data_collator
    batch[k] = torch.stack([f[k] for f in features])
RuntimeError: stack expects each tensor to be equal size, but got [1, 149] at entry 0 and [1, 244] at entry 1

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]
Iteration:   0%|          | 0/16 [00:00<?, ?it/s]

I assume this has to do with padding and truncation in the GPT2Tokenizer (which I briefly posted about here).

In my DataSet object, I’m going to guess that I need to set the padding and truncation parameters in

input_ids = self.tokenizer.encode(abstract_text, return_tensors='pt')

I’m running into the exact same issue! Given input_ids, I have a target sequence as my labels. I’ve defined my __getitem__ to return a dict containing {'input_ids':..., 'attention_mask':..., 'labels':...}. I’m encoding both my input_ids and labels the same way that @aclifton314 is. I’m not sure whether, in my collate function, I should be padding input_ids according to the longest sequence in the List[Features], or something like that!

@npsri, I think you’ll want to include padding and truncation. Looking at the Overview section here, there is mention that it’s probably best to pad GPT-2 on the right side of the sequence. Thankfully, this is pretty simple to do with the HF methods! Here’s how I set up my tokenizer:

class SDAbstractsDataset(Dataset):
    def __init__(self, csv_file):
        self.sd_abstracts_df = pd.read_csv(csv_file, encoding='ISO-8859-1')
        self.tokenizer = GPT2Tokenizer.from_pretrained('gpt2', padding_side='right')

    def __len__(self):
        return len(self.sd_abstracts_df)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        abstract_text = self.sd_abstracts_df.iloc[idx, 1]
        encoded_result = self.tokenizer(abstract_text, padding=True, truncation=True, return_tensors='pt', return_attention_mask=True)

        return {'input_ids': encoded_result['input_ids'],
                'past': None,
                'attention_mask': encoded_result['attention_mask'],
                'token_type_ids': None,
                'position_ids': None,
                'head_mask': None,
                'inputs_embeds': None,
                'labels': None,
                'use_cache': True}

You’ll see that in the instantiation of GPT2Tokenizer in __init__ I added the right-side padding. Then, when I tokenize a sequence in __getitem__, I use the padding, truncation, and return_attention_mask parameters.

If you’ve been following this thread, you’ll also notice some changes in this DataSet subclass compared to the one I posted before. Just something to keep in mind as I continue experimenting.

Currently, however, this tokenizer is giving me the following error:

    "Asking to pad but the tokenizer does not have a padding token. "
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.

which makes sense, since:

training_tokenizer = sd_dataset.tokenizer
print('training_tokenizer.pad_token = {}'.format(training_tokenizer.pad_token))

[Out]: training_tokenizer.pad_token = None

So now I’m just looking through the documentation to see if there is any preference about what the padding token should be.

Hmmm, maybe I did something wrong with the padding and truncation in my DataSet object:

from torch.utils.data import Dataset
import pandas as pd
import torch
from transformers import GPT2Tokenizer

class SDAbstractsDataset(Dataset):
    def __init__(self, csv_file):
        self.sd_abstracts_df = pd.read_csv(csv_file, encoding='ISO-8859-1')
        self.tokenizer = GPT2Tokenizer.from_pretrained('gpt2', padding_side='right')
        self.tokenizer.pad_token = self.tokenizer.eos_token

    def __len__(self):
        return len(self.sd_abstracts_df)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        abstract_text = self.sd_abstracts_df.iloc[idx, 1]
        encoded_result = self.tokenizer(abstract_text, padding=True, truncation=True, return_tensors='pt', return_attention_mask=True)

        return {'input_ids': encoded_result['input_ids'],
                'past': None,
                'attention_mask': encoded_result['attention_mask'],
                'token_type_ids': None,
                'position_ids': None,
                'head_mask': None,
                'inputs_embeds': None,
                'labels': None,
                'use_cache': True}

because I’m still getting the same error:

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]
Iteration:   0%|          | 0/16 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/path/to/finetune_test.py", line 38, in <module>
    trainer.train()
  File "/path/to/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 492, in train
    for step, inputs in enumerate(epoch_iterator):
  File "/path/to/anaconda3/lib/python3.7/site-packages/tqdm/std.py", line 1107, in __iter__
    for obj in iterable:
  File "/path/to/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/path/to/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 385, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/path/to/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/path/to/anaconda3/lib/python3.7/site-packages/transformers/data/data_collator.py", line 61, in default_data_collator
    batch[k] = torch.stack([f[k] for f in features])
RuntimeError: stack expects each tensor to be equal size, but got [1, 149] at entry 0 and [1, 244] at entry 1

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]
Iteration:   0%|          | 0/16 [00:00<?, ?it/s]

Any thoughts?

Your tensors are not all padded to the same length; you need to either:

  • decide on a size for all tensors and add padding='max_length', max_length=that_size in your call to the tokenizer, or
  • use a data_collator that applies the tokenizer to the set of texts it receives; when you pass a list of texts with padding=True (as you did), the tokenizer automatically pads them to the length of the longest one.

Thanks for your reply @sgugger, but I am a little bit confused about how to use the DataCollator. As I understand it, the DataCollator will receive a list from the DataSet object. In my case, this list will consist of the dictionaries returned by the SDAbstractsDataset object I’ve shown above, with each dictionary consisting of the following key-value pairs:

return {'input_ids': encoded_result['input_ids'],
        'past': None,
        'attention_mask': encoded_result['attention_mask'],
        'token_type_ids': None,
        'position_ids': None,
        'head_mask': None,
        'inputs_embeds': None,
        'labels': None,
        'use_cache': True}

If I am understanding you correctly, I should have a data collator function that looks something like this:

def sd_data_collator(dataset_samples_list):
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2', padding_side='right')
    tokenizer.pad_token = tokenizer.eos_token
    orig_seqs = []
    for samp in dataset_samples_list:
        tmp_seq = tokenizer.decode(samp['input_ids'])
        orig_seqs.append(tmp_seq)

    encoded_results = tokenizer(orig_seqs, padding=True, truncation=True, return_tensors='pt', return_attention_mask=True)

    batch = {}
    batch['input_ids'] = torch.stack([result['input_ids'] for result in encoded_results])
    batch['past'] = torch.stack([None for result in encoded_results])
    batch['attention_mask'] = torch.stack([result['attention_mask'] for result in encoded_results])
    batch['position_ids'] = torch.stack([None for result in encoded_results])
    batch['head_mask'] = torch.stack([None for result in encoded_results])
    batch['inputs_embeds'] = torch.stack([None for result in encoded_results])
    batch['labels'] = torch.stack([None for result in encoded_results])
    batch['use_cache'] = torch.stack([True for result in encoded_results])
    return batch

I think he is suggesting you parameterise the tokenizer call.

block_size = 512  # desired length.
encoded_result = self.tokenizer(abstract_text, truncation=True, return_tensors='pt', return_attention_mask=True, padding='max_length', max_length=block_size)

No, for the data collator, your dataset would return simple texts (and labels), and the function would receive a list of texts that it can encode together. Then you can return in that function the proper dictionary.
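Something like this, for example (an untested sketch, assuming your dataset returns plain strings; add whatever other keys your forward needs):

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2', padding_side='right')
tokenizer.pad_token = tokenizer.eos_token

def text_collator(texts):
    # texts is the list of raw strings the dataset returned for this batch
    encoded = tokenizer(texts, padding=True, truncation=True,
                        return_tensors='pt', return_attention_mask=True)
    return {'input_ids': encoded['input_ids'],
            'attention_mask': encoded['attention_mask']}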

Thanks to everyone helping out and helping me get things setup right. I really appreciate the feedback and assistance!

@sgugger: Aahh, ok I see what you’re saying. I’ve modified the DataSet object to:

from torch.utils.data import Dataset
import pandas as pd
import torch

class SDAbstractsDataset(Dataset):
    def __init__(self, csv_file):
        self.sd_abstracts_df = pd.read_csv(csv_file, encoding='ISO-8859-1')

    def __len__(self):
        return len(self.sd_abstracts_df)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        abstract_text = self.sd_abstracts_df.iloc[idx, 1]
        return abstract_text

and the DataCollator function as:

def sd_data_collator(dataset_samples_list):
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2', padding_side='right')
    tokenizer.pad_token = tokenizer.eos_token

    encoded_results = tokenizer(dataset_samples_list, padding=True, truncation=True, return_tensors='pt', return_attention_mask=True)

    batch = {}
    batch['input_ids'] = torch.stack([result for result in encoded_results['input_ids']])
    batch['past'] = [None for result in encoded_results]
    batch['attention_mask'] = torch.stack([result for result in encoded_results['attention_mask']])
    batch['position_ids'] = [None for result in encoded_results]
    batch['head_mask'] = [None for result in encoded_results]
    batch['inputs_embeds'] = [None for result in encoded_results]
    batch['labels'] = [None for result in encoded_results]
    batch['use_cache'] = [True for result in encoded_results]
    return batch

and that gets passed to the Trainer object. However, I get the following error:

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]
Iteration:   0%|          | 0/16 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/path/to/finetune_test.py", line 55, in <module>
    trainer.train()
  File "/path/to/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 499, in train
    tr_loss += self._training_step(model, inputs, optimizer)
  File "/path/to/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 622, in _training_step
    outputs = model(**inputs)
  File "/path/to/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/path/to/ric-2020/text_gen_w_transformers/finetune_gpt2.py", line 46, in forward
    orig_input_str = self.tokenizer.decode(input_ids, skip_special_tokens=True)
  File "/path/to/anaconda3/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 688, in decode
    filtered_tokens = self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens)
  File "/path/to/anaconda3/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 666, in convert_ids_to_tokens
    index = int(index)
ValueError: only one element tensors can be converted to Python scalars

@swayso: I get the same error when adding in the max_length arguments to the previous tokenizer setup but with no DataCollator:

class SDAbstractsDataset(Dataset):
    def __init__(self, csv_file):
        self.sd_abstracts_df = pd.read_csv(csv_file, encoding='ISO-8859-1')
        self.tokenizer = GPT2Tokenizer.from_pretrained('gpt2', padding_side='right')
        self.tokenizer.pad_token = self.tokenizer.eos_token

    def __len__(self):
        return len(self.sd_abstracts_df)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        abstract_text = self.sd_abstracts_df.iloc[idx, 1]
        encoded_result = self.tokenizer(abstract_text, padding='max_length', max_length=512, truncation=True, return_tensors='pt', return_attention_mask=True)
        return {'input_ids': encoded_result['input_ids'],
                'past': None,
                'attention_mask': encoded_result['attention_mask'],
                'token_type_ids': None,
                'position_ids': None,
                'head_mask': None,
                'inputs_embeds': None,
                'labels': None,
                'use_cache': True}

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]
Iteration:   0%|          | 0/16 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/path/to/finetune_test.py", line 55, in <module>
    trainer.train()
  File "/path/to/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 499, in train
    tr_loss += self._training_step(model, inputs, optimizer)
  File "/path/to/anaconda3/lib/python3.7/site-packages/transformers/trainer.py", line 622, in _training_step
    outputs = model(**inputs)
  File "/path/to/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/path/to/ric-2020/text_gen_w_transformers/finetune_gpt2.py", line 46, in forward
    orig_input_str = self.tokenizer.decode(input_ids, skip_special_tokens=True)
  File "/path/to/anaconda3/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 688, in decode
    filtered_tokens = self.convert_ids_to_tokens(token_ids, skip_special_tokens=skip_special_tokens)
  File "/path/to/anaconda3/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 666, in convert_ids_to_tokens
    index = int(index)
ValueError: only one element tensors can be converted to Python scalars

The issue now is when you decode tensors using tokenizer.decode. Note that this method expects a 1-d tensor or a list of ints; I don’t think it works on a batch. I don’t see which part of your code uses that, so I can’t advise more.
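In general, something like this (illustrative, with input_ids being a 2-d batch tensor):

# decode works on a single sequence (1-d tensor or list of ints):
text = tokenizer.decode(input_ids[0], skip_special_tokens=True)

# for a whole batch, use batch_decode (or loop over the rows):
texts = tokenizer.batch_decode(input_ids, skip_special_tokens=True)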

Yes, there are two decode lines in my forward method:

def forward(
        self,
        input_ids=None,
        past=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        labels=None,
        use_cache=True,
):
    temperature = 0.92
    tmp_input_ids = input_ids
    max_gen_length = 30
    counter = 0
    orig_input_str = self.tokenizer.decode(input_ids, skip_special_tokens=True)
    strs_to_join = orig_input_str.split()
    while counter < max_gen_length:
        transformer_outputs = self.transformer(
            tmp_input_ids,
            past=past,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
        )
        
        hidden_states = transformer_outputs[0]
        lm_logits = self.lm_head(hidden_states) / temperature
    
        last_token = lm_logits[:, -1]
        last_token_softmax = torch.softmax(last_token, dim=-1).squeeze()
        next_token = torch.multinomial(last_token_softmax, num_samples=1)
        next_gen_token_str = self.tokenizer.decode(next_token,
                                                   clean_up_tokenization_spaces=True).strip()
        strs_to_join.append(next_gen_token_str)
        new_str_input = ' '.join(strs_to_join)
        tmp_input_ids = self.tokenizer.encode(new_str_input, return_tensors='pt')
        counter += 1

    loss = self.ngrams_model.sentence_loss(new_str_input)
    return (loss, lm_logits)

I’ll go through and make those fixes and give it another go. Thanks again, truly, for your help!

@aclifton314, just to confirm, gradients cannot flow through sampling methods. If you are sampling at any stage to get token ids and then pass those ids into your n-gram model, your gradients will be zero from that point onwards.

gradients cannot flow through the line:

next_token = torch.multinomial(last_token_softmax, num_samples=1)

so the loss function will not be optimized.
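You can see this with a quick check (illustrative):

import torch

logits = torch.randn(1, 50257, requires_grad=True)
probs = torch.softmax(logits, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)

print(probs.grad_fn)       # has a grad_fn: still connected to logits
print(next_token.grad_fn)  # None: sampled integer ids carry no gradient history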

@chrisdoyleIE, I’ve updated the method a bit since that last post, so let me detail it here:

I generate a sequence of a given size based on the current state of the model:

full_generated_gpt2_ids = self.generate(input_ids=tmp_input_ids,
                                                max_length=max_length,
                                                is_finetuning_current_model=True,
                                                attention_mask=attention_mask,
                                                do_sample=True,
                                                top_k=50,
                                                top_p=0.95,
                                                pad_token_id=50256
                                                )

Currently, I require about 10 sentences, so I do a series of checks in a loop, generating more text from the current model until I have that many. Once I do, I run the following:

gen_samples_tensor = torch.stack(gen_samples)
decoded_gen_samples = self.tokenizer.batch_decode(gen_samples_tensor, skip_special_tokens=True)

Then I calculate the losses of the generated samples wrt my n-grams model, average them, and return the loss:

tmp_losses = [self.ngrams_model.sentence_loss(decoded_sample) for decoded_sample in decoded_gen_samples]
losses = torch.tensor(tmp_losses, requires_grad=True)
loss = losses.mean()
return (loss,)

I’m not quite sure that I fully understand your reply, but is it still the case that gradients won’t flow through the sampling line:

full_generated_gpt2_ids = self.generate(input_ids=tmp_input_ids,
                                                max_length=max_length,
                                                is_finetuning_current_model=True,
                                                attention_mask=attention_mask,
                                                do_sample=True,
                                                top_k=50,
                                                top_p=0.95,
                                                pad_token_id=50256
                                                )