Finetuning GPT2 with a user-defined loss

I have a dataset of scientific abstracts that I would like to use to finetune GPT2. However, I want to adjust the weights using a loss computed between GPT2’s output and an N-grams model I have. Is it possible to do this using huggingface transformers, and if so, how? Thank you in advance!

EDIT:
Let me be a little more explicit. I would like to take the base gpt2 model and finetune it for text generation on my dataset of scientific abstracts. However, I would like to replace the loss function that the base gpt2 model uses with my own, based on an N-grams model I have. Ultimately, I would like the finetuned model to generate scientific-sounding abstracts of a given length from an initial sentence or two.

GPT2’s forward has a labels argument that you can use to automatically get the standard LM loss, but you don’t have to use this. You can take the model outputs and define any loss you’d like, whether using PyTorch or TF2. If you want to use Trainer, just define your own PT module that returns your custom loss as the first element from forward. See training and fine-tuning and how to train a language model.
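
For example, a minimal sketch of that pattern (the class name and my_custom_loss are placeholders, not part of the library):

import torch.nn as nn
from transformers import GPT2LMHeadModel

class GPT2WithCustomLoss(nn.Module):
    def __init__(self):
        super().__init__()
        self.gpt2 = GPT2LMHeadModel.from_pretrained('gpt2')

    def forward(self, input_ids=None, attention_mask=None, **kwargs):
        outputs = self.gpt2(input_ids=input_ids, attention_mask=attention_mask)
        logits = outputs[0]
        # my_custom_loss is a placeholder for whatever loss you define,
        # e.g. one computed against an n-grams model
        loss = my_custom_loss(logits, input_ids)
        return (loss, logits)

Trainer takes the loss from the first element of the tuple returned by forward.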


One caveat of using your own nn.Module with Trainer is that the save function checks which kind of network has been passed via isinstance(self.model, PreTrainedModel), and if it is not a PreTrainedModel (like a plain nn.Module here, or in many cases where users define their own), training stops. One thing that I’d like to propose is to support both and warn the user that some functionality which any module inheriting from PreTrainedModel provides won’t work.

So you’ll have to redefine save as well. On top of that, AutoModel.from_pretrained won’t work directly if you pass the path, since it expects the saved model to be an instance of PreTrainedModel, so you’ll have to load the weights manually using torch.load.
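
Concretely, the manual route looks something like this sketch (paths and the MyCustomModule name are placeholders):

import torch

# save the custom module's weights yourself, since Trainer's save assumes a PreTrainedModel
torch.save(model.state_dict(), '/path/to/checkpoint/pytorch_model.bin')

# later, rebuild the module and load the weights manually
# instead of going through AutoModel.from_pretrained
model = MyCustomModule()
model.load_state_dict(torch.load('/path/to/checkpoint/pytorch_model.bin'))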

I can send a PR @joeddav if this seems like a good idea.

Hi @prajjwal1, I think you should be able to do this by inheriting from GPT2PreTrainedModel instead of nn.Module, and if all you want to do is replace the loss function, you can take the code for GPT2LMHeadModel and just replace the loss calculation part.

I meant to say that it might not always be feasible or desirable to inherit from the PreTrainedModel class. As you suggest, inheriting from GPT2PreTrainedModel is too specific. It would be good to have compatibility with plain nn.Module.

@prajjwal1 could you first open a GitHub issue about this, tagging @julien-c and linking to this discussion?

Thank you very much for the feedback everyone! I’m new to huggingface transformers and pytorch in general so I apologize for the naive questions. I’ve written up some code that I think might accomplish what I am seeking to do (except for the training loop):

from transformers import GPT2LMHeadModel, AdamW
from transformers.modeling_outputs import CausalLMOutputWithPast

class GPT2Finetuned(GPT2LMHeadModel):
    def __init__(self):
        super().from_pretrained('gpt2')

    def forward(
        self,
        input_ids=None,
        past=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        labels=None,
        use_cache=None,
        output_attentions=None,
        output_hidden_states=None,
        return_tuple=None,
    ):
        return_tuple = return_tuple if return_tuple is not None else self.config.use_return_tuple

        transformer_outputs = self.transformer(
            input_ids,
            past=past,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_tuple=return_tuple,
        )
        hidden_states = transformer_outputs[0]
        lm_logits = self.lm_head(hidden_states)
        loss = None

        #HERE I WOULD CALCULATE THE LOSS BETWEEN THE LM_LOGITS AND MY NGRAMS MODEL??

        return CausalLMOutputWithPast(
            loss=loss,
            logits=lm_logits,
            past_key_values=transformer_outputs.past_key_values,
            hidden_states=transformer_outputs.hidden_states,
            attentions=transformer_outputs.attentions,
        )


model = GPT2Finetuned()
model.train()
optimizer = AdamW(model.parameters(), lr=1e-3)
for step in range(100):
    ????

That’s about as far as I got from looking at the different examples and the GPT2LMHeadModel class. What else needs to be added/changed in order to finetune GPT2 as described above?

You need to create a model which inherits from PreTrainedModel (you’ve done that). In the forward, you have to define how you want the loss to be computed, then return the loss along with the logits. If you do it this way, you can use the Trainer directly.

I can file an issue, but I was thinking that if a user creates their custom model by inheriting from nn.Module, then AutoModel.from_pretrained won’t work directly if the user passes a path, because the model doesn’t inherit from the PreTrainedModel class (even though PreTrainedModel itself inherits from nn.Module). The examples would then have to be modified to load the weights using torch.load. I’m not sure if this is something everyone would want. For now, I inherit from Trainer and define my own save method.
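
In case it helps, the override looks roughly like this (a sketch; the exact method to override may differ between versions):

import os
import torch
from transformers import Trainer

class CustomSaveTrainer(Trainer):
    # save a plain nn.Module's weights instead of requiring a PreTrainedModel
    def save_model(self, output_dir=None):
        output_dir = output_dir if output_dir is not None else self.args.output_dir
        os.makedirs(output_dir, exist_ok=True)
        torch.save(self.model.state_dict(), os.path.join(output_dir, 'pytorch_model.bin'))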


Sorry for the really basic questions, but I am truly messing this up. The following code:

from transformers import GPT2LMHeadModel

class GPT2FinetunedWithNgrams(GPT2LMHeadModel):
    def __init__(self):
        super().from_pretrained('gpt2')

model = GPT2FinetunedWithNgrams()

gives me the following error:

Traceback (most recent call last):
  File "/home/aclifton/ric-2020/text_gen_w_transformers/finetune_gpt2.py", line 9, in <module>
    model = GPT2FinetunedWithNgrams()
  File "/home/aclifton/ric-2020/text_gen_w_transformers/finetune_gpt2.py", line 7, in __init__
    super().from_pretrained('gpt2')
  File "/home/aclifton/anaconda3/lib/python3.7/site-packages/transformers/modeling_utils.py", line 675, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
TypeError: __init__() takes 1 positional argument but 2 were given

I’m not sure what to do here. Any suggestions?

super().from_pretrained('gpt2')

This line does not make much sense. If you want to inherit from GPT2LMHeadModel, then just do:

class GPT2FinetunedWithNgrams(GPT2LMHeadModel):
    def __init__(self, config):
        super().__init__(config)
        # your additional code here

and then:

model = GPT2FinetunedWithNgrams.from_pretrained("gpt2")

If you want to change the loss function you will have to override the forward function here.

@patrickvonplaten, thank you for that. I was scratching my head about that for a bit, hahaha. Also, thank you for pointing out that the forum is better suited for this type of question than GitHub. I’ll repost my GitHub post here.
Here is my updated model:

from transformers import GPT2LMHeadModel
from FeatureExtraction.NGrams import *

class GPT2FinetunedWithNgrams(GPT2LMHeadModel):
    def __init__(self, config):
        super().__init__(config)

    def load_ngrams_model(self, ngrams_model_path):
        self.ngrams_model = NGrams(ngrams_model_path)
        
    def forward(
            self,
            input_ids=None,
            past=None,
            attention_mask=None,
            token_type_ids=None,
            position_ids=None,
            head_mask=None,
            inputs_embeds=None,
            labels=None,
            use_cache=None,
            output_attentions=None,
            output_hidden_states=None,
            return_tuple=None,
    ):

        return_tuple = return_tuple if return_tuple is not None else self.config.use_return_tuple

        transformer_outputs = self.transformer(
            input_ids,
            past=past,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_tuple=return_tuple,
        )
        hidden_states = transformer_outputs[0]
        lm_logits = self.lm_head(hidden_states)

        #use gpt2 to generate a span of text based off input_ids?
        #gpt2_sent = ???

        loss = self.ngrams_model.sentence_loss(gpt2_sent)

        return (loss, lm_logits)

and here is my training script using Transformers Trainer:

from text_gen_w_transformers.finetune_gpt2 import GPT2FinetunedWithNgrams
from transformers import Trainer, TrainingArguments

model = GPT2FinetunedWithNgrams.from_pretrained('gpt2')
model.load_ngrams_model('/path/to/ngrams/model.pkl')

training_args = TrainingArguments(
    output_dir='/path/to/finetuned_gpt2',
    do_train=True,
    per_device_train_batch_size=16,
    learning_rate=1e-3,
    num_train_epochs=1,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=?????
)

trainer.train()

My questions are:

  1. You can see from the #gpt2_sent = ??? comment in the model code that I presume this is the place where I would generate a gpt2 sequence based on the version of gpt2 currently being finetuned. However, I am not sure of the best way to go about this. Any recommendations?

  2. In the training script, I am using the Trainer module. However, I don’t understand what the train_dataset parameter in Trainer expects. I have a csv file that contains one sequence per line, but I have a feeling I need to construct a Dataset object or something.

  3. I haven’t tried to run this code because I need to fill in the above 2 parts, but I also don’t think I’m setting any of the parameters for transformer_outputs. It looks like they all default to None, and I don’t know if that will be problematic. Any thoughts on this?

I’ve been reading through the documentation and really like the library. I’m also new to it and pytorch so I apologize if my questions are pretty basic. Thanks in advance for your help!

Here’s an update to the model code:

from transformers import GPT2LMHeadModel
from FeatureExtraction.NGrams import *
import time
from transformers import GPT2Tokenizer


class GPT2FinetunedWithNgrams(GPT2LMHeadModel):
    def __init__(self, config):
        super().__init__(config)
        self.tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

    def load_ngrams_model(self, ngrams_model_path):
        self.ngrams_model = NGrams(ngrams_model_path)

    def forward(
            self,
            input_ids=None,
            past=None,
            attention_mask=None,
            token_type_ids=None,
            position_ids=None,
            head_mask=None,
            inputs_embeds=None,
            labels=None,
            use_cache=True,
    ):
  
        transformer_outputs = self.transformer(
            input_ids,
            past=past,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
        )

        hidden_states = transformer_outputs[0]
        lm_logits = self.lm_head(hidden_states)

        # Somehow convert lm_logits to new_input_ids. Then can do the following:
        # gpt2_sent_ids = super().generate(new_input_ids, max_length=50) -> generates a tensor of token_ids for the sentence generated by gpt2.
        # gpt2_sent_str = self.tokenizer.decode(gpt2_sent_ids[0], skip_special_tokens=True) -> takes the tensor of token_ids and converts it to a string.
        # loss = self.ngrams_model.sentence_loss(gpt2_sent_str) -> the ngrams model uses a string as input to calculate a loss.
        # return (loss, lm_logits)

I think that will work for what I need, but I welcome feedback from others. I’ve been looking through the documentation and can’t seem to find a way to convert lm_logits into new_input_ids. That’ll take care of the first question. I still don’t understand how to pass my dataset to Trainer.
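
For the conversion, one option might be a greedy argmax over the logits (a sketch meant to slot into forward after lm_logits is computed; note that argmax is non-differentiable, so gradients wouldn’t flow back through a loss computed this way):

import torch

new_input_ids = torch.argmax(lm_logits, dim=-1)  # shape (batch_size, seq_len)
gpt2_sent_str = self.tokenizer.decode(new_input_ids[0], skip_special_tokens=True)
loss = self.ngrams_model.sentence_loss(gpt2_sent_str)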


Any thoughts on this @valhalla, @prajjwal1, @patrickvonplaten, or anyone else?

Hi @aclifton314,
I won’t be able to help you with the generate part. As for the data, see if this post helps you; if you still have doubts, ping me and I’ll try to provide more explanation.

@valhalla I’ll take a look at that post and see if it can help me with the data question. Thanks for your response!

@valhalla I read through that post and I see how the OP would get that error. However, I’m still not clear about creating the Dataset or DataLoader object that is expected. My guess is I need to create a custom Dataset object. I’m reading through this tutorial: https://pytorch.org/tutorials/beginner/data_loading_tutorial.html#

For context, my data is set up in a csv file. The first column is named “Title” and consists of the titles of various scientific abstracts. The second column is named “Raw” and contains the raw text of those scientific abstracts. Here’s an example:

Title, Raw
DistillBERT a distilled version of BERT smaller faster cheaper and lighter, As transfer learning from large-scale pretrained models becomes more prevalent in Natural Language Processing (NLP) blah blah blah.

It seems creating a custom Dataset object wouldn’t be too difficult, but I’m not sure if that’s the best approach.
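
Something like this sketch is what I have in mind (AbstractsDataset is just a name I made up, and I still need to check the padding details):

import pandas as pd
from torch.utils.data import Dataset
from transformers import GPT2Tokenizer

class AbstractsDataset(Dataset):
    def __init__(self, csv_path, max_length=512):
        self.df = pd.read_csv(csv_path)
        self.tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
        # gpt2 has no pad token by default, so reuse eos for padding
        self.tokenizer.pad_token = self.tokenizer.eos_token
        self.max_length = max_length

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        enc = self.tokenizer(
            self.df.iloc[idx]['Raw'],
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors='pt',
        )
        # Trainer passes these dict keys to the model's forward as kwargs
        return {'input_ids': enc['input_ids'].squeeze(0),
                'attention_mask': enc['attention_mask'].squeeze(0)}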

Can’t you use the new :hugs: nlp library to set up the dataset? I haven’t tried creating one on custom data yet, or using it directly with Trainer, but I think they provide wrappers for CSV which should help.
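
Untested, but something along these lines might work (the path is a placeholder):

import nlp
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token

# load the csv into a dataset
dataset = nlp.load_dataset('csv', data_files='/path/to/abstracts.csv')['train']

def tokenize(batch):
    return tokenizer(batch['Raw'], truncation=True, padding='max_length', max_length=512)

# tokenize and expose torch tensors that Trainer can consume
dataset = dataset.map(tokenize, batched=True)
dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])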

@swayson That’s a good idea. I’ll look into it.