Generate method during finetuning

I am inheriting from a pre-trained model:

class GPT2FinetunedWithNgrams(GPT2LMHeadModel):
    @timer
    def __init__(self, config, model_tokenizer=None):
        super().__init__(config)
        self.tokenizer = GPT2Tokenizer.from_pretrained('gpt2', padding_side='right')
        self.tokenizer.pad_token = self.tokenizer.eos_token

and in the forward method, I need to generate sequences from the model while it is being finetuned:

def forward(
            self,
            input_ids=None,
            past=None,
            attention_mask=None,
            token_type_ids=None,
            position_ids=None,
            head_mask=None,
            inputs_embeds=None,
            labels=None,
            use_cache=True,
    ):
    beam_output = self.generate(
        input_ids,
        max_length=50,
        num_beams=5,
        early_stopping=True,
    )
    # Pass beam_output to a different loss function and return the loss

My question is: will the generate method use the weights of the model currently being finetuned, or will it use static weights from some other GPT-2 model?
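
As a quick sanity check (separate from the finetuning code, and purely my own experiment), I modified a weight in place and compared generations before and after, which suggests generate() does follow the live weights of the instance it is called on. I'd still like to confirm that this is the intended behaviour:

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Standalone check: generate() is a method on this nn.Module instance, so it should
# read whatever weights the instance currently holds.
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
input_ids = tokenizer('We present an update on the results', return_tensors='pt')['input_ids']

before = model.generate(input_ids, max_length=20)

with torch.no_grad():
    model.lm_head.weight.zero_()  # deliberately clobber the LM head in place

after = model.generate(input_ids, max_length=20)

# The two generations should differ, i.e. generate() follows the live weights
# rather than a frozen copy.
print(before.tolist() == after.tolist())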

I ran this code and it looks like I’m getting a recursion error:

def sd_data_collator(dataset_samples_list):
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2', padding_side='right')
    tokenizer.pad_token = tokenizer.eos_token

    encoded_results = tokenizer(dataset_samples_list, padding=True, truncation=True, return_tensors='pt', return_attention_mask=True)

    batch = {}
    batch['input_ids'] = torch.stack([result for result in encoded_results['input_ids']])
    batch['past'] = None
    batch['attention_mask'] = torch.stack([result for result in encoded_results['attention_mask']])
    batch['position_ids'] = None
    batch['head_mask'] = None
    batch['inputs_embeds'] = None
    batch['labels'] = None
    batch['use_cache'] = True
    return batch

sd_dataset = SDAbstractsDataset('/path/to/sd_samples_64.csv')

training_args = TrainingArguments(
    output_dir='/path/to/finetuned_gpt2',
    do_train=True,
    per_device_train_batch_size=4,
    learning_rate=1e-3,
    num_train_epochs=1
)

model = GPT2FinetunedWithNgrams.from_pretrained('gpt2')

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=sd_dataset,
    data_collator = sd_data_collator
)

trainer.train()

And here is the model class:

class GPT2FinetunedWithNgrams(GPT2LMHeadModel):
    def __init__(self, config, model_tokenizer=None):
        super().__init__(config)
        self.tokenizer = GPT2Tokenizer.from_pretrained('gpt2', padding_side='right')
        self.tokenizer.pad_token = self.tokenizer.eos_token

    def load_ngrams_model(self, ngrams_model_path):
        self.ngrams_model = NGrams(ngrams_model_path)

    def forward(
            self,
            input_ids=None,
            past=None,
            attention_mask=None,
            token_type_ids=None,
            position_ids=None,
            head_mask=None,
            inputs_embeds=None,
            labels=None,
            use_cache=True,
    ):

        output = self.generate(input_ids=input_ids, max_length=470)

Here’s the whole error (it’s really lengthy):

Some weights of GPT2FinetunedWithNgrams were not initialized from the model checkpoint at gpt2 and are newly initialized: ['h.0.attn.masked_bias', 'h.1.attn.masked_bias', 'h.2.attn.masked_bias', 'h.3.attn.masked_bias', 'h.4.attn.masked_bias', 'h.5.attn.masked_bias', 'h.6.attn.masked_bias', 'h.7.attn.masked_bias', 'h.8.attn.masked_bias', 'h.9.attn.masked_bias', 'h.10.attn.masked_bias', 'h.11.attn.masked_bias', 'lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Epoch:   0%|          | 0/1 [00:00<?, ?it/s]
Iteration:   0%|          | 0/16 [00:00<?, ?it/s]Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence
Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence
Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence
Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence
.
.
.
      File "/path/to/anaconda3/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "/path/to/anaconda3/lib/python3.7/site-packages/transformers/generation_utils.py", line 480, in generate
    model_specific_kwargs=model_specific_kwargs,
  File "/path/to/anaconda3/lib/python3.7/site-packages/transformers/generation_utils.py", line 520, in _generate_no_beam_search
    outputs = self(**model_inputs)
  File "/path/to/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/path/to/ric-2020/text_gen_w_transformers/finetune_gpt2.py", line 33, in forward
    
  File "/path/to/anaconda3/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "/path/to/anaconda3/lib/python3.7/site-packages/transformers/generation_utils.py", line 350, in generate
    "Setting `pad_token_id` to {} (first `eos_token_id`) to generate sequence".format(eos_token_id)
.
.
.
  File "/path/to/anaconda3/lib/python3.7/logging/__init__.py", line 1390, in warning
    self._log(WARNING, msg, args, **kwargs)
  File "/path/to/anaconda3/lib/python3.7/logging/__init__.py", line 1514, in _log
    self.handle(record)
  File "/path/to/anaconda3/lib/python3.7/logging/__init__.py", line 1524, in handle
    self.callHandlers(record)
  File "/path/to/anaconda3/lib/python3.7/logging/__init__.py", line 1594, in callHandlers
    lastResort.handle(record)
  File "/path/to/anaconda3/lib/python3.7/logging/__init__.py", line 894, in handle
    self.emit(record)
  File "/path/to/anaconda3/lib/python3.7/logging/__init__.py", line 1025, in emit
    msg = self.format(record)
  File "/path/to/anaconda3/lib/python3.7/logging/__init__.py", line 869, in format
    return fmt.format(record)
  File "/path/to/anaconda3/lib/python3.7/logging/__init__.py", line 608, in format
    record.message = record.getMessage()
  File "/path/to/anaconda3/lib/python3.7/logging/__init__.py", line 360, in getMessage
    def getMessage(self):
RecursionError: maximum recursion depth exceeded while calling a Python object

My guess is that calling self.generate() inside forward is what produces the recursion: generate() builds the model inputs and calls self(**model_inputs), which dispatches back into my overridden forward, which calls generate() again, and so on until the recursion limit is hit.
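
The only workaround I can think of so far is to add a guard flag so that the calls generate() makes internally fall through to the stock GPT-2 forward instead of re-entering my override. A rough sketch (the _generating flag is my own invention and I haven't verified it is a sane approach; tokenizer setup omitted):

from transformers import GPT2LMHeadModel

class GPT2FinetunedWithNgrams(GPT2LMHeadModel):
    def __init__(self, config, model_tokenizer=None):
        super().__init__(config)
        # (tokenizer setup from the original class omitted for brevity)
        self._generating = False  # guard flag, my own addition

    def forward(self, input_ids=None, attention_mask=None, **kwargs):
        if self._generating:
            # Re-entered from inside generate(): behave like the stock GPT-2 forward.
            return super().forward(input_ids=input_ids, attention_mask=attention_mask, **kwargs)

        self._generating = True
        try:
            # NOTE: generate() runs under torch.no_grad() (visible in the traceback above),
            # so beam_output will not carry gradients.
            beam_output = self.generate(input_ids, max_length=50, num_beams=5, early_stopping=True)
        finally:
            self._generating = False

        # ... pass beam_output to the custom loss and return the loss here ...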

Is it possible to use the functionality within the generate method (like beam search, top-k, etc.) without causing this recursion error during finetuning?

Not sure about the recursion bug, but here's an implementation of top-k / top-p sampling that you can use.
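
Something along these lines, using the top_k_top_p_filtering helper that ships with transformers (just a sketch: the class name, the 25-token budget and the sampling parameters are placeholders, the attention mask is ignored, and the import fallback is only there because the helper's location has varied between releases). Because the loop calls the stock GPT-2 forward through super(), your overridden forward is never re-entered, so the recursion can't happen:

import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel

try:
    from transformers import top_k_top_p_filtering
except ImportError:
    from transformers.generation_utils import top_k_top_p_filtering

class GPT2WithManualSampling(GPT2LMHeadModel):
    """Sketch: top-k / top-p decoding inside forward() without calling self.generate()."""

    def forward(self, input_ids=None, attention_mask=None, **kwargs):
        generated = input_ids
        for _ in range(25):  # placeholder budget of 25 new tokens
            # Call the stock GPT-2 forward directly, so this overridden forward is
            # never re-entered. (No KV cache and no attention mask, for brevity.)
            logits = super().forward(input_ids=generated)[0]
            next_token_logits = logits[:, -1, :]
            filtered = top_k_top_p_filtering(next_token_logits, top_k=50, top_p=0.95)
            probs = F.softmax(filtered, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            generated = torch.cat([generated, next_token], dim=-1)
        # ... feed `generated` to the custom loss and return the loss ...
        return generated

Keep in mind that sampling discrete token ids isn't differentiable, so whatever loss you compute downstream has to work from the generated ids (or from logits you keep during the loop) rather than backpropagating through the sampling step.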

Which version of transformers are you using? We fixed a similar tokenizer recursion bug recently.

@thomwolf 3.0.2

I see. Do you think you could post a full and self-contained code example reproducing the bug?

@thomwolf, no problem. Here you go. Hopefully it works on your end:

Module Versions:

  • transformers 3.0.2
  • torch 1.5.1
  • pandas 1.0.5

Code:

from transformers import GPT2LMHeadModel, GPT2Tokenizer, Trainer, TrainingArguments
from torch.utils.data import Dataset
import torch
import pandas as pd

class TmpGPT2(GPT2LMHeadModel):
    def __init__(self, config):
        super().__init__(config)
        self.tokenizer = GPT2Tokenizer.from_pretrained('gpt2', padding_side='right')
        self.tokenizer.pad_token = self.tokenizer.eos_token

    def forward(
            self,
            input_ids=None,
            past=None,
            attention_mask=None,
            token_type_ids=None,
            position_ids=None,
            head_mask=None,
            inputs_embeds=None,
            labels=None,
            use_cache=True,
    ):

        max_length = input_ids.shape[1] + 25
        generated_gpt2_ids = self.generate(input_ids=input_ids, max_length=max_length, attention_mask=attention_mask)
        #decoded_output_generated_gpt2 = self.tokenizer.batch_decode(generated_gpt2_ids, skip_special_tokens=True)
        return None

class TmpDataset(Dataset):
    def __init__(self, text_dict):
        self.data_df = pd.DataFrame.from_dict(text_dict)
        self.tokenizer = GPT2Tokenizer.from_pretrained('gpt2', padding_side='right')
        self.tokenizer.pad_token = self.tokenizer.eos_token

    def __len__(self):
        return len(self.data_df)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        text = self.data_df.iloc[idx, 1]
        return text


def tmp_data_collator(dataset_samples_list):
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2', padding_side='right')
    tokenizer.pad_token = tokenizer.eos_token

    encoded_results = tokenizer(dataset_samples_list, padding=True, truncation=True, return_tensors='pt', return_attention_mask=True)

    batch = {}
    batch['input_ids'] = torch.stack([result for result in encoded_results['input_ids']])
    batch['past'] = None
    batch['attention_mask'] = torch.stack([result for result in encoded_results['attention_mask']])
    batch['position_ids'] = None
    batch['head_mask'] = None
    batch['inputs_embeds'] = None
    batch['labels'] = None
    batch['use_cache'] = True
    return batch

title1 = 'Double Chooze paper'
title2 = 'Novel Control System paper'
title3 = 'Waste water paper'
title4 = 'Fuzzy predictive control algo paper'
text_titles = [title1, title2, title3, title4]

prompt1 = 'We present an update on the results of the Double Chooz experiment. Double Chooz searches for the neutrino mixing angle, θ13, in the three-neutrino mixing matrix via the disappearance of produced by the dual 4.27 GW/th Chooz B Reactors. Here we discuss updated oscillation fit results using both the rate and the shape of the anti-neutrino energy spectrum. In the most recent oscillation analysis we included data with neutron captures on Gadolinium and Hydrogen along with the reactor off data that we collected. This is an important step in our multi-year program to establish the value of θ13.'
prompt2 = 'The paper covers detailed discussion on novel control system developed for adaptive fluid-based shock-absorbers serving for mitigation of unknown impact excitations. In order to provide complete independence of the control system from the loading conditions, the Hybrid Prediction Control (HPC) was elaborated. The proposed method is an extension of previously introduced kinematic feedback control which ensures optimal path finding, tracking and path update in case of high disturbance or sudden change of loading conditions. Implementation of the presented control system allows to obtain self-adaptive fluid-based absorbers providing robust impact mitigation. In contrast to previously developed methods of Adaptive Impact Absorption, the proposed control strategy does not require prior knowledge of impact excitation or its preliminary identification. The independence of applied control system from parameters of impact loading results in the capability of automatic path correction in the case of disturbance occurrence and re-adaptation to a number of subsequent impacts. The successful operation of the self-adaptive system is investigated with the use of numerical examples involving double-chamber pneumatic shock-absorber equipped with controllable valve. Efficiency of the HPC is proved by comparison with passive absorber as well as device equipped with adaptive and optimal control modules.'
prompt3 = 'This study aimed to produce biosurfactant from Pseudozyma tsukubaensis using cassava wastewater and an inoculum (biomass) for galactooligosaccharides synthesis from lactose as an integrated system. First, the use of cassava wastewater as a low cost culture medium by P. tsukubaensis to produce biomass and biosurfactant was evaluated and optimized. Then, the microbial cells (biomass) obtained from the optimized process were used to produce galactooligosaccharides from lactose. The optimum conditions for biosurfactant and biomass synthesis were found to be 80% (v/v) of cassava wastewater at 30°C and 200rpm for 48h. The highest concentration of biosurfactant, that is, minimum surface tension value and maximum biomass concentration predicted were experimentally confirmed as 26.87mN/m and 10.5g/L, respectively. The biosurfactant obtained showed good thermal (121°C/1h), pH (2–11) and ionic strength (0–25% NaCl) stability. Excellent emulsifier activity was also verified, suggesting a potential application in enhanced oil recovery. Galactooligosaccharides synthesized by the Kluyveromyces genus have been extensively investigated, however, few studies have reported transgalactosylation ability by other yeast genera. The transgalactosylation activity of the yeast biomass at optimized conditions from 40% (w/w) lactose resulted in galactooligosaccharides production of 73.12g/L and a yield of 18.28% (w/w) at pH 8.0 and 30°C in 24h. This research showed the technical feasibility of an integrated process: biosurfactant and GOS production from P. tsukubaensis, which takes advantage of the remarkable metabolism of this microorganism. To the best of our knowledge, this is the first study reporting the potential of P. tsukubaensis to produce two economical biotechnological products of increase interest as an integrated process.'
prompt4 = 'Advantages of a fuzzy predictive control algorithm are discussed in the paper. The fuzzy predictive algorithm is a combination of a DMC (Dynamic Matrix Control) algorithm and Takagi–Sugeno fuzzy modeling, thus it inherits advantages of both techniques. The algorithm is numerically effective. It is in fact generalization of the standard DMC algorithm widely used in the industry, thus the existing implementations of the DMC algorithm can be extended using the presented fuzzy approach. A simple and easy to apply method of fuzzy predictive control algorithms synthesis is presented in the paper. It can be easy applied also in the case of Multiple Input Multiple Output (MIMO) control plants. Moreover, information about measured disturbance can be included in the algorithms in an easy way. The advantages of the fuzzy predictive control algorithm are demonstrated in the example control systems of two nonlinear chemical reactors: the first one—with inverse response and the second one—a MIMO plant with time delay.'
texts = [prompt1, prompt2, prompt3, prompt4]

text_dict = {'title': text_titles, 'text': texts}
tmp_dataset = TmpDataset(text_dict)

training_args = TrainingArguments(
    output_dir='YOUR OUTPUT DIR',
    do_train=True,
    per_device_train_batch_size=2,
    learning_rate=1e-3,
    num_train_epochs=1
)

model = TmpGPT2.from_pretrained('gpt2')

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tmp_dataset,
    data_collator=tmp_data_collator
)

trainer.train()