Thanks. I'm fully aware of other CausalLM models (distilgpt2, gpt2, …). However, I'm only interested in the case of RobertaForCausalLM. Do you know of any pretrained model that uses this architecture the way GPT-2 does?
Actually, the model can be used just like GPT-2. The reason you weren't getting any results is that skip_special_tokens=True was passed to the batch_decode method.
As can be seen below, the model does generate text; however, it generates special tokens (which is expected, as the model still needs to be fine-tuned on a downstream dataset):
from transformers import RobertaTokenizer, RobertaForCausalLM, RobertaConfig
import torch

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

# RoBERTa is an encoder by default; is_decoder=True lets it be used autoregressively
config = RobertaConfig.from_pretrained("roberta-base")
config.is_decoder = True
model = RobertaForCausalLM.from_pretrained("roberta-base", config=config)

input_ids = tokenizer("The sun is", return_tensors="pt").input_ids

# generate up to 30 tokens (greedy decoding)
outputs = model.generate(input_ids, do_sample=False, max_length=30)
tokenizer.batch_decode(outputs)
returns:
['<s>The sun is</s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s>']
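For reference, this is presumably why you saw no output: with skip_special_tokens=True the special tokens are stripped again, which for this non-fine-tuned model leaves only the prompt. A minimal sketch, continuing from the snippet above:

# decoding the same outputs with skip_special_tokens=True drops the special
# tokens, so only the original prompt remains
tokenizer.batch_decode(outputs, skip_special_tokens=True)
# -> ['The sun is']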
Training the RoBERTa model does not help. It still generates only a single token, i.e. </s>.
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM
import pandas as pd

dataset = load_dataset("lucasmccabe-lmi/CodeAlpaca-20k", split="train")

model_name = "deepset/tinyroberta-squad2"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

def formatting_prompts_func(example):
    output_texts = []
    for i in range(len(example['instruction'])):
        text = f"### Question: {example['instruction'][i]}\n ### Answer: {example['output'][i]}"
        output_texts.append(text)
    return output_texts

response_template = " ### Answer:"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)

trainer = SFTTrainer(
    model,
    train_dataset=dataset,
    formatting_func=formatting_prompts_func,
    data_collator=collator,
)
trainer.train()
trainer.save_model("./trained/")

# reload the fine-tuned model for inference
model_trained = AutoModelForCausalLM.from_pretrained("./trained/")

example = pd.DataFrame(dataset).head(n=10).iloc[0]
text = f"### Question: {example['instruction']}"
inputs = tokenizer.encode(text, return_tensors="pt")
outputs = model_trained.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0]))
This just appends a single </s> token at the end:
<s>### Question: Create a function that takes a specific input and produces a specific output using any mathematical operators. Write corresponding code in Python.</s></s>
Training a RoBERTa model from "deepset/tinyroberta-squad2" does not make a lot of sense, since that model is an encoder-only model trained for extractive question answering. Fine-tuning it for a different task (text generation in this case) would therefore not yield good results. You could further fine-tune it for extractive question answering on a different dataset, but fine-tuning it for text generation is not recommended.
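For completeness, this is roughly the task that checkpoint is meant for. A minimal sketch using the question-answering pipeline (the question and context strings below are just made-up examples):

from transformers import pipeline

# deepset/tinyroberta-squad2 is an extractive QA model: it selects a span
# from the provided context rather than generating free-form text
qa = pipeline("question-answering", model="deepset/tinyroberta-squad2")

result = qa(
    question="What do extractive QA models return?",
    context="Extractive QA models return a span of text taken from the given context.",
)
print(result["answer"])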
Rather, one typically takes a pre-trained decoder-only LLM (which has already been pre-trained for text generation) and fine-tunes it further. An example is taking openai-community/gpt2 · Hugging Face and fine-tuning it for a particular text generation task.
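For instance, your script above should work much better if you simply swap in a decoder-only checkpoint. A rough sketch reusing your SFTTrainer setup (the pad-token assignment is an extra step GPT-2 typically needs; hyperparameters are left at their defaults):

from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

dataset = load_dataset("lucasmccabe-lmi/CodeAlpaca-20k", split="train")

# GPT-2 is a decoder-only model pre-trained for text generation
model_name = "openai-community/gpt2"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

def formatting_prompts_func(example):
    output_texts = []
    for i in range(len(example['instruction'])):
        text = f"### Question: {example['instruction'][i]}\n ### Answer: {example['output'][i]}"
        output_texts.append(text)
    return output_texts

response_template = " ### Answer:"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)

trainer = SFTTrainer(
    model,
    train_dataset=dataset,
    formatting_func=formatting_prompts_func,
    data_collator=collator,
)
trainer.train()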