I am using the generate function to produce several possible continuations of a sentence context, together with their probabilities. It works reasonably well, but I run into problems when words are made up of more than one token.
Since some generated tokens are only sub-parts of words, I need a way to generate output only up to a word boundary. I suspect this could be solved with a stopping_criteria,
but I cannot figure out how to implement it; a rough sketch of what I have in mind is at the end of this post.
Below is an example:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = 'PlanTL-GOB-ES/gpt2-large-bne'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, pad_token_id=tokenizer.eos_token_id)

context = "Las brujas vuelan en una"  # "witches fly on a"
input_ids = tokenizer.encode(context, return_tensors='pt')

# punctuation and whitespace that the model should never generate
bad_list = tokenizer([' ', ',', '.', '..', '...', '....', '.....', ':', ';', '"', '"', '?', '!', '/', '-',
                      '(', ')', '()', "'", ' ', ']', '['],
                     add_special_tokens=False)

outputs = model.generate(input_ids,
                         return_dict_in_generate=True,
                         output_scores=True,
                         num_return_sequences=10,
                         num_beams=10,
                         temperature=0.1,
                         max_new_tokens=3,
                         bad_words_ids=bad_list.input_ids)

# keep only the newly generated tokens, i.e. drop the context
gen_sequences = outputs.sequences[:, input_ids.shape[-1]:]
token_list = gen_sequences.tolist()
for token in token_list:
    print(token, tokenizer.decode(token))
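In case it is relevant: as far as I understand, the per-sequence probabilities can be read off outputs.sequences_scores, which with beam search holds each returned beam's (length-penalised) log-probability score:

print(outputs.sequences_scores)  # one (length-penalised) log-probability score per returned sequence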
The input context is “Las brujas vuelan en una” (“witches fly on a”) and the generated outputs are:
[749, 13750, 313] escoba de
[749, 13750, 342] escoba y
[749, 13750, 1192] escoba vol
[16234, 313, 8326] bola de cristal
[749, 13750, 341] escoba que
[749, 13750, 21127] escoba mágica
[313, 387, 37835] de las carrozas
[11568, 603, 313] carroza de
[34999, 1625, 313] avioneta de
[749, 13750, 350] escoba con
Here, the word “escoba” (broom) consists of two subtokens, 749 and 13750, whereas the word “bola” (ball) consists of just one token, 16234. I want some way of telling the generate function to produce tokens up to and including, but no further than, those constituting one word.
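For reference, inspecting the raw tokens seems to confirm where the word boundaries are (assuming this tokenizer follows the usual GPT-2 byte-level convention of marking word-initial tokens with a leading 'Ġ'):

print(tokenizer.convert_ids_to_tokens([749, 13750, 313]))
# something like ['Ġesc', 'oba', 'Ġde']: two pieces for "escoba", then a token that starts a new word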
Is this possible?
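For what it is worth, this is roughly the kind of stopping criterion I have in mind. It is only an untested sketch: I am not sure the __call__ signature/return type matches my transformers version, whether the 'Ġ' convention really holds for this model, or how it would behave with beam search, since it only inspects the first sequence and would stop the whole batch.

from transformers import StoppingCriteria, StoppingCriteriaList

class StopAtWordBoundary(StoppingCriteria):
    # Stop once the newest generated token opens a new word,
    # i.e. the previously generated word is complete.
    def __init__(self, tokenizer, prompt_length):
        self.tokenizer = tokenizer
        self.prompt_length = prompt_length  # number of tokens in the context

    def __call__(self, input_ids, scores, **kwargs):
        generated = input_ids[0, self.prompt_length:]  # new tokens of the first sequence only
        if generated.shape[0] < 2:
            return False  # the first generated token always starts the word
        last_token = self.tokenizer.convert_ids_to_tokens(generated[-1].item())
        # a leading 'Ġ' means this token begins a new word
        return last_token.startswith('Ġ')

stopping = StoppingCriteriaList([StopAtWordBoundary(tokenizer, input_ids.shape[-1])])
# the idea would be to pass stopping_criteria=stopping to model.generate
# and trim off the final boundary token afterwards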