I am using the generate function to produce several possible continuations of a sentence context, together with their probabilities. It works reasonably well, but I run into problems when words are made up of more than one token.
Since some generated tokens are only sub-parts of words, I need a way to generate output only up to a word boundary. I suspect this could be solved with a stopping_criteria,
but I cannot figure out how to implement it; a rough sketch of what I have in mind is at the end of this post.
Below is an example:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = 'PlanTL-GOB-ES/gpt2-large-bne'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, pad_token_id=tokenizer.eos_token_id)

context = "Las brujas vuelan en una"  # "witches fly on a"
input_ids = tokenizer.encode(context, return_tensors='pt')

# punctuation and whitespace that the model should never generate
bad_list = tokenizer([' ', ',', '.', '..', '...', '....', '.....', ':', ';', '"', '"', '?', '!', '/', '-',
                      '(', ')', '()', "'", ' ', ']', '['],
                     add_special_tokens=False)

outputs = model.generate(input_ids,
                         return_dict_in_generate=True,
                         output_scores=True,
                         num_return_sequences=10,
                         num_beams=10,
                         temperature=0.1,
                         max_new_tokens=3,
                         bad_words_ids=bad_list.input_ids)

# keep only the newly generated tokens, i.e. drop the context
gen_sequences = outputs.sequences[:, input_ids.shape[-1]:]
token_list = gen_sequences.tolist()
for token in token_list:
    print(token, tokenizer.decode(token))
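In case it is relevant: as far as I understand, the per-sequence probabilities can be read off outputs.sequences_scores, which with beam search holds each returned beam's (length-penalised) log-probability score:

print(outputs.sequences_scores)  # one (length-penalised) log-probability score per returned sequence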
The input context is “Las brujas vuelan en una” (“witches fly on a”) and the generated outputs are:
[749, 13750, 313] escoba de
[749, 13750, 342] escoba y
[749, 13750, 1192] escoba vol
[16234, 313, 8326] bola de cristal
[749, 13750, 341] escoba que
[749, 13750, 21127] escoba mágica
[313, 387, 37835] de las carrozas
[11568, 603, 313] carroza de
[34999, 1625, 313] avioneta de
[749, 13750, 350] escoba con
Here, the word “escoba” (broom) consists of two subtokens, 749 and 13750, whereas the word “bola” (ball) consists of just one token, 16234. I want some way of telling the generate function to produce tokens up to and including, but no further than, those constituting one word.
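For reference, inspecting the raw tokens seems to confirm where the word boundaries are (assuming this tokenizer follows the usual GPT-2 byte-level convention of marking word-initial tokens with a leading 'Ġ'):

print(tokenizer.convert_ids_to_tokens([749, 13750, 313]))
# something like ['Ġesc', 'oba', 'Ġde']: two pieces for "escoba", then a token that starts a new word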
Is this possible?
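For what it is worth, this is roughly the kind of stopping criterion I have in mind. It is only an untested sketch: I am not sure the __call__ signature/return type matches my transformers version, whether the 'Ġ' convention really holds for this model, or how it would behave with beam search, since it only inspects the first sequence and would stop the whole batch.

from transformers import StoppingCriteria, StoppingCriteriaList

class StopAtWordBoundary(StoppingCriteria):
    # Stop once the newest generated token opens a new word,
    # i.e. the previously generated word is complete.
    def __init__(self, tokenizer, prompt_length):
        self.tokenizer = tokenizer
        self.prompt_length = prompt_length  # number of tokens in the context

    def __call__(self, input_ids, scores, **kwargs):
        generated = input_ids[0, self.prompt_length:]  # new tokens of the first sequence only
        if generated.shape[0] < 2:
            return False  # the first generated token always starts the word
        last_token = self.tokenizer.convert_ids_to_tokens(generated[-1].item())
        # a leading 'Ġ' means this token begins a new word
        return last_token.startswith('Ġ')

stopping = StoppingCriteriaList([StopAtWordBoundary(tokenizer, input_ids.shape[-1])])
# the idea would be to pass stopping_criteria=stopping to model.generate
# and trim off the final boundary token afterwards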