Implementation of Stopping Criteria List

In addition to @hatimbr's comment, sometimes the same string may be mapped to different ids by the tokenizer, depending on the preceding tokens.
Example: in the context of the given text,

{
    "text": "\n'pizza',\n'calzone',\n'stromboli',\n'focaccia',\n'flatbread',\n'naan',\n'roti',\n'paratha']"
}

the final '] maps to tensor([525, 29962]), while my given stop sequence '] tokenized on its own maps to tensor(2033).
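You can see the mismatch directly (a quick sketch; it assumes a Llama-style tokenizer is already loaded as tokenizer, and the exact ids will vary by model):

# Sketch: the same surface string tokenizes differently depending on context.
ids_alone = tokenizer("']", add_special_tokens=False)["input_ids"]
ids_in_context = tokenizer("'paratha']", add_special_tokens=False)["input_ids"]
print(ids_alone)       # e.g. [2033]
print(ids_in_context)  # e.g. [..., 525, 29962]; here '] is split across different ids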

As a workaround:

import torch
from transformers import StoppingCriteria, StoppingCriteriaList

class StoppingCriteriaSub(StoppingCriteria):
    def __init__(self, stops=None):
        super().__init__()
        # Decode the stop sequences once up front; there is no need to move them
        # to CUDA, since only the decoded strings are compared.
        self.stops = [tokenizer.decode(stop) for stop in (stops or [])]

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # Stop as soon as the decoded last token matches any decoded stop string.
        last_token = tokenizer.decode(input_ids[0][-1])
        return any(stop == last_token for stop in self.stops)

To use it:

stop_words = ["]", "']", "']\n", "]\n", "\n\n", "']\n\n"]
stop_words_ids = [tokenizer(stop_word, return_tensors='pt', add_special_tokens=False)['input_ids'].squeeze() for stop_word in stop_words]
stopping_criteria = StoppingCriteriaList([StoppingCriteriaSub(stops=stop_words_ids)])
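The criteria list is then passed to generate() (a minimal sketch; model and prompt are placeholders for your own model and input):

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(
    **inputs,
    max_new_tokens=128,
    stopping_criteria=stopping_criteria,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))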

Decoding on every generation step may be slowing down text generation, so if anyone has better suggestions, I'm eager to listen.
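One idea I haven't benchmarked (a sketch; StopOnStrings and tail_tokens are names I made up here): decode only a short tail of the sequence each step and match it against the stop strings directly, which also catches stops that span multiple tokens:

class StopOnStrings(StoppingCriteria):
    def __init__(self, tokenizer, stop_strings, tail_tokens=5):
        super().__init__()
        self.tokenizer = tokenizer
        self.stop_strings = stop_strings
        self.tail_tokens = tail_tokens  # how many trailing tokens to decode per step

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # Decode the tail once and check whether it ends with any stop string.
        tail = self.tokenizer.decode(input_ids[0][-self.tail_tokens:])
        return any(tail.endswith(s) for s in self.stop_strings)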

I got Llama 2 to produce a parsable list with this.
