For many models, like GPT2, the `generate` function accepts `bad_words_ids`. We're currently passing about 2500 tokenized phrases into this, and finding that it works well, but also that it slows down inference considerably: with 2500 phrases, a `generate` call that would take 250ms without `bad_words_ids` takes far longer with them.
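For reference, here's roughly how we're building and passing the list (the model, prompt, and phrase list below are placeholders standing in for our real setup and ~2500 phrases):

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").to("cuda")

# Placeholder phrases; in practice this list has ~2500 entries
banned_phrases = ["phrase one", "phrase two", "phrase three"]

# bad_words_ids must be a list of lists of token ids
bad_words_ids = [
    tokenizer(phrase, add_special_tokens=False).input_ids
    for phrase in banned_phrases
]

inputs = tokenizer("Some prompt text", return_tensors="pt").to("cuda")
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    bad_words_ids=bad_words_ids,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```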
Maybe there is just no solution for this, and we need to simply curtail our usage of `bad_words_ids`.
We were also looking at this code: `bad_words_ids` is required to be passed in as a `list`. If we could somehow use a tensor instead, and put that tensor on the GPU, would that speed it up?
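To make that idea concrete, here's a minimal sketch of a custom logits processor that keeps the banned ids in a single tensor on the GPU and masks them with one indexed write. This is not how the library implements `bad_words_ids`; it's an assumption-laden workaround that only covers phrases that tokenize to a single token, since multi-token phrases need the sequential prefix matching the built-in processor does. `TensorBadWordsProcessor` is a hypothetical name, and `model`, `inputs`, and `bad_words_ids` are reused from the snippet above.

```python
import torch
from transformers import LogitsProcessor, LogitsProcessorList

class TensorBadWordsProcessor(LogitsProcessor):
    """Masks banned token ids with a single indexed write on the GPU.

    Only handles single-token bans; multi-token phrases still need
    the sequential matching that the built-in processor performs.
    """

    def __init__(self, banned_token_ids, device):
        # One flat tensor of token ids, moved to the GPU once up front
        self.banned = torch.tensor(banned_token_ids, dtype=torch.long, device=device)

    def __call__(self, input_ids, scores):
        # Vectorized mask: set the score of every banned id to -inf
        scores[:, self.banned] = float("-inf")
        return scores

# Keep only phrases that tokenize to a single token
single_token_ids = [ids[0] for ids in bad_words_ids if len(ids) == 1]

processors = LogitsProcessorList(
    [TensorBadWordsProcessor(single_token_ids, "cuda")]
)
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    logits_processor=processors,
)
```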
Any other suggestions?