For many models, like GPT2, the `generate` function accepts `bad_words_ids`. We're currently passing about 2500 tokenized phrases into this, and finding that it works well, but also that it slows down inference considerably: with 2500 phrases, a `generate` call that would take 250ms without `bad_words_ids` takes far longer with them.
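For reference, here's roughly how we're building and passing the list (the model, prompt, and phrase list below are placeholders standing in for our real setup and ~2500 phrases):

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").to("cuda")

# Placeholder phrases; in practice this list has ~2500 entries
banned_phrases = ["phrase one", "phrase two", "phrase three"]

# bad_words_ids must be a list of lists of token ids
bad_words_ids = [
    tokenizer(phrase, add_special_tokens=False).input_ids
    for phrase in banned_phrases
]

inputs = tokenizer("Some prompt text", return_tensors="pt").to("cuda")
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    bad_words_ids=bad_words_ids,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```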
Maybe there is just no solution for this, and we need to simply curtail our usage of `bad_words_ids`.
We were also looking at this code: `bad_words_ids` is required to be passed in as a `list`. If we could somehow use a tensor instead, and put that tensor on the GPU, would that speed it up?
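To make that idea concrete, here's a minimal sketch of a custom logits processor that keeps the banned ids in a single tensor on the GPU and masks them with one indexed write. This is not how the library implements `bad_words_ids`; it's an assumption-laden workaround that only covers phrases that tokenize to a single token, since multi-token phrases need the sequential prefix matching the built-in processor does. `TensorBadWordsProcessor` is a hypothetical name, and `model`, `inputs`, and `bad_words_ids` are reused from the snippet above.

```python
import torch
from transformers import LogitsProcessor, LogitsProcessorList

class TensorBadWordsProcessor(LogitsProcessor):
    """Masks banned token ids with a single indexed write on the GPU.

    Only handles single-token bans; multi-token phrases still need
    the sequential matching that the built-in processor performs.
    """

    def __init__(self, banned_token_ids, device):
        # One flat tensor of token ids, moved to the GPU once up front
        self.banned = torch.tensor(banned_token_ids, dtype=torch.long, device=device)

    def __call__(self, input_ids, scores):
        # Vectorized mask: set the score of every banned id to -inf
        scores[:, self.banned] = float("-inf")
        return scores

# Keep only phrases that tokenize to a single token
single_token_ids = [ids[0] for ids in bad_words_ids if len(ids) == 1]

processors = LogitsProcessorList(
    [TensorBadWordsProcessor(single_token_ids, "cuda")]
)
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    logits_processor=processors,
)
```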
Any other suggestions?