I have a dataset of prefixes that I would like to continue with my LM. I also want to build a dataset of the generated sentences and save it to a CSV file as raw text, so the delimiter must never appear in the generated text. Currently I pass `bad_words_ids=DELIMITER_SYMBOL_ID` in the LM's `GenerationConfig` to ban the delimiter symbol from being generated. However, in theory this symbol could be part of some other, longer token in the tokenizer's vocabulary (e.g. if the delimiter is `!`, the token `!!!` could still exist in the tokenizer), and such a token would be decoded to text containing the banned delimiter. How do I completely remove this symbol from the final text? The only thing I could come up with is filtering the resulting text and manually replacing every delimiter with some other symbol. Is there a more convenient way?
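
To make the concern concrete, here is a minimal sketch of two possible directions, not a definitive solution. The first scans the whole vocabulary and collects every token whose decoded text contains the delimiter, in the list-of-lists shape that `bad_words_ids` expects. The second lets Python's `csv` module quote any field that contains the delimiter, so the text does not have to be delimiter-free at all. The helper names `token_ids_containing` and `write_rows`, and the toy vocabulary, are hypothetical and only there for illustration.

```python
import csv
import io
from typing import Callable, Iterable, List


def token_ids_containing(delimiter: str, vocab_size: int,
                         decode: Callable[[int], str]) -> List[List[int]]:
    """Return [[id], ...] for every token whose decoded text contains `delimiter`."""
    return [[i] for i in range(vocab_size) if delimiter in decode(i)]


# Toy vocabulary standing in for a real tokenizer (hypothetical data).
toy_vocab = ["hello", "!", "!!!", " world", "?!", "."]
banned = token_ids_containing("!", len(toy_vocab), lambda i: toy_vocab[i])
print(banned)  # [[1], [2], [4]]


def write_rows(rows: Iterable[List[str]], delimiter: str = ",") -> str:
    """Serialize rows to CSV text; fields containing the delimiter get quoted."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, quoting=csv.QUOTE_MINIMAL)
    writer.writerows(rows)
    return buf.getvalue()
```

With a Hugging Face tokenizer, the same helper could presumably be called as `token_ids_containing("!", len(tokenizer), lambda i: tokenizer.decode([i]))` and the result passed as `bad_words_ids`. This only covers a single-character delimiter: a multi-character delimiter can still be formed across two adjacent tokens, so a final check on the decoded text (or relying on CSV quoting) would still be needed.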