Escape symbol appearance

broadwayj · April 16, 2024, 9:33am

I have a dataset of prefixes that I would like to continue with my LM. I then want to build a dataset of the generated sentences and save it to a CSV file as raw text, so the delimiter must never appear in the generated text. Currently I pass bad_words_ids=DELIMITER_SYMBOL_ID in the LM's GenerationConfig to ban the delimiter symbol from being generated. In theory, however, this symbol could be part of some other, longer token in the tokenizer's vocabulary (e.g. if the delimiter is !, the token !!! could still exist in the tokenizer), and that token would be decoded to text containing the banned delimiter. How do I completely eliminate this symbol from the final text? The only thing I could come up with is filtering the resulting text and manually replacing every delimiter with some other symbol. Is there a more convenient way?
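
To make the question concrete, here is a minimal sketch of two options I can think of, not a definitive solution. The first scans the whole vocabulary and collects every token whose decoded text contains the delimiter, in the list-of-lists shape that bad_words_ids expects. The second sidesteps the problem by letting Python's csv module quote every field, so a delimiter inside the text no longer breaks the file. The helper names find_tokens_containing and rows_to_csv and the toy vocabulary are hypothetical and only for illustration; with a real tokenizer I assume one would pass tokenizer.get_vocab() and tokenizer.decode instead.

```python
import csv
import io
from typing import Callable, Dict, List


def find_tokens_containing(
    vocab: Dict[str, int],
    decode: Callable[[List[int]], str],
    delimiter: str,
) -> List[List[int]]:
    """Collect every token id whose decoded text contains the delimiter.

    Each id is wrapped in its own list, matching the list-of-lists
    format used for bad_words_ids.
    """
    banned = []
    for token_id in sorted(vocab.values()):
        # Decode the id rather than inspecting the raw vocab key, because
        # some tokenizers store keys in an internal byte-level encoding.
        if delimiter in decode([token_id]):
            banned.append([token_id])
    return banned


def rows_to_csv(rows: List[List[str]], delimiter: str = ",") -> str:
    """Serialize rows with every field quoted, so delimiters in text are safe."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, quoting=csv.QUOTE_ALL)
    writer.writerows(rows)
    return buffer.getvalue()


# Toy stand-in for a tokenizer vocabulary (hypothetical, for illustration only).
toy_vocab = {"hello": 0, "!": 1, "!!!": 2, " world": 3, "?!": 4}
id_to_text = {token_id: text for text, token_id in toy_vocab.items()}


def toy_decode(ids: List[int]) -> str:
    return "".join(id_to_text[i] for i in ids)


banned_ids = find_tokens_containing(toy_vocab, toy_decode, "!")
print(banned_ids)  # [[1], [2], [4]]

# Even if a delimiter slips through, quoting keeps the CSV parseable.
csv_text = rows_to_csv([["prefix one", "wow!!! really"]], delimiter="!")
print(list(csv.reader(io.StringIO(csv_text), delimiter="!")))
```

One caveat I am not sure about: for a multi-byte delimiter in a byte-level tokenizer, the character might be split across several tokens, none of which contains it on its own, so the vocabulary scan alone may not be a complete guarantee. The quoting approach does not depend on the tokenizer at all.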
