In the documentation of “Post-processors” in Hugging Face’s tokenizers library, many post-processors have an argument “trim_offsets”. The explanation reads:
It also takes care of trimming the offsets. By default, the ByteLevel BPE might include whitespaces in the produced tokens. If you don’t want the offsets to include these whitespaces, then this PostProcessor should be initialized with trim_offsets=True.
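For context, here is a minimal check of what I think the docs are describing, using just the ByteLevel pre-tokenizer on its own (the input string is only an example I picked):

```python
from tokenizers import pre_tokenizers

pre = pre_tokenizers.ByteLevel(add_prefix_space=False)

# The space before "world" is folded into the token itself (rendered as "Ġ"),
# and the reported span (5, 11) starts at the space character:
print(pre.pre_tokenize_str("Hello world"))
# [('Hello', (0, 5)), ('Ġworld', (5, 11))]
```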
I am still confused by the explanation. Specifically:
- Why might “the ByteLevel BPE include whitespaces in the produced tokens”? Is this specific to byte-level BPE, or does it also happen with ordinary BPE?
- What does the “trim_offsets” argument actually do? Why would we not “want the offsets to include these whitespaces”? (See the sketch after this list.)
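To make the second question concrete, here is a minimal sketch of what I am trying to understand. The tiny training corpus and the vocab_size are arbitrary values I picked just so there is something to encode; only the trim_offsets toggle matters:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, processors, trainers

# Train a throwaway ByteLevel BPE so that encode() has a vocabulary to work with.
tok = Tokenizer(models.BPE())
tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
trainer = trainers.BpeTrainer(
    vocab_size=300,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
tok.train_from_iterator(["Hello world"] * 20, trainer)

# Encode the same text with trim_offsets off and on.
for trim in (False, True):
    tok.post_processor = processors.ByteLevel(trim_offsets=trim)
    enc = tok.encode("Hello world")
    print(f"trim_offsets={trim}:", list(zip(enc.tokens, enc.offsets)))

# If I read the docs correctly, the span of "Ġworld" should change from
# (5, 11) (covering the leading space) to (6, 11) (space excluded),
# while the token string itself stays "Ġworld".
```

In other words, does trim_offsets only adjust the offset spans, leaving the tokens and ids untouched?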