In the documentation of “Post-processors” in Hugging Face’s tokenizers library, many post-processors have an argument “trim_offsets”. The explanation reads:
It also takes care of trimming the offsets. By default, the ByteLevel BPE might include whitespaces in the produced tokens. If you don’t want the offsets to include these whitespaces, then this PostProcessor should be initialized with trim_offsets=True.
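For context, here is a minimal check of what I think the docs are describing, using just the ByteLevel pre-tokenizer on its own (the input string is only an example I picked):

```python
from tokenizers import pre_tokenizers

pre = pre_tokenizers.ByteLevel(add_prefix_space=False)

# The space before "world" is folded into the token itself (rendered as "Ġ"),
# and the reported span (5, 11) starts at the space character:
print(pre.pre_tokenize_str("Hello world"))
# [('Hello', (0, 5)), ('Ġworld', (5, 11))]
```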
I am still confused by the explanation. Specifically:
- Why might “the ByteLevel BPE include whitespaces in the produced tokens”? Is this specific to byte-level BPE, or does it also happen with ordinary BPE?
- What does the “trim_offsets” argument actually do? Why would we not “want the offsets to include these whitespaces”? (See the sketch after this list.)
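To make the second question concrete, here is a minimal sketch of what I am trying to understand. The tiny training corpus and the vocab_size are arbitrary values I picked just so there is something to encode; only the trim_offsets toggle matters:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, processors, trainers

# Train a throwaway ByteLevel BPE so that encode() has a vocabulary to work with.
tok = Tokenizer(models.BPE())
tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
trainer = trainers.BpeTrainer(
    vocab_size=300,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
tok.train_from_iterator(["Hello world"] * 20, trainer)

# Encode the same text with trim_offsets off and on.
for trim in (False, True):
    tok.post_processor = processors.ByteLevel(trim_offsets=trim)
    enc = tok.encode("Hello world")
    print(f"trim_offsets={trim}:", list(zip(enc.tokens, enc.offsets)))

# If I read the docs correctly, the span of "Ġworld" should change from
# (5, 11) (covering the leading space) to (6, 11) (space excluded),
# while the token string itself stays "Ġworld".
```

In other words, does trim_offsets only adjust the offset spans, leaving the tokens and ids untouched?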