WordPiece tokenizer doesn't work for long sequences

I’m trying to train a WordPiece (WPC) tokenizer from HuggingFace on long sequences. I know the tokenizer is created successfully because I can see the saved file. But when I try to encode a new sequence, the tokenizer returns only unknown tokens. If I shorten the sequence, it returns valid tokens. Encoding the same sequences with a different type of tokenizer (Unigram or BPE) gives valid results. I’m not getting any errors or warnings from the library.
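I can’t share my real data, but here is a minimal sketch of roughly what I’m doing — the toy sequences, vocab size, and file name below are just placeholders, not my actual setup:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

# Placeholder corpus standing in for my long sequences
corpus = ["ACGTACGTTTGACCA" * 200 for _ in range(100)]

# Build and train a WordPiece tokenizer
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = WordPieceTrainer(vocab_size=1000, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer=trainer)

# The saved file looks fine
tokenizer.save("wordpiece-tokenizer.json")

# Long sequence: I only get [UNK] tokens back
print(tokenizer.encode("ACGTACGTTTGACCA" * 200).tokens)

# Shortened sequence: valid tokens come back
print(tokenizer.encode("ACGTACGTTTGACCA").tokens)
```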


I have encountered the exact same issue with tokenizers==0.14.1.