I’m trying to train a WordPiece (WPC) tokenizer from HuggingFace on long sequences. I know the tokenizer is created successfully because I can inspect the saved file. When I try to encode new sequences, the tokenizer returns only unknown tokens. When I shorten the sequence, the tokenizer returns valid tokens. When I encode the same sequences with a different type of tokenizer (Unigram or BPE), the tokenizer returns valid results. I’m not getting any errors or warnings from the library.
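
Here is a minimal sketch of the behavior. It assumes each sequence contains no whitespace, so it is treated as a single "word", and it uses a hand-written toy vocabulary (hypothetical, not my real trained one) to keep the example deterministic. The `WordPiece` model in the `tokenizers` library has a `max_input_chars_per_word` parameter that defaults to 100, and a word longer than that is mapped to the unknown token. That may be what is happening here, though I have not confirmed it is the cause for my data.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece

# Toy vocabulary for illustration only (hypothetical, not a trained vocab).
vocab = {"[UNK]": 0, "a": 1, "##a": 2}

# Default model: max_input_chars_per_word is 100.
default_tok = Tokenizer(WordPiece(vocab, unk_token="[UNK]"))

# Same vocabulary, but with the per-word length limit raised.
relaxed_tok = Tokenizer(
    WordPiece(vocab, unk_token="[UNK]", max_input_chars_per_word=1000)
)

short_seq = "a" * 50    # below the default limit
long_seq = "a" * 150    # above the default limit

short_tokens = default_tok.encode(short_seq).tokens
long_tokens = default_tok.encode(long_seq).tokens
relaxed_tokens = relaxed_tok.encode(long_seq).tokens

print(short_tokens[:3])    # first pieces of the short sequence
print(long_tokens)         # what the default model returns for the long one
print(len(relaxed_tokens)) # number of pieces with the raised limit
```

If the long sequence comes back as a single unknown token with the default model but is split normally with the raised limit, the length limit would explain why shortening the sequence helps and why Unigram and BPE are unaffected.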