How to specify sequence length when using "feature-extraction"

from transformers import BertTokenizerFast, pipeline

tokenizer = BertTokenizerFast(
        "./vocab.txt",
        unk_token="<unk>",
        sep_token="</s>",
        cls_token="<s>",
        pad_token="<pad>",
        mask_token="[MASK]"
)
features = pipeline(
        "feature-extraction",
        model="./model",
        tokenizer=tokenizer
)
search_features = features(text)

# IndexError: index out of range in self

What happens if you specify model_max_length=512 when you load the tokenizer? I’d try that and do a sanity check with tokenizer(text) to make sure the truncation is working as expected.
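Something like this, for example (a rough sketch reusing the vocab file and special tokens from your snippet above; the deliberately long string is just there to exercise truncation):

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast(
        "./vocab.txt",
        unk_token="<unk>",
        sep_token="</s>",
        cls_token="<s>",
        pad_token="<pad>",
        mask_token="[MASK]",
        model_max_length=512,  # cap sequences at the model's maximum length
)

# sanity check: an overly long input should be cut down to 512 token ids
long_text = "some word " * 1000
encoded = tokenizer(long_text, truncation=True)
print(len(encoded["input_ids"]))  # expected: 512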

I tried that, but I’m still getting the following error: IndexError: index out of range in self

Ah, indeed it seems that no truncation is enabled in the base Pipeline class: transformers/base.py at 8d43c71a1ca3ad322cc45008eb66a5611f1e017e · huggingface/transformers · GitHub

One alternative would be to extract the features directly from the model, as described in this thread: Truncating sequence -- within a pipeline

This way you can enforce truncation=True with your tokenizer and pass the truncated inputs to the model.
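A minimal sketch of that approach, assuming a BERT-style model saved at ./model and reusing the tokenizer and text from the snippets above (the last hidden state is the same tensor the feature-extraction pipeline would return, just not converted to nested lists):

import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("./model")
model.eval()

# tokenize with explicit truncation so nothing exceeds the model's 512-token limit
inputs = tokenizer(
        text,
        truncation=True,
        max_length=512,
        padding=True,
        return_tensors="pt",
)

with torch.no_grad():
    outputs = model(**inputs)

# shape: (batch_size, sequence_length, hidden_size)
search_features = outputs.last_hidden_state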