poli
April 26, 2021, 8:33pm
1
from transformers import BertTokenizerFast, pipeline

tokenizer = BertTokenizerFast(
    "./vocab.txt",
    unk_token="<unk>",
    sep_token="</s>",
    cls_token="<s>",
    pad_token="<pad>",
    mask_token="[MASK]",
)
features = pipeline(
    "feature-extraction",
    model="./model",
    tokenizer=tokenizer,
)
search_features = features(text)
# IndexError: index out of range in self
lewtun
April 27, 2021, 1:08pm
2
What happens if you specify model_max_length=512 when you load the tokenizer? I'd try that and do a sanity check with tokenizer(text) to make sure the truncation is working as expected.
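For context on why the length cap matters: BERT-style models have a position-embedding table with a fixed number of rows (max_position_embeddings, typically 512), so any token at position 512 or beyond has no row to look up, which surfaces as exactly this "index out of range in self" IndexError. A toy pure-Python sketch of the mechanism (no transformers involved; the table, function, and names here are my own illustration, not library code):

```python
# Toy stand-in for BERT's position-embedding table: one row per position,
# and nothing beyond MAX_POSITIONS.
MAX_POSITIONS = 512
position_embeddings = [[0.0] * 4 for _ in range(MAX_POSITIONS)]

def embed_positions(input_ids):
    # Look up one embedding row per token position, like the model does.
    return [position_embeddings[pos] for pos in range(len(input_ids))]

long_input = list(range(600))  # 600 tokens: longer than the table

try:
    embed_positions(long_input)
except IndexError:
    # Position 512 has no embedding row -- the same failure mode the
    # pipeline reports as "index out of range in self".
    print("IndexError at position", MAX_POSITIONS)

# What truncation=True effectively does: cap the input at the table size.
truncated = long_input[:MAX_POSITIONS]
print(len(embed_positions(truncated)))  # prints 512
```

Setting model_max_length on the tokenizer and enabling truncation keeps every position inside the table, which is why the sanity check with tokenizer(text) is worth doing.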
poli
April 27, 2021, 10:12pm
3
lewtun:
model_max_length=512
I tried that, but I'm still getting the same error: IndexError: index out of range in self
lewtun
April 28, 2021, 9:42am
4
Ah, indeed — it seems that no truncation is enabled in the base Pipeline class: transformers/base.py at 8d43c71a1ca3ad322cc45008eb66a5611f1e017e · huggingface/transformers · GitHub
One alternative is to extract the features directly from the model, as described in this thread: Truncating sequence -- within a pipeline
That way you can enforce truncation=True with your tokenizer and pass the truncated inputs to the model.
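A sketch of that workaround, assuming a BERT-style checkpoint under the same ./model and ./vocab.txt paths used above (BertModel, the extract_features name, and the argument names are my own additions, not from the thread):

```python
def extract_features(model_dir, vocab_file, text, max_length=512):
    """Tokenize with truncation enabled and call the model directly,
    bypassing the feature-extraction pipeline that skips truncation."""
    # Imports live inside the function only so this sketch reads as one
    # self-contained unit; in a real script they would sit at the top.
    import torch
    from transformers import BertModel, BertTokenizerFast

    # Same tokenizer setup as the original post, plus model_max_length.
    tokenizer = BertTokenizerFast(
        vocab_file,
        unk_token="<unk>",
        sep_token="</s>",
        cls_token="<s>",
        pad_token="<pad>",
        mask_token="[MASK]",
        model_max_length=max_length,
    )
    model = BertModel.from_pretrained(model_dir)
    model.eval()

    # truncation=True caps the input at model_max_length tokens -- the
    # step the pipeline was not performing.
    inputs = tokenizer(text, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # One hidden-state vector per (truncated) token, analogous to what
    # the feature-extraction pipeline would have returned.
    return outputs.last_hidden_state
```

Since truncation happens in the tokenizer call rather than inside the pipeline, inputs longer than 512 tokens are clipped before they ever reach the position embeddings.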