poli
April 26, 2021, 8:33pm
1
from transformers import BertTokenizerFast, pipeline

tokenizer = BertTokenizerFast(
    "./vocab.txt",
    unk_token="<unk>",
    sep_token="</s>",
    cls_token="<s>",
    pad_token="<pad>",
    mask_token="[MASK]",
)
features = pipeline(
    "feature-extraction",
    model="./model",
    tokenizer=tokenizer,
)
search_features = features(text)
# IndexError: index out of range in self
lewtun
April 27, 2021, 1:08pm
2
What happens if you specify model_max_length=512 when you load the tokenizer? I'd try that and do a sanity check with tokenizer(text) to make sure the truncation is working as expected.
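For context on why the length cap matters: BERT-style models have a position-embedding table with a fixed number of rows (max_position_embeddings, typically 512), so any token at position 512 or beyond has no row to look up, which surfaces as exactly this "index out of range in self" IndexError. A toy pure-Python sketch of the mechanism (no transformers involved; the table, function, and names here are my own illustration, not library code):

```python
# Toy stand-in for BERT's position-embedding table: one row per position,
# and nothing beyond MAX_POSITIONS.
MAX_POSITIONS = 512
position_embeddings = [[0.0] * 4 for _ in range(MAX_POSITIONS)]

def embed_positions(input_ids):
    # Look up one embedding row per token position, like the model does.
    return [position_embeddings[pos] for pos in range(len(input_ids))]

long_input = list(range(600))  # 600 tokens: longer than the table

try:
    embed_positions(long_input)
except IndexError:
    # Position 512 has no embedding row -- the same failure mode the
    # pipeline reports as "index out of range in self".
    print("IndexError at position", MAX_POSITIONS)

# What truncation=True effectively does: cap the input at the table size.
truncated = long_input[:MAX_POSITIONS]
print(len(embed_positions(truncated)))  # prints 512
```

Setting model_max_length on the tokenizer and enabling truncation keeps every position inside the table, which is why the sanity check with tokenizer(text) is worth doing.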
poli
April 27, 2021, 10:12pm
3
lewtun:
model_max_length=512
I tried that, but I'm still getting the same error: IndexError: index out of range in self
lewtun
April 28, 2021, 9:42am
4
Ah, indeed — it seems that no truncation is enabled in the base Pipeline class: transformers/base.py at 8d43c71a1ca3ad322cc45008eb66a5611f1e017e · huggingface/transformers · GitHub
One alternative is to extract the features directly from the model, as described in this thread: Truncating sequence -- within a pipeline
That way you can enforce truncation=True with your tokenizer and pass the truncated inputs to the model.
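A sketch of that workaround, assuming a BERT-style checkpoint under the same ./model and ./vocab.txt paths used above (BertModel, the extract_features name, and the argument names are my own additions, not from the thread):

```python
def extract_features(model_dir, vocab_file, text, max_length=512):
    """Tokenize with truncation enabled and call the model directly,
    bypassing the feature-extraction pipeline that skips truncation."""
    # Imports live inside the function only so this sketch reads as one
    # self-contained unit; in a real script they would sit at the top.
    import torch
    from transformers import BertModel, BertTokenizerFast

    # Same tokenizer setup as the original post, plus model_max_length.
    tokenizer = BertTokenizerFast(
        vocab_file,
        unk_token="<unk>",
        sep_token="</s>",
        cls_token="<s>",
        pad_token="<pad>",
        mask_token="[MASK]",
        model_max_length=max_length,
    )
    model = BertModel.from_pretrained(model_dir)
    model.eval()

    # truncation=True caps the input at model_max_length tokens -- the
    # step the pipeline was not performing.
    inputs = tokenizer(text, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # One hidden-state vector per (truncated) token, analogous to what
    # the feature-extraction pipeline would have returned.
    return outputs.last_hidden_state
```

Since truncation happens in the tokenizer call rather than inside the pipeline, inputs longer than 512 tokens are clipped before they ever reach the position embeddings.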