When the input text is longer than 512 tokens, the pipeline throws an "index out of range in self" error, as shown below:
from transformers import (
    AutoModelForSequenceClassification,
    PreTrainedTokenizerFast,
    TextClassificationPipeline,
)

model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = PreTrainedTokenizerFast(tokenizer_file=f"{model_name}/tokenizer.json")
pipeline = TextClassificationPipeline(
    model=model,
    tokenizer=tokenizer,
)
result = pipeline(input)
result
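For reference, the failing input really is longer than 512 tokens. A quick check like the one below (just a sketch, reusing the tokenizer and input from above) prints the raw token count:

# Rough sanity check: count the tokens without any truncation applied
encoded = tokenizer(input)
print(len(encoded["input_ids"]))  # I expect this to be well above 512 for the failing input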
But if I add a device to the pipeline, it runs fine even though the input is still the same:
import torch

model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = PreTrainedTokenizerFast(tokenizer_file=f"{model_name}/tokenizer.json")
pipeline = TextClassificationPipeline(
    model=model,
    tokenizer=tokenizer,
    device=torch.device("mps"),
)
result = pipeline(input)
result
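If it matters, my understanding is that the device argument just moves the model onto the MPS device, roughly like this (my assumption about what the pipeline does, not something I verified):

model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.to(torch.device("mps"))  # what I assume device=torch.device("mps") does internally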
To make it work without changing the device, I have to specify truncation and max_length in the pipeline call. But I have already specified these parameters in tokenizer.json, haven't I?
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = PreTrainedTokenizerFast(tokenizer_file=f"{model_name}/tokenizer.json")
pipeline = TextClassificationPipeline(
    model=model,
    tokenizer=tokenizer,
)
result = pipeline(input, truncation=True, max_length=512)
result
I have truncation and padding enabled in the tokenizer.json file:
{
  "version": "1.0",
  "truncation": {
    "direction": "Right",
    "max_length": 512,
    "strategy": "LongestFirst",
    "stride": 0
  },
  "padding": {
    "strategy": {
      "Fixed": 512
    },
    "direction": "Right",
    "pad_to_multiple_of": null,
    "pad_id": 0,
    "pad_type_id": 0,
    "pad_token": "[PAD]"
  },
  "added_tokens": [],
  "normalizer": null,
  "pre_tokenizer": {
    "type": "ByteLevel",
    "add_prefix_space": false,
    "trim_offsets": true,
    "use_regex": true
  },
  "post_processor": {
    "type": "BertProcessing",
    "sep": [
      "</s>",
      2
    ],
    "cls": [
      "<s>",
      0
    ]
  },
  "decoder": {
    "type": "ByteLevel",
    "add_prefix_space": true,
    "trim_offsets": true,
    "use_regex": true
  },
}
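To double-check that these settings really come from the file the tokenizer loads, the backend tokenizer can be inspected directly. This is just a rough sketch, and I am not sure these are the right attributes to look at:

from tokenizers import Tokenizer

backend = Tokenizer.from_file(f"{model_name}/tokenizer.json")
print(backend.truncation)          # should show the truncation block from the JSON above
print(tokenizer.model_max_length)  # I assume the transformers wrapper falls back to a huge default here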
I am completely new to this, and I am wondering why device("mps") makes it work, or why truncation is not applied as specified in the JSON file. Any help would be greatly appreciated.