When the input text is longer than 512 tokens, the pipeline throws an "index out of range in self" error, as shown below:
from transformers import (
    AutoModelForSequenceClassification,
    PreTrainedTokenizerFast,
    TextClassificationPipeline,
)

model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = PreTrainedTokenizerFast(tokenizer_file=f"{model_name}/tokenizer.json")
pipeline = TextClassificationPipeline(
    model=model,
    tokenizer=tokenizer,
)
result = pipeline(input)
result
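For reference, the failing input really is longer than 512 tokens. A quick check like the one below (just a sketch, reusing the tokenizer and input from above) prints the raw token count:

# Rough sanity check: count the tokens without any truncation applied
encoded = tokenizer(input)
print(len(encoded["input_ids"]))  # I expect this to be well above 512 for the failing input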
But if I add a device to the pipeline, it runs fine even though the input is still the same:
import torch

model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = PreTrainedTokenizerFast(tokenizer_file=f"{model_name}/tokenizer.json")
pipeline = TextClassificationPipeline(
    model=model,
    tokenizer=tokenizer,
    device=torch.device("mps"),
)
result = pipeline(input)
result
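If it matters, my understanding is that the device argument just moves the model onto the MPS device, roughly like this (my assumption about what the pipeline does, not something I verified):

model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.to(torch.device("mps"))  # what I assume device=torch.device("mps") does internally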
To make it work without changing the device, I have to specify truncation and max_length in the pipeline call. But I have already specified these parameters in tokenizer.json, haven't I?
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = PreTrainedTokenizerFast(tokenizer_file=f"{model_name}/tokenizer.json")
pipeline = TextClassificationPipeline(
    model=model,
    tokenizer=tokenizer,
)
result = pipeline(input, truncation=True, max_length=512)
result
I have truncation and padding enabled in the tokenizer.json file:
{
  "version": "1.0",
  "truncation": {
    "direction": "Right",
    "max_length": 512,
    "strategy": "LongestFirst",
    "stride": 0
  },
  "padding": {
    "strategy": {
      "Fixed": 512
    },
    "direction": "Right",
    "pad_to_multiple_of": null,
    "pad_id": 0,
    "pad_type_id": 0,
    "pad_token": "[PAD]"
  },
  "added_tokens": [],
  "normalizer": null,
  "pre_tokenizer": {
    "type": "ByteLevel",
    "add_prefix_space": false,
    "trim_offsets": true,
    "use_regex": true
  },
  "post_processor": {
    "type": "BertProcessing",
    "sep": [
      "</s>",
      2
    ],
    "cls": [
      "<s>",
      0
    ]
  },
  "decoder": {
    "type": "ByteLevel",
    "add_prefix_space": true,
    "trim_offsets": true,
    "use_regex": true
  },
}
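To double-check that these settings really come from the file the tokenizer loads, the backend tokenizer can be inspected directly. This is just a rough sketch, and I am not sure these are the right attributes to look at:

from tokenizers import Tokenizer

backend = Tokenizer.from_file(f"{model_name}/tokenizer.json")
print(backend.truncation)          # should show the truncation block from the JSON above
print(tokenizer.model_max_length)  # I assume the transformers wrapper falls back to a huge default here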
I am completely new to this, and I am wondering why device("mps") makes it work, or why truncation is not applied as specified in the JSON file. Any help would be greatly appreciated.