Pipeline's tokenizer vs training tokenizer

Hello, I am trying to create a pipeline from a trained model. From what I understand, I need to provide a tokenizer so that my new input will be tokenised. I guess it should look like this:

from transformers import pipeline, AutoModelForSequenceClassification, AutoTokenizer
model_name = "TestModel"

model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer, return_all_scores=True)

My question is: where do the other steps of the tokenisation process take place, like padding and truncation? During training, my sequences were processed as follows:

train_encodings = tokenizer(seq_train, truncation=True, padding=True,
                            max_length=1024, return_tensors="pt")

Is that no longer needed?

The pipeline does the tokenisation for you; that's why you pass in a trained model and its tokeniser. Basically, as I understand it, the pipeline implementations simply reduce the amount of code you have to write for common use cases.
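For example, in recent transformers versions the text-classification pipeline forwards extra call-time keyword arguments to its tokenizer, so something along these lines should reproduce the truncation behaviour you used during training. Treat it as a sketch: "TestModel" and max_length=1024 are just the placeholders from your post, and older versions may not accept tokenizer kwargs at call time.

from transformers import pipeline, AutoModelForSequenceClassification, AutoTokenizer

model_name = "TestModel"  # placeholder name from the question
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer, return_all_scores=True)

# The pipeline tokenises the raw strings internally; in recent versions,
# extra kwargs such as truncation/max_length are passed through to the tokenizer.
preds = classifier(["first sequence", "second sequence"], truncation=True, max_length=1024)
print(preds)  # with return_all_scores=True: a list of label/score dicts per input

So you no longer call the tokenizer yourself; you just hand the pipeline raw text and, if needed, the same truncation settings you trained with.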
