Hello, I am trying to create a pipeline from a trained model. From what I understand, I need to provide a tokenizer so that my new input will be tokenised. I guess it should look like this:
from transformers import pipeline, AutoModelForSequenceClassification, AutoTokenizer
model_name = "TestModel"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer, return_all_scores=True)
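If my understanding of the pipeline output is right, I would then call it like this and get a score for every label (the label names here are just placeholders):

results = classifier("example input text")
# with return_all_scores=True I expect something like:
# [[{'label': 'LABEL_0', 'score': 0.1}, {'label': 'LABEL_1', 'score': 0.9}]]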
My question is: where do the other steps of the tokenisation process, like padding and truncation, take place? During training, my sequences were processed as follows:
train_encodings = tokenizer(seq_train, truncation=True, padding=True,
                            max_length=1024, return_tensors="pt")
Is that no longer needed?
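Or do I pass those arguments when I actually call the pipeline? I am assuming (not certain) that the text-classification pipeline forwards extra keyword arguments to its tokenizer, so something like this would be the equivalent of what I did during training:

# assumption on my part: these kwargs get passed through to the tokenizer
results = classifier("some long input text",
                     truncation=True, padding=True, max_length=1024)

Is that the right way to do it, or does the pipeline handle padding and truncation on its own?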