Hi @cyrilw,
Which model are you using to tokenize your data? Could you share the code needed to reproduce the error?
According to the documentation (Handling multiple sequences - Hugging Face Course), when a sequence is longer than the model's limit, one solution is to truncate your sentences, as you said.
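As a minimal sketch of that (the checkpoint here is just an example, not necessarily the one you're using), passing `truncation=True` cuts the input down to the model's maximum length:

```python
from transformers import AutoTokenizer

# Example checkpoint for illustration only; replace with your own model.
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

long_text = "This sentence is repeated many times. " * 500  # longer than the model limit

# truncation=True clips the sequence to tokenizer.model_max_length (512 for this checkpoint).
inputs = tokenizer(long_text, truncation=True, return_tensors="pt")
print(inputs["input_ids"].shape)  # at most (1, 512)
```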
I don’t know what your task is, but here is an example of training a causal language model (Training a causal language model from scratch - Hugging Face Course); the section Preparing the dataset addresses the problem of working with large contexts.
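Roughly, the idea there is to split long documents into fixed-size chunks instead of throwing away everything past the limit. A sketch along those lines (the `context_length` value and dataset column name are assumptions for illustration):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
context_length = 128  # chunk size chosen for illustration

def tokenize(batch):
    outputs = tokenizer(
        batch["text"],                    # assumes your dataset has a "text" column
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,   # keep the extra chunks instead of dropping them
        return_length=True,
    )
    # Keep only chunks that fill the whole context window.
    input_batch = [
        ids
        for length, ids in zip(outputs["length"], outputs["input_ids"])
        if length == context_length
    ]
    return {"input_ids": input_batch}

# Usage with a datasets.Dataset:
# tokenized = raw_dataset.map(tokenize, batched=True, remove_columns=raw_dataset.column_names)
```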