Hi @cyrilw,
Which model are you using to tokenize your data? Could you share the code needed to reproduce the error?
According to the documentation (Handling multiple sequences - Hugging Face Course), when a sequence is longer than the model's limit, one solution is to truncate your sentences, as you said.
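As a minimal sketch of that (the checkpoint here is just an example, not necessarily the one you're using), passing `truncation=True` cuts the input down to the model's maximum length:

```python
from transformers import AutoTokenizer

# Example checkpoint for illustration only; replace with your own model.
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

long_text = "This sentence is repeated many times. " * 500  # longer than the model limit

# truncation=True clips the sequence to tokenizer.model_max_length (512 for this checkpoint).
inputs = tokenizer(long_text, truncation=True, return_tensors="pt")
print(inputs["input_ids"].shape)  # at most (1, 512)
```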
I don’t know what your task is, but here is an example of training a causal language model (Training a causal language model from scratch - Hugging Face Course); the section Preparing the dataset addresses the problem of working with large contexts.
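Roughly, the idea there is to split long documents into fixed-size chunks instead of throwing away everything past the limit. A sketch along those lines (the `context_length` value and dataset column name are assumptions for illustration):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
context_length = 128  # chunk size chosen for illustration

def tokenize(batch):
    outputs = tokenizer(
        batch["text"],                    # assumes your dataset has a "text" column
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,   # keep the extra chunks instead of dropping them
        return_length=True,
    )
    # Keep only chunks that fill the whole context window.
    input_batch = [
        ids
        for length, ids in zip(outputs["length"], outputs["input_ids"])
        if length == context_length
    ]
    return {"input_ids": input_batch}

# Usage with a datasets.Dataset:
# tokenized = raw_dataset.map(tokenize, batched=True, remove_columns=raw_dataset.column_names)
```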