Let’s say my training dataset is just one super long string. What is the correct way to tokenize this?
I have this so far:
```python
trainenc = tokenizer(train_dataset['text'], return_tensors='pt', max_length=128, truncation=True, padding=True, return_overflowing_tokens=True)
```
Which of these arguments should I keep? Afterwards, how do I split my long list of tokens into batches where each element of the batch is short enough to fit in the model's context window?
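Is the right idea to tokenize the whole string once with no truncation or padding, and then reshape the flat list of token ids into fixed-length blocks myself? Here is a rough sketch of what I had in mind (I'm using the GPT-2 tokenizer and block_size=128 purely as placeholders for my actual setup):

```python
from transformers import AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder for my real tokenizer

long_text = train_dataset["text"]  # my one super long training string

# Tokenize the whole string once, with no truncation and no padding.
# (This warns that the sequence exceeds the model max length, which I assume is fine here?)
ids = tokenizer(long_text)["input_ids"]

block_size = 128                      # placeholder for the model's real context length
n_blocks = len(ids) // block_size     # drop the leftover tokens at the end

# Reshape the flat token list into shape (n_blocks, block_size)
input_ids = torch.tensor(
    [ids[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]
)
print(input_ids.shape)  # expecting (n_blocks, 128)
```

Is this the intended approach, or should I be relying on the tokenizer's own arguments (like return_overflowing_tokens) to do the chunking for me?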
Thanks