The output of this last command is completely raw (i.e. untokenized) text (like 'lorem ipsum…'), which is expected since I didn't call tokenizer.tokenize.
So does anyone have an idea how to get the input tokenized as well? I tried a few obvious approaches, but none of them worked.
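For context, the usual pattern is to apply the tokenizer to the dataset with a map step before training. Here is a minimal sketch of that pattern; the toy whitespace tokenizer is a stand-in I made up so the snippet runs without downloading anything, and in practice you would use the real tokenizer (e.g. from AutoTokenizer.from_pretrained) inside tokenize_batch instead:

```python
def toy_tokenizer(text, truncation=True, max_length=8):
    # Toy stand-in tokenizer: maps each word to an id via a running vocab.
    # A real 🤗 tokenizer call would be: tokenizer(text, truncation=True, max_length=...)
    vocab = {}
    ids = [vocab.setdefault(w, len(vocab)) for w in text.split()]
    return {"input_ids": ids[:max_length] if truncation else ids}

def tokenize_batch(examples):
    # This is the function you'd pass to dataset.map(..., batched=True):
    # it receives a batch of examples and returns new columns with token ids.
    return {"input_ids": [toy_tokenizer(t)["input_ids"] for t in examples["text"]]}

batch = {"text": ["lorem ipsum dolor", "lorem ipsum"]}
print(tokenize_batch(batch)["input_ids"])  # → [[0, 1, 2], [0, 1]]
```

With the real library the same shape of function is passed as dataset.map(tokenize_batch, batched=True), so the model sees input_ids rather than raw strings.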
If you can't reproduce the issue with the pre-trained Longformer (base) tokenizer, I can share my model, though I doubt that would change anything, since I used that same tokenizer.
So my tokenizer does tokenize the input correctly, right? If so, the problem must be in the fine-tuning code. Would you have a clue as to where it could be?
Previously, before switching to datasets, I was able to fine-tune with the same code.
I'm getting: IndexError: index out of range in self
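For what it's worth, that particular IndexError usually comes from an embedding lookup receiving an index that is at least as large as the embedding table, e.g. a token id outside the vocabulary or a sequence longer than the model's position embeddings when truncation is off. A minimal stand-in (plain Python lists instead of a real embedding layer, just to make the failure mode runnable) looks like this:

```python
def embed(ids, table):
    # Stand-in for an embedding lookup: one table row per id.
    # Raises IndexError as soon as any id >= len(table).
    return [table[i] for i in ids]

table = [[0.0] * 4 for _ in range(4096)]  # e.g. 4096 position embeddings

embed(list(range(4096)), table)  # fine: positions 0..4095 all exist

try:
    embed(list(range(4100)), table)  # too long: position 4096 has no row
except IndexError:
    print("IndexError: index out of range")
```

So it may be worth checking that the tokenization step actually ran (so the model isn't indexing with garbage) and that truncation/max_length is set to the model's maximum sequence length.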