I was trying to fine-tune the “facebook/bart-base” model on a Telugu-language dataset. I trained a custom tokenizer and used it, but the results are still not even close to correct. Here’s the Space:
What should I do? Does the BART model only work well for European and Slavic languages?
This is the script:
The documentation for facebook/bart-base says that BART was pre-trained on English. I think fine-tuning is not enough; you would have to pre-train the model for Telugu.
Yes, it’s not possible to train a custom tokenizer and then use it with a model that already has a fixed vocabulary. If you train a new tokenizer, you also need to train the model from scratch, because the model’s embeddings are tied to the token ids of the original vocabulary.
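To see why this fails, here is a toy illustration (plain Python, not real BART code, and the vocabularies are made up): the model’s embedding table is indexed by token id, and those ids only mean what the model learned under the original vocabulary.

```python
# Toy illustration of the tokenizer/model mismatch. Not real BART code:
# the vocabularies and "embedding table" below are invented for the demo.

original_vocab = {"<s>": 0, "hello": 1, "world": 2}        # vocab the model was trained with
custom_vocab   = {"<s>": 0, "నమస్కారం": 1, "ప్రపంచం": 2}     # new custom Telugu tokenizer

# Pretend embedding table learned during pre-training:
# row i is the vector the model associates with ORIGINAL token id i.
embeddings = {0: "vec(<s>)", 1: "vec(hello)", 2: "vec(world)"}

# Encoding Telugu text with the custom tokenizer produces id 1,
# but the pretrained model still interprets row 1 as "hello".
telugu_id = custom_vocab["నమస్కారం"]
print(embeddings[telugu_id])  # → vec(hello): the wrong learned representation
```

So the new tokenizer’s ids silently look up embeddings learned for completely different (English) tokens, which is why the outputs come out as garbage rather than merely low-quality Telugu.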
BART itself is indeed an English model, so I’d recommend starting from a different, multilingual pre-trained model to fine-tune on Telugu, such as M2M100 or NLLB.
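A minimal sketch of the NLLB route, assuming the public `facebook/nllb-200-distilled-600M` checkpoint; NLLB uses FLORES-200 language codes, and `tel_Telu` is the code for Telugu. The function name is just for the sketch:

```python
# Sketch: loading an NLLB-200 checkpoint for Telugu fine-tuning instead
# of facebook/bart-base. Checkpoint name is one public option; adjust as needed.

TELUGU_CODE = "tel_Telu"  # FLORES-200 language code used by NLLB-200

def load_nllb_for_telugu(checkpoint: str = "facebook/nllb-200-distilled-600M"):
    # Lazy import so the sketch can be read without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    # src_lang/tgt_lang tell the tokenizer which language tags to prepend.
    tokenizer = AutoTokenizer.from_pretrained(
        checkpoint, src_lang=TELUGU_CODE, tgt_lang=TELUGU_CODE
    )
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
    return tokenizer, model
```

Because NLLB’s tokenizer already covers Telugu, you can fine-tune directly on your dataset without training a new tokenizer or pre-training from scratch.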