[Beginner] fine-tune Bart with custom dataset in other language?

@valhalla @sshleifer
Hi, I’m new to seq2seq models, and I want to fine-tune Bart/T5 for a summarization task. There are already some documents covering the fine-tuning procedure.
Such as
https://github.com/huggingface/transformers/tree/master/examples/seq2seq
https://ohmeow.github.io/blurr/modeling-seq2seq-summarization/
And also thanks for the distilbart version.

But my custom dataset is in Japanese, so fine-tuning Bart directly probably won’t work. Is it necessary to train a new BPE tokenizer on Japanese data? I don’t know how to do that.
A second option is to use an existing Japanese tokenizer such as bert-japanese, but can I just plug it into Bart? How would I need to modify it?
A third option is a multilingual model like MBart or MT5, which I haven’t tested yet. Could I simply fine-tune one of them on my Japanese dataset?

Please forgive me if this is a stupid question. Thanks in advance.


Hi @HeroadZ

Bart is trained on English, so I don’t think fine-tuning it directly will help. If you want to train a model from scratch in a new language, then yes, you should train a new tokenizer. To do that, check out the tokenizers library.
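As a rough illustration, here is a minimal sketch of training a byte-level BPE tokenizer (the kind Bart uses) with the tokenizers library. The corpus file name, vocab size, and special tokens are placeholder assumptions; for Japanese you might also consider a SentencePiece/unigram tokenizer instead.

```python
# Minimal sketch: train a byte-level BPE tokenizer on raw Japanese text.
# "japanese_corpus.txt", the vocab size, and the special tokens are
# placeholder assumptions; adapt them to your own corpus and model config.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["japanese_corpus.txt"],
    vocab_size=32000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("japanese-bpe-tokenizer")  # writes vocab.json and merges.txt
```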

And both MBart and MT5 support Japanese, so those would be a good starting point.
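For example, a quick sketch of what feeding a Japanese article/summary pair to MBart could look like (the facebook/mbart-large-cc25 checkpoint and the toy sentences are just assumptions; in practice you would wrap this in the seq2seq example script or your own training loop):

```python
# Sketch: tokenize a Japanese article/summary pair for MBart fine-tuning.
# The checkpoint name and the toy texts are assumptions for illustration.
from transformers import MBartForConditionalGeneration, MBartTokenizer

checkpoint = "facebook/mbart-large-cc25"
tokenizer = MBartTokenizer.from_pretrained(checkpoint, src_lang="ja_XX", tgt_lang="ja_XX")
model = MBartForConditionalGeneration.from_pretrained(checkpoint)

article = "ここに要約したい日本語の本文が入ります。"  # source document
summary = "短い要約。"                                # target summary

inputs = tokenizer(article, return_tensors="pt", truncation=True, max_length=512)
with tokenizer.as_target_tokenizer():
    labels = tokenizer(summary, return_tensors="pt", truncation=True, max_length=64)

# The loss from this forward pass is what a fine-tuning loop would minimize.
loss = model(**inputs, labels=labels.input_ids).loss
```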

Another option is to leverage a language-specific, encoder-only BERT model (in your case bert-japanese) to create a seq2seq model using the EncoderDecoder framework. See this notebook to learn more about EncoderDecoder models:

Leverage BERT for Encoder-Decoder Summarization on CNN/Dailymail
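In case it helps, a minimal sketch of warm-starting such a bert2bert model from a Japanese BERT checkpoint could look like this (the cl-tohoku/bert-base-japanese checkpoint is an assumption, and its tokenizer needs the fugashi/MeCab dependencies installed):

```python
# Sketch: warm-start an encoder-decoder (bert2bert) model from bert-japanese.
# The checkpoint name is an assumption; any Japanese BERT checkpoint would do.
from transformers import AutoTokenizer, EncoderDecoderModel

checkpoint = "cl-tohoku/bert-base-japanese"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Initialize both the encoder and the decoder from the same BERT weights.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(checkpoint, checkpoint)

# The decoder needs generation-related special tokens set explicitly.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.pad_token_id = tokenizer.pad_token_id
```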


Thanks for the quick reply!

The last bert2bert is amazing!
I will try these methods. Thank you very much.