Hi, I’m new to seq2seq models, and I want to fine-tune Bart/T5 for summarization. I found some documentation on the fine-tuning procedure.
And also thanks for the distilbart version.
But my custom dataset is in Japanese, so fine-tuning directly might be impossible. Is it necessary to train a new BPE tokenizer on Japanese data? If so, I don’t know how to do it.
The second way is to use an existing Japanese tokenizer like bert-japanese, but could I just use it for Bart? How would I modify it?
The third way is to use a multilingual model like MBart or MT5. I haven’t tested it. Could I just fine-tune them on the Japanese dataset?
Please forgive me if this is a stupid question. Thanks in advance.
Bart is trained on English, so I don’t think fine-tuning it directly will help. If you want to train a model from scratch in a new language, then yes, you should train a new tokenizer. To train a new tokenizer, check out the tokenizers library.
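For reference, here is a minimal sketch of training a tokenizer with the tokenizers library. It assumes a byte-level BPE (the style Bart uses), which avoids needing a Japanese word segmenter because it works on raw bytes. The function name `train_byte_level_bpe`, the special-token list, and the vocabulary size are illustrative choices, not something prescribed in this thread.

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Assumed special tokens, mirroring the Bart/RoBERTa convention.
SPECIAL_TOKENS = ["<s>", "<pad>", "</s>", "<unk>", "<mask>"]


def train_byte_level_bpe(texts, vocab_size=32000):
    """Train a byte-level BPE tokenizer on an iterable of raw strings."""
    tokenizer = Tokenizer(models.BPE())
    # Byte-level pre-tokenization maps every byte to a printable symbol,
    # so any Japanese character can be represented without an <unk>.
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
    tokenizer.decoder = decoders.ByteLevel()
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=SPECIAL_TOKENS,
        # Seed the vocabulary with all 256 byte symbols.
        initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
        show_progress=False,
    )
    tokenizer.train_from_iterator(texts, trainer=trainer)
    return tokenizer
```

You would call it with an iterator over your Japanese documents and then save the result with `tokenizer.save("tokenizer.json")`. Another algorithm (for example a Unigram model) may suit Japanese just as well; this is only one possible setup.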
And both MBart and MT5 support Japanese, so either would be a good starting point.
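A rough sketch of what the data preparation for MT5 could look like is below. The column names `document` and `summary`, the length limits, and the `google/mt5-small` checkpoint are assumptions for illustration; swap in whatever your dataset and hardware call for. The helper names `preprocess` and `load_mt5` are hypothetical too.

```python
def preprocess(batch, tokenizer, max_source_len=512, max_target_len=64):
    """Tokenize documents and summaries for a seq2seq model such as MT5.

    `batch` is a dict of lists with the assumed keys "document" and "summary".
    """
    model_inputs = tokenizer(
        batch["document"], max_length=max_source_len, truncation=True
    )
    # `text_target` tells the tokenizer these strings are decoder targets.
    labels = tokenizer(
        text_target=batch["summary"], max_length=max_target_len, truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


def load_mt5(checkpoint="google/mt5-small"):
    """Load the multilingual tokenizer and model (downloads the weights)."""
    # Imported here so `preprocess` can be used without loading transformers.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
    return tokenizer, model
```

With a datasets `Dataset`, something like `dataset.map(lambda b: preprocess(b, tokenizer), batched=True)` should produce the tokenized features, which can then go to `Seq2SeqTrainer` together with `DataCollatorForSeq2Seq` for padding.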
Another option is to leverage a language-specific encoder-only BERT model (in your case bert-japanese) to create a seq2seq model using the EncoderDecoder framework. See this notebook to learn more: Leverage BERT for Encoder-Decoder Summarization on CNN/Dailymail
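As a sketch of the EncoderDecoder idea, the snippet below warm-starts both the encoder and the decoder from one BERT checkpoint and sets the special tokens that generation needs. The checkpoint name `cl-tohoku/bert-base-japanese` is an assumption about which bert-japanese model is meant (its tokenizer may need extra MeCab-related packages installed), and the helper names `build_bert2bert` and `configure_special_tokens` are hypothetical.

```python
def build_bert2bert(checkpoint="cl-tohoku/bert-base-japanese"):
    """Create a BERT-to-BERT seq2seq model from one pretrained checkpoint."""
    # Imported here so the config helper below has no heavy dependencies.
    from transformers import AutoTokenizer, EncoderDecoderModel

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    # The decoder copy gets freshly initialized cross-attention layers,
    # so the combined model still has to be fine-tuned before use.
    model = EncoderDecoderModel.from_encoder_decoder_pretrained(
        checkpoint, checkpoint
    )
    return tokenizer, model


def configure_special_tokens(model, tokenizer):
    """Tell the seq2seq model which BERT tokens start, end and pad a target."""
    model.config.decoder_start_token_id = tokenizer.cls_token_id
    model.config.eos_token_id = tokenizer.sep_token_id
    model.config.pad_token_id = tokenizer.pad_token_id
    return model
```

Typical use would be `tokenizer, model = build_bert2bert()` followed by `configure_special_tokens(model, tokenizer)`, after which the model can be fine-tuned on document/summary pairs like any other seq2seq model.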
Thanks for the quick reply!
The last bert2bert is amazing!
I will try these methods. Thank you very much.