[Beginner] fine-tune Bart with custom dataset in other language?

@valhalla @sshleifer
Hi, I’m new to seq2seq models, and I want to fine-tune Bart/T5 for a summarization task. There are already some documents covering the fine-tuning procedure.
Such as
https://github.com/huggingface/transformers/tree/master/examples/seq2seq
https://ohmeow.github.io/blurr/modeling-seq2seq-summarization/
And also thanks for the distilbart version.

But my custom dataset is in Japanese, so fine-tuning Bart directly probably won’t work. Is it necessary to train a new BPE tokenizer on Japanese data? I don’t know how to do that.
A second option is to use an existing Japanese tokenizer such as bert-japanese, but can I just plug it into Bart? How would I need to modify it?
A third option is a multilingual model like MBart or MT5, which I haven’t tested yet. Could I simply fine-tune one of them on my Japanese dataset?

Please forgive me if this is a stupid question. Thanks in advance.


Hi @HeroadZ

Bart is trained on English, so I don’t think fine-tuning it directly will help. If you want to train a model from scratch in a new language, then yes, you should train a new tokenizer. To do that, check out the tokenizers library.
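As a rough illustration, here is a minimal sketch of training a byte-level BPE tokenizer (the kind Bart uses) with the tokenizers library. The corpus file name, vocab size, and special tokens are placeholder assumptions; for Japanese you might also consider a SentencePiece/unigram tokenizer instead.

```python
# Minimal sketch: train a byte-level BPE tokenizer on raw Japanese text.
# "japanese_corpus.txt", the vocab size, and the special tokens are
# placeholder assumptions; adapt them to your own corpus and model config.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["japanese_corpus.txt"],
    vocab_size=32000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("japanese-bpe-tokenizer")  # writes vocab.json and merges.txt
```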

And both MBart and MT5 support Japanese, so those would be a good starting point.
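For example, a quick sketch of what feeding a Japanese article/summary pair to MBart could look like (the facebook/mbart-large-cc25 checkpoint and the toy sentences are just assumptions; in practice you would wrap this in the seq2seq example script or your own training loop):

```python
# Sketch: tokenize a Japanese article/summary pair for MBart fine-tuning.
# The checkpoint name and the toy texts are assumptions for illustration.
from transformers import MBartForConditionalGeneration, MBartTokenizer

checkpoint = "facebook/mbart-large-cc25"
tokenizer = MBartTokenizer.from_pretrained(checkpoint, src_lang="ja_XX", tgt_lang="ja_XX")
model = MBartForConditionalGeneration.from_pretrained(checkpoint)

article = "ここに要約したい日本語の本文が入ります。"  # source document
summary = "短い要約。"                                # target summary

inputs = tokenizer(article, return_tensors="pt", truncation=True, max_length=512)
with tokenizer.as_target_tokenizer():
    labels = tokenizer(summary, return_tensors="pt", truncation=True, max_length=64)

# The loss from this forward pass is what a fine-tuning loop would minimize.
loss = model(**inputs, labels=labels.input_ids).loss
```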

Another option is to leverage a language-specific, encoder-only BERT model (in your case bert-japanese) to create a seq2seq model using the EncoderDecoder framework. See this notebook to learn more about EncoderDecoder models:

Leverage BERT for Encoder-Decoder Summarization on CNN/Dailymail
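In case it helps, a minimal sketch of warm-starting such a bert2bert model from a Japanese BERT checkpoint could look like this (the cl-tohoku/bert-base-japanese checkpoint is an assumption, and its tokenizer needs the fugashi/MeCab dependencies installed):

```python
# Sketch: warm-start an encoder-decoder (bert2bert) model from bert-japanese.
# The checkpoint name is an assumption; any Japanese BERT checkpoint would do.
from transformers import AutoTokenizer, EncoderDecoderModel

checkpoint = "cl-tohoku/bert-base-japanese"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Initialize both the encoder and the decoder from the same BERT weights.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(checkpoint, checkpoint)

# The decoder needs generation-related special tokens set explicitly.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.pad_token_id = tokenizer.pad_token_id
```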


Thanks for the quick reply!

The last bert2bert is amazing!
I will try these methods. Thank you very much.