I’m researching text summarization in low-resource languages (like Sanskrit) and came across the LongT5 model. I was hoping to use this model with AutoTrain, but I was unable to find the preprocessing information. Hence, I was wondering where I can feed the preprocessing function code into the model before fine-tuning it on my .csv dataset with AutoTrain. Any guidance on how I can go about modifying the preprocessing code as necessary would be greatly appreciated. Since I have limited time and experience, I was hoping to be able to use the streamlined AutoTrain website!
Thank you for the response. When I spoke with my mentor, he mentioned that this would be a well-scoped project. Could you please elaborate on why it is not feasible? Would I be able to upload my CSV training data file (containing text-summary pairs) and have AutoTrain iterate through the data to train the existing LongT5 model architecture?
My understanding is that, once the text is tokenized, the LongT5 training process is language-independent. Additionally, surajp/RoBERTa-hindi-guj-san includes a tokenizer that covers Sanskrit, so I was wondering whether there is a way to combine its preprocessing with the LongT5 model’s architecture for my project.
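For context, this untested sketch is roughly what I had in mind for the preprocessing step, assuming a CSV with "text" and "summary" columns (the file name and column names are just placeholders I made up):

```python
import pandas as pd
from transformers import AutoTokenizer

# Sanskrit-capable tokenizer from the model mentioned above
# (assuming it loads via AutoTokenizer and can be paired with LongT5 at all)
tokenizer = AutoTokenizer.from_pretrained("surajp/RoBERTa-hindi-guj-san")

def preprocess(batch, max_input_length=4096, max_target_length=256):
    # Tokenize the source texts and the reference summaries from the CSV columns
    model_inputs = tokenizer(batch["text"], max_length=max_input_length, truncation=True)
    labels = tokenizer(batch["summary"], max_length=max_target_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Hypothetical file containing my text-summary pairs
df = pd.read_csv("sanskrit_summaries.csv")
examples = preprocess(df.to_dict(orient="list"))
```

Is something like this what I would need to run myself, since AutoTrain doesn’t expose a place to plug it in?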
That model was pretrained from a multilingual RoBERTa. You are welcome to pretrain your own model from scratch using the LongT5 approach, but I am warning you that it will probably cost thousands of dollars.
Moreover, AutoTrain will not be able to do pretraining.
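To make the distinction concrete, here is a rough sketch (the layer sizes are illustrative assumptions, not a recipe) of what reusing that tokenizer with a from-scratch LongT5 would look like. The resulting model has randomly initialized weights, so it would still need the full pretraining run on a large Sanskrit corpus before any summarization fine-tuning, and that pretraining is exactly the part AutoTrain does not cover.

```python
from transformers import AutoTokenizer, LongT5Config, LongT5ForConditionalGeneration

# Reuse the existing Sanskrit-capable tokenizer (no retraining of the tokenizer itself)
tokenizer = AutoTokenizer.from_pretrained("surajp/RoBERTa-hindi-guj-san")

# Illustrative config roughly in the range of a base-sized model; sizes are assumptions
config = LongT5Config(
    vocab_size=len(tokenizer),
    d_model=768,
    num_layers=12,
    num_decoder_layers=12,
    num_heads=12,
)

# Randomly initialized weights: this is what the expensive pretraining would have to train
model = LongT5ForConditionalGeneration(config)
print(f"{model.num_parameters():,} parameters to pretrain from scratch")
```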
You can get a decent model after 12 hours or so on a v3-8 TPU, but LongT5 is likely much slower, so I would guess around 24-48 hours on a v3-8 TPU and 48-96 hours on 8x A100.
For summarization, you should have at least 100 examples, but it would be better to have 500 or 1,000. There aren’t any hard rules, so take these numbers with a grain of salt.