Fine-tune LongT5 model


I’m researching text summarization in low-resource languages (like Sanskrit) and came across the LongT5 model. I was hoping to use this model with AutoTrain, but I was unable to find any information on preprocessing. Where can I supply the preprocessing function so that it runs before the model is fine-tuned on my .csv dataset with AutoTrain? Any guidance on how to modify the preprocessing code as necessary would be greatly appreciated. As I have limited time and experience, I was hoping to be able to use the streamlined AutoTrain website!

Thank you and have a great day!

Hi there,

LongT5 only works in English, and there is no way to use it for Sanskrit without spending thousands of dollars to pretrain it from scratch.

mT5 and mBART are multilingual models that can do summarization and might cover Sanskrit, but I am not sure.
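If it helps, one rough way to check whether a multilingual tokenizer handles Sanskrit reasonably is to count how many subword tokens it produces per word on a few sample sentences. Below is a minimal sketch; `tokens_per_word` is a hypothetical helper name, and it only assumes the tokenizer object has a `tokenize(text)` method, as Hugging Face tokenizers do. A much higher ratio than you see on English text suggests the vocabulary covers the language poorly, though this is a heuristic and not a guarantee of model quality.

```python
def tokens_per_word(tokenizer, texts):
    """Average number of subword tokens per whitespace-separated word.

    `tokenizer` is any object exposing `tokenize(text) -> list[str]`.
    Returns 0.0 when the sample contains no words.
    """
    n_tokens = 0
    n_words = 0
    for text in texts:
        n_tokens += len(tokenizer.tokenize(text))
        n_words += len(text.split())
    return n_tokens / n_words if n_words else 0.0


# Example usage (assumes `transformers` is installed and that
# `sanskrit_samples` is your own list of sentences):
#
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("google/mt5-small")
#   print(tokens_per_word(tok, sanskrit_samples))
```

Comparing the number you get for Sanskrit against the same measurement on English sentences gives a quick sanity check before committing to a model.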

Thank you for the response. When I spoke with my mentor, he mentioned that this would be a well-scoped project. Could you please elaborate on why it is not feasible? Would I be able to upload my CSV training data file (containing text-summary pairs) and have AutoTrain iterate through the data to train the existing LongT5 model architecture?

As I understand it, once the text is tokenized, the LongT5 training process is language-independent. Additionally, surajp/RoBERTa-hindi-guj-san includes a tokenizer for Sanskrit, in case there is a way to combine its preprocessing with the LongT5 architecture for my project.

That model was pretrained as a multilingual RoBERTa. You are welcome to pretrain your own model from scratch using the LongT5 approach. I am just warning you that it will probably cost thousands of dollars.

Moreover, AutoTrain will not be able to do pretraining.

Thank you for the update!

How long do you suspect training would take with LongT5? Also, are you aware of a minimum number of text-summary pairs that the LongT5 model requires?

You can get a decent model after 12 hours or so on a v3-8 TPU, but LongT5 is likely much slower, so I would guess around 24-48 hours on a v3-8 TPU and 48-96 hours on 8x A100.

For summarization, you should have at least 100 text-summary pairs, but it would be better to have 500 or 1000. There aren’t any hard rules, so take these numbers with a grain of salt.
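To see where your CSV stands against those rough numbers, a quick count of usable rows can help. This is only a sketch: the column names `text` and `summary` are assumptions, so change them to match your file, and `count_valid_pairs` is a hypothetical helper that uses nothing but the standard library.

```python
import csv


def count_valid_pairs(csv_file, text_col="text", summary_col="summary"):
    """Count rows where both the text and the summary are non-empty.

    `csv_file` is an open text-mode file (or any iterable of CSV lines)
    whose first line is a header row containing the two column names.
    """
    reader = csv.DictReader(csv_file)
    valid = 0
    for row in reader:
        text = (row.get(text_col) or "").strip()
        summary = (row.get(summary_col) or "").strip()
        if text and summary:
            valid += 1
    return valid


# Example usage ("train.csv" is a placeholder path):
#
#   with open("train.csv", newline="", encoding="utf-8") as f:
#       print(count_valid_pairs(f))
```

Rows with a missing text or summary are skipped, so the count reflects only the pairs that could actually be used for training.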
