I’m researching text summarization in low-resource languages (like Sanskrit) and came across the LongT5 model. I was hoping to use this model with AutoTrain, but I was unable to find the preprocessing information. Hence, I was wondering where I can feed the preprocessing function code into the model before fine-tuning it on my .csv dataset with AutoTrain. Any guidance on how I can go about modifying the preprocessing code as necessary would be greatly appreciated. Since I have limited time and experience, I was hoping to be able to use the streamlined AutoTrain website!
Thank you for the response. When I spoke with my mentor, he mentioned that this would be a well-scoped project. Could you please elaborate on why it is not feasible? Would I be able to upload my CSV training data file (containing text-summary pairs) and have AutoTrain iterate through the data to train the existing LongT5 model architecture?
My understanding is that, once the text is tokenized, the LongT5 training process is language-independent. Additionally, surajp/RoBERTa-hindi-guj-san includes a tokenizer that covers Sanskrit, so I was wondering whether there is a way to combine its preprocessing with the LongT5 model’s architecture for my project.
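For context, this untested sketch is roughly what I had in mind for the preprocessing step, assuming a CSV with "text" and "summary" columns (the file name and column names are just placeholders I made up):

```python
import pandas as pd
from transformers import AutoTokenizer

# Sanskrit-capable tokenizer from the model mentioned above
# (assuming it loads via AutoTokenizer and can be paired with LongT5 at all)
tokenizer = AutoTokenizer.from_pretrained("surajp/RoBERTa-hindi-guj-san")

def preprocess(batch, max_input_length=4096, max_target_length=256):
    # Tokenize the source texts and the reference summaries from the CSV columns
    model_inputs = tokenizer(batch["text"], max_length=max_input_length, truncation=True)
    labels = tokenizer(batch["summary"], max_length=max_target_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Hypothetical file containing my text-summary pairs
df = pd.read_csv("sanskrit_summaries.csv")
examples = preprocess(df.to_dict(orient="list"))
```

Is something like this what I would need to run myself, since AutoTrain doesn’t expose a place to plug it in?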
That model was pretrained from a multilingual RoBERTa. You are welcome to pretrain your own model from scratch using the LongT5 approach, but I am warning you that it will probably cost thousands of dollars.
Moreover, AutoTrain will not be able to do pretraining.
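To make the distinction concrete, here is a rough sketch (the layer sizes are illustrative assumptions, not a recipe) of what reusing that tokenizer with a from-scratch LongT5 would look like. The resulting model has randomly initialized weights, so it would still need the full pretraining run on a large Sanskrit corpus before any summarization fine-tuning, and that pretraining is exactly the part AutoTrain does not cover.

```python
from transformers import AutoTokenizer, LongT5Config, LongT5ForConditionalGeneration

# Reuse the existing Sanskrit-capable tokenizer (no retraining of the tokenizer itself)
tokenizer = AutoTokenizer.from_pretrained("surajp/RoBERTa-hindi-guj-san")

# Illustrative config roughly in the range of a base-sized model; sizes are assumptions
config = LongT5Config(
    vocab_size=len(tokenizer),
    d_model=768,
    num_layers=12,
    num_decoder_layers=12,
    num_heads=12,
)

# Randomly initialized weights: this is what the expensive pretraining would have to train
model = LongT5ForConditionalGeneration(config)
print(f"{model.num_parameters():,} parameters to pretrain from scratch")
```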
You can get a decent model after 12 hours or so on a v3-8 TPU, but LongT5 is likely much slower, so I would guess around 24-48 hours on a v3-8 TPU and 48-96 hours on 8x A100.
For summarization, you should have at least 100 examples, but it would be better to have 500 or 1,000. There aren’t any hard rules, so take these numbers with a grain of salt.