I want to summarize fairly long texts (300 to 5000 words). I have about 30k examples to work with, and I have a few questions before I get started, to avoid heading off in the wrong direction.
I understand there are models like BART and T5 that are specifically tailored to such tasks. However, there is also the option of taking separate models (say BERT and GPT2) and stitching them together via EncoderDecoderModel().
Now, to me it looks like either of those approaches should work fine. Which one will work better can probably only be found out by trying.
Is this assumption correct?
Another thing I’m wondering about is which of these is going to be more costly to train in terms of computing resources. I’d assume that a single-model approach (i.e. T5 or BART) would use much less memory, but maybe I’m missing something?
Is there theoretically a way to train BERT and GPT2 separately in the EncoderDecoderModel() approach?
I know one can fine-tune these models on their own, but is it still going to be feasible to combine them as encoder and decoder afterwards and expect good summaries? (I don’t think so, but thought I’d ask.)
Any clarifications on these points would be greatly appreciated.
Personally, I’m partial to the EncoderDecoderModel() approach, as there are BERT and GPT-2 models available in my target language, where there is only a small T5 model (not BART) available in the same language.
Hi @neuralpat, I believe you are right that you can fuse BERT with GPT-2 checkpoints with EncoderDecoderModel, although I suspect the performance may not be great given this table from the paper that this class is based on (look at the BERT2GPT row):
So you might be better off trying to find a RoBERTa model in your target language, and using that as the encoder instead of BERT.
Now whether this will work better than fine-tuning BART or T5 is not obvious to me - as with most things in deep learning you probably have to determine it empirically
Thank you for your reply and thanks for pointing this out!
There is no RoBERTa model in my target language, so I guess I’ll try BERTShare.
To make sure I understand this correctly: BERTShare means both encoder and decoder are BERT and they share the same weights, right?
That’s great, because cutting the memory footprint is gonna help me out a lot
Yep that’s right - BERTShare (and the other “share” models) have shared weights for the encoder & decoder. Now what I don’t know is whether you can initialise this in the EncoderDecoder class or not.
Perhaps the resident seq2seq experts (@valhalla or @patrickvonplaten) can help answer this: does EncoderDecoder support the BERTShare model variants from the Leveraging Pre-trained Checkpoints for Sequence Generation Tasks paper?
EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-cased", "bert-base-cased", tie_encoder_decoder=True)
tie_encoder_decoder=True does exactly that
Great detective work! Not at all obvious from the
Can’t claim credit for this one, I stumbled upon Leveraging Pre-trained Language Model Checkpoints for Encoder-Decoder Models while looking for an answer
@lewtun can I bother you again with a follow-up?
So, I just now realized that it will be very difficult for me to process long documents as BERT’s max input length is 512. So now I’m looking for checkpoints that can process longer input sequences and this is where my question arises…
How much does it matter which task the model was trained for? For example there is a large xlm-roberta model in my target language but it was trained for token-classification. Would I even be able to utilize that?
My understanding is that any task can potentially be useful, as the models do learn things about the language they process, but maybe this isn’t as universally true as I thought?
Hi @neuralpat, no bother at all
If I understood your original aim, you’d like to perform summarization right? As far as I know, you won’t be able to use the xlm-r model fine-tuned on token classification since what you really need is a language modelling head to generate the summary.
How long are your documents? Depending on time / cost, I would be tempted to still run an experiment with the encoder-decoder approach just to get a feel for how well this baseline performs on the dataset. For example, the CNN / DailyMail dataset has articles that are longer than most Transformer models’ context size, yet the summaries are not so bad.
If length is really an issue, then you might want to check out the Longformer Encoder-Decoder (LED) model allenai/led-base-16384, which can process 16k tokens.
There’s also a long thread here with a discussion related to your issue, so you might find some relevant ideas there: Summarization on long documents
What is generally true is that the pretrained checkpoints can be fine-tuned on a variety of downstream tasks via transfer learning - perhaps this is what you had in mind?
Thank you for your detailed answer!
Yes, that is my aim and the documents are between 300 and 5000 words and we’re talking about domain specific documents on top of that.
Simply truncating them will probably work well on the shorter ones, but the median length is about 2000 words. I could still try it, but from what I can gather there are no obvious parts of the documents that I could discard, and just truncating at 512 will definitely cut out important details.
I have looked at LongFormer, but sadly there isn’t one in my target language (German btw) and I don’t think I’ll be able to train from scratch as my resources are fairly limited.
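To put rough numbers on that, here is a minimal sketch of how much of a document survives truncation. It assumes one word costs one token, which is optimistic because subword tokenizers usually produce more tokens than words. `retained_fraction` is a hypothetical helper, not part of any library.

```python
def retained_fraction(doc_len: int, budget: int = 512) -> float:
    """Share of a document kept when it is truncated to `budget` units.

    Assumption: lengths are counted in words and one word is treated as
    one token, so a real subword tokenizer will keep even less.
    """
    if doc_len <= 0:
        return 1.0  # nothing to lose in an empty document
    return min(doc_len, budget) / doc_len


# Shortest, median and longest document lengths mentioned in this thread.
for doc_len in (300, 2000, 5000):
    print(f"{doc_len} words: {retained_fraction(doc_len):.1%} kept")
```

Under that assumption, a median 2000-word document keeps only about a quarter of its text at a budget of 512, and the longest ones keep about a tenth.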
Thanks for pointing me to that thread. It does seem promising to first pick the k most important sentences and then use those.
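In case it helps anyone, here is a minimal sketch of that "pick the k most important sentences" idea. The scoring rule (average in-document word frequency) and the regex sentence splitter are my own simplifying assumptions, not what the linked thread prescribes, and `split_sentences` and `top_k_sentences` are hypothetical helper names.

```python
import re
from collections import Counter


def split_sentences(text: str) -> list[str]:
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    # It will wrongly split on abbreviations such as "z.B.".
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence: str) -> list[str]:
    # Lowercased word tokens; \w also matches umlauts in Python 3 strings.
    return re.findall(r"\w+", sentence.lower())


def top_k_sentences(text: str, k: int) -> list[str]:
    """Return the k highest-scoring sentences in their original order."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence: str) -> float:
        words = tokenize(sentence)
        # Average frequency, so long sentences are not favoured by length alone.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)[:k]
    return [sentences[i] for i in sorted(ranked)]


doc = "Cats sleep. The model reads long text. The model summarizes text."
print(" ".join(top_k_sentences(doc, k=2)))
```

The selected sentences could then be joined and passed to the summarization model in place of the full document. A real setup would likely need a proper sentence splitter and stop-word handling for German.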
For future reference if anyone comes across this thread:
There is this model: facebook/mbart-large-cc25 · Hugging Face
It seems like mBART can also be fine-tuned for summarization (the docs even provide a sample for that). The nice thing is that it accepts a maximum input length of 1024, so I’ll be giving that a shot before I do anything else.