"TypeError: 'list' object cannot be interpreted as an integer" while evaluating a summarization model (seq2seq, BART)

Hello all, I have been using this code:

to learn how to train a summarization model. However, since I needed an extractive model, I replaced 'sshleifer/distilbart-xsum-12-3' with 'facebook/bart-large-cnn' for both
AutoModelForSeq2SeqLM.from_pretrained and AutoTokenizer.from_pretrained.

I am able to train the model and get two different summaries (one before training and one after). But the summaries are abstractive, so I changed one option in the training_args (predict_with_generate) to False.


training_args = Seq2SeqTrainingArguments(
    output_dir="results",
    num_train_epochs=1,  # demo
    do_train=True,
    do_eval=True,
    per_device_train_batch_size=4,  # demo
    per_device_eval_batch_size=4,
    # learning_rate=3e-05,
    warmup_steps=500,
    weight_decay=0.1,
    label_smoothing_factor=0.1,
    predict_with_generate=False,
    logging_dir="logs",
    logging_steps=50,
    save_total_limit=3,
)

However, after doing this, I get an error while running trainer.evaluate():

text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)

TypeError: 'list' object cannot be interpreted as an integer

And if I comment out the option, the code runs, albeit without the metrics (ROUGE, etc.), and I am able to get extractive summaries.

Can anyone help me resolve this error so that I can run extractive summaries and get the metrics as well?

Thanks!

Hey @gildesh, I'm not sure why you say BART will provide extractive summaries - my understanding is that it is an encoder-decoder Transformer, so the decoder will generate summaries if trained to do so.

In any case, the reason you get an error with predict_with_generate=False is that the Trainer won't call the model's generate() method in that case (it just computes the loss / logits, which is why you don't see the metrics).

So if you want to compute metrics like ROUGE during training, you'll need to generate the summaries with predict_with_generate=True.
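For example, a minimal setup might look like this (a sketch, assuming the evaluate library and the tokenizer from your notebook; you'd also pass compute_metrics=compute_metrics to the Seq2SeqTrainer):

import numpy as np
import evaluate  # pip install evaluate rouge_score

rouge = evaluate.load("rouge")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # labels use -100 as the padding placeholder; swap it back before decoding
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    return rouge.compute(predictions=decoded_preds, references=decoded_labels)

training_args = Seq2SeqTrainingArguments(
    output_dir="results",
    predict_with_generate=True,  # generate at eval time so ROUGE can be computed
    # ... keep your other arguments as before
)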

PS the notebook you shared looks more complicated than it needs to be. I recommend using the official summarization example as a foundation (it will certainly work with BART).


Thanks lewtun!
Can you recommend any other type of model for extractive summarization, especially one that I can train further?

Also, how can I make sure that the summaries aren't cut short in the middle? I am getting sentences like "your sling. Cut a piece of fabric …", where "your sling" is a fragment of a sentence that was cut off midway. How can I avoid this?

I tried two ways:
1) changing max_length in model.generate(input_ids, attention_mask=attention_mask, max_length=300)
2) varying encoder_max_length in batch_tokenize

Which is the correct method? If neither, can you suggest one?
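For reference, the two attempts look roughly like this (a sketch following my notebook's naming, so encoder_max_length and the tokenization call may differ for you):

# Option 1: cap the output length at generation time
summary_ids = model.generate(
    input_ids,
    attention_mask=attention_mask,
    max_length=300,  # upper bound on generated tokens
    min_length=100,  # optional floor, to discourage very short outputs
)

# Option 2: cap the input length at tokenization time
# (encoder_max_length is my notebook's name for the tokenizer truncation limit)
inputs = tokenizer(
    batch["document"],  # illustrative column name
    max_length=encoder_max_length,
    truncation=True,
    padding="max_length",
    return_tensors="pt",
)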

Hey @gildesh, extractive summarization is usually framed as a ranking task, where you chunk your document into sentences and then select the top-N sentences that are most similar to the summary.

So you would probably want to take an embedding-based approach, using e.g. sentence-transformers. There's a nice blog post about it here.
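As a rough illustration, one common unsupervised variant scores each sentence against the mean embedding of the document (a sketch, assuming sentence-transformers and NLTK are installed; the model name is just one popular choice):

from sentence_transformers import SentenceTransformer, util
import nltk

nltk.download("punkt")
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def extractive_summary(document, top_n=3):
    sentences = nltk.sent_tokenize(document)
    embeddings = embedder.encode(sentences, convert_to_tensor=True)
    # score each sentence by cosine similarity to the document centroid
    centroid = embeddings.mean(dim=0, keepdim=True)
    scores = util.cos_sim(centroid, embeddings)[0]
    # keep the top-N sentences, restored to document order
    top_idx = sorted(scores.argsort(descending=True)[:top_n].tolist())
    return " ".join(sentences[i] for i in top_idx)

Since the selected sentences are copied verbatim from the document, the output is extractive by construction.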

There are more advanced models like HiBERT, but I'm not sure the complexity is worth it compared to just using abstractive models like BART.


Hey @gildesh, someone in the community has made a nice Space that showcases extractive summarisation (along the lines I described): Unsupervised_Extractive_Summarization - a Hugging Face Space by Hellisotherpeople