Finetuning BART for Abstractive Text Summarisation

Hello All,

I have been stuck on the following for a few days and would really appreciate some help.

I am currently working on an abstractive summarisation project and am trying to fine-tune BART on my custom dataset. I used the run_summarization.py fine-tuning script provided by Hugging Face as follows:

python run_summarization.py \
    --model_name_or_path facebook/bart-base \
    --do_train \
    --do_eval \
    --do_predict \
    --train_file {train_path} \
    --validation_file {validation_path} \
    --test_file {test_path} \
    --text_column full_text \
    --summary_column summary \
    --output_dir training \
    --overwrite_output_dir \
    --per_device_train_batch_size=4 \
    --per_device_eval_batch_size=4 \
    --num_train_epochs=3 \
    --predict_with_generate \
    --save_steps=500 \
    --logging_first_step=True \
    --logging_steps=500 \
    --eval_steps=500

The predictions generated by this training script look promising: the generated_predictions.txt it outputs contains summaries of about 100 words each. However, when I try to use the fine-tuned checkpoint to generate further predictions myself, I get terrible results:

from transformers import BartForConditionalGeneration, BartTokenizer

# Load the fine-tuned checkpoint written by the training script
model = BartForConditionalGeneration.from_pretrained('./training')
tokenizer = BartTokenizer.from_pretrained('./training')

max_input_length = 1024
max_target_length = 128

text = {example text of around 500 words}

# Tokenise the source document
model_inputs = tokenizer(text, max_length=max_input_length,
                         truncation=True, return_tensors='pt')

# Generate a summary from the fine-tuned model
pred = model.generate(model_inputs['input_ids'])

print(tokenizer.decode(pred[0], skip_special_tokens=True))

All I get is an incomplete sentence of about 10-15 words.

I am really confused: the model has been fine-tuned, yet I am getting completely different results from the predictions made by the training script. My question is: how can I use a checkpoint to reproduce those results?

I guess you can rerun the script without the --do_train argument, pointing it at your saved checkpoint.
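Something like this (a sketch; it assumes your final model was saved to ./training and reuses your other arguments, and the predictions output directory is just a fresh folder I made up):

python run_summarization.py \
    --model_name_or_path ./training \
    --do_predict \
    --test_file {test_path} \
    --text_column full_text \
    --summary_column summary \
    --output_dir predictions \
    --per_device_eval_batch_size=4 \
    --predict_with_generate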

The following guide uses BartForConditionalGeneration as well. I'm not sure whether adding max_length to generate() makes a difference.
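If you want to test it, here is a minimal sketch. The values are assumptions: 128 matches the max_target_length you already defined, and beam search is a common choice for summarisation.

# Pass generation arguments explicitly instead of relying on defaults
pred = model.generate(
    model_inputs['input_ids'],
    attention_mask=model_inputs['attention_mask'],
    max_length=128,      # assumed cap; otherwise generate() uses the model config's default
    num_beams=4,         # assumed beam width
    early_stopping=True,
)
print(tokenizer.decode(pred[0], skip_special_tokens=True))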

Or you don’t have enough data/example in your training dataset.