Summary length for knowledge graphs vs long documents

I am training an Unlimiformer model (an augmentation of BART-base) on the GovReport dataset to generate summaries. I have three experiments:

  1. Long Documents (LDs)
  2. Knowledge Graphs (KGs) of long documents
  3. KG_LD (KG + LD concatenated in this order before tokenization)

Here is the odd thing:

For (1), I find that the LD summaries are always around 100 tokens.
For (2) and (3), I find that the KG and KG_LD summaries converge to around 900-1000 tokens.

My model config has max_target_length=1024:

  "model_name_or_path": "tau/bart-base-sled",                                      
  "use_auth_token": false,                                                         
  "max_target_length": 1024,                                                       
  "fp16": true                                                                     
} 
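For context, my understanding of what max_target_length does (a minimal sketch modeled on the standard HuggingFace run_summarization.py preprocessing, which the SLED/Unlimiformer scripts build on; the field names "document"/"summary" and the use of the plain BART tokenizer are assumptions, not the exact SLED code):

from transformers import AutoTokenizer

# Stand-in for the SLED checkpoint's underlying BART tokenizer.
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")

def preprocess(examples, max_source_length=16384, max_target_length=1024):
    # Truncate the source documents.
    model_inputs = tokenizer(
        examples["document"], max_length=max_source_length, truncation=True
    )
    # max_target_length truncates the *training labels* only; it does not
    # by itself cap how long generate() runs at inference time.
    labels = tokenizer(
        text_target=examples["summary"], max_length=max_target_length, truncation=True
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

If I read this right, max_target_length caps the labels at 1024 during training but says nothing about decoding length.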

whereas in the checkpoint config I see that "summarization" has max_length=128:

{
  ...
  "summarization": {
    "length_penalty": 1.0,
    "max_length": 128,
    "min_length": 12,
    "num_beams": 4
  },
  ...
}
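As far as I know, those task_specific_params are applied by pipeline("summarization") rather than by Seq2SeqTrainer, but the checkpoint's top-level max_length can still cap generate() if nothing overrides it. A quick way to inspect both (a sketch; I read the plain facebook/bart-base config as a stand-in, since loading the SLED config needs the py-sled package installed):

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("facebook/bart-base")
# Default cap used by generate() when no max_length/max_new_tokens is passed.
print(cfg.max_length)
# The 128-token cap above lives here; pipeline("summarization") applies it
# to the model config, but Seq2SeqTrainer does not.
print(cfg.task_specific_params["summarization"]["max_length"])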

In the data config, generation_max_length is:

{
  "dataset_name": "tau/sled",                                                      
  "dataset_config_name": "gov_report",                                             
  "max_source_length": 16384,                                                      
  "generation_max_length": 1024,                                                   
  "max_prefix_length": 0,                                                          
  "pad_prefix": false,                                                             
  "num_train_epochs": 10,                                                          
  "metric_names": ["rouge"],                                                       
  "metric_for_best_model": "rouge/geometric_mean",                                 
  "greater_is_better": true                                                        
}
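For reference, my understanding of how generation_max_length is meant to reach generate() (a sketch of standard Seq2SeqTrainingArguments wiring; I am assuming the SLED/Unlimiformer run script forwards it the same way):

from transformers import Seq2SeqTrainingArguments

# generation_max_length only takes effect when predict_with_generate=True;
# otherwise eval-time generation falls back to the model config's max_length
# (which for this checkpoint may be far below 1024).
args = Seq2SeqTrainingArguments(
    output_dir="out",
    predict_with_generate=True,
    generation_max_length=1024,
    generation_num_beams=4,
)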

The only difference in the structure of the inputs is that the KGs are input as a single string of the form:
"<s> head_1 : relation_1 : tail_1 </s><s> head_2 : relation_2 : tail_2 </s>..."

Could this be causing the model to attempt to generate a summary for every triple? It certainly doesn't seem that way when I read the output: the LD summaries are uninformative and much shorter than the gold summaries, whereas the KG summaries are very rich in information (though often inaccurate, of course).
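For what it's worth, this is how I would check what the tokenizer actually does with the literal <s>/</s> markers (a sketch, again using the plain BART tokenizer as a stand-in for the checkpoint's):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/bart-base")
kg = "<s> head_1 : relation_1 : tail_1 </s><s> head_2 : relation_2 : tail_2 </s>"
ids = tok(kg)["input_ids"]
# If "<s>"/"</s>" come back as the special BOS/EOS tokens, the encoder sees
# one "sentence" per triple instead of plain text, which could plausibly
# change how much output the decoder is encouraged to produce.
print(tok.convert_ids_to_tokens(ids))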