Hello, I used this code to train a BART model and generate summaries.
However, the summaries come out at only 200-350 characters in length.
Is there some way to increase that length?
I thought of the following options:
encoder_max_length = 256 # demo
decoder_max_length = 64
which are used here:
def batch_tokenize_preprocess(batch, tokenizer, max_source_length, max_target_length):
    source, target = batch["document"], batch["summary"]
    source_tokenized = tokenizer(
        source, padding="max_length", truncation=True, max_length=max_source_length
    )
    target_tokenized = tokenizer(
        target, padding="max_length", truncation=True, max_length=max_target_length
    )
    batch = {k: v for k, v in source_tokenized.items()}
    # Ignore padding in the loss
    batch["labels"] = [
        [-100 if token == tokenizer.pad_token_id else token for token in l]
        for l in target_tokenized["input_ids"]
    ]
    return batch
train_data = train_data_txt.map(
    lambda batch: batch_tokenize_preprocess(
        batch, tokenizer, encoder_max_length, decoder_max_length
    ),
    batched=True,
    remove_columns=train_data_txt.column_names,
)
Also, another parameter that could matter is the max_length argument of the model.generate() function.
def generate_summary(test_samples, model):
    inputs = tokenizer(
        test_samples["document"],
        padding="max_length",
        truncation=True,
        max_length=encoder_max_length,
        return_tensors="pt",
    )
    input_ids = inputs.input_ids.to(model.device)
    attention_mask = inputs.attention_mask.to(model.device)
    outputs = model.generate(input_ids, attention_mask=attention_mask)
    output_str = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return outputs, output_str
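Note that the generate() call above passes no length settings, so it falls back to the model config's defaults. A sketch of the generation keyword arguments that control output length (the specific values here are illustrative assumptions, not tuned settings; all lengths are counted in tokens, not characters):

generation_kwargs = dict(
    max_length=256,          # raise the cap on generated tokens
    min_length=64,           # force generation past a minimum length
    num_beams=4,             # beam search, commonly used for summarization
    length_penalty=1.5,      # > 1.0 nudges beam search toward longer outputs
    no_repeat_ngram_size=3,  # reduce repetition in the longer summaries
)
# outputs = model.generate(input_ids, attention_mask=attention_mask, **generation_kwargs)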
Which of these should I alter to increase the length of the summaries?