Fine-tuning a summarization model using long text data

Hello, I want to fine-tune the pszemraj/led-base-book-summary model on my custom data of Bank Regulatory Documents (15-20 pages), but the documents are well above the input token limit. I could truncate them, but I believe that would cause a lot of loss of information. Can anyone suggest the right way to fine-tune using long documents?

I had a similar problem. What I did was add an [EOS] token (or the appropriate token for your model) at the end of each document, and then when tokenizing the documents I used the following setup:

from transformers import AutoTokenizer

# Tokenizer for the model being fine-tuned (here the one from the question)
tokenizer = AutoTokenizer.from_pretrained("pszemraj/led-base-book-summary")

trx_text_encoded = tokenizer(
    text=trx_text,                          # list of documents, each ending with the [EOS] token
    max_length=tokenizer.model_max_length,  # cap each chunk at the model's input limit
    truncation=True,
    padding="max_length",                   # pad every chunk to the same length
    return_overflowing_tokens=True,         # keep tokens that don't fit as extra chunks
    stride=12,                              # overlap 12 tokens between consecutive chunks
    return_tensors="pt"
)

return_overflowing_tokens=True will “chunk” extra-long text into multiple sequences of max_length, and stride creates an overlap of that many tokens at the beginning of each new sequence.
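If it helps, here is a minimal sketch of how to inspect what the chunking produced (it assumes the trx_text_encoded from the snippet above and a fast tokenizer, which is what AutoTokenizer loads by default):

# One row per chunk, not per document: shape is (n_chunks, model_max_length)
print(trx_text_encoded["input_ids"].shape)

# With a fast tokenizer, overflow_to_sample_mapping tells you which original
# document each chunk came from, e.g. tensor([0, 0, 0, 1, 1, ...])
print(trx_text_encoded["overflow_to_sample_mapping"])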

Thank you so much for your detailed response! Your solution using the tokenizer setup is incredibly helpful. I’m planning to implement it.

I have one more question: could you please elaborate on the significance of adding the [EOS] token at the end of each document? I want to understand its role in the fine-tuning process.

The idea with the [EOS] token was to “tell the model” that there is more than one document in the input. So basically, it separates where one document ends and the next begins, and everything in between belongs to a single document. I thought this could be helpful since the documents are “disjoint” in essence.
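In code, that could look something like this (just a sketch; the document list here is hypothetical):

# Sketch: mark the end of each raw document with the tokenizer's EOS token
# before tokenizing, so the model sees an explicit boundary between documents.
raw_documents = [  # hypothetical example documents
    "First regulatory document text ...",
    "Second regulatory document text ...",
]
trx_text = [doc + tokenizer.eos_token for doc in raw_documents]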

Thank you, that was incredibly helpful and provided me with valuable insight.

Thanks @itacdonev, this helped me also.

I didn’t know about the additional kwargs like stride and return_overflowing_tokens. Could you please help me: how did you learn about this?

I am curating code snippets like the one you have shared in the Transformers Universe space, and I want to provide additional context there. Transformers Universe - a Hugging Face Space by Kamaljp

The two main sources that I consulted first:

  1. HF documentation
  2. HF Causal language modeling - preparing datasets (YT video)

Then I tried the code on a very simple example, like a couple of sentences with max_length=4, to see it in action and check whether it was behaving as expected.
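Something along these lines (a sketch of that kind of toy check, assuming the same tokenizer as above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("pszemraj/led-base-book-summary")

# A couple of short sentences with a tiny max_length so the chunking is easy to see
toy_text = ["The quick brown fox jumps over the lazy dog. It was a sunny day."]
enc = tokenizer(
    text=toy_text,
    max_length=4,                   # tiny limit, just for inspection
    truncation=True,
    stride=1,                       # 1-token overlap between consecutive chunks
    return_overflowing_tokens=True,
)

# Decode each chunk to see how the text was split and where the overlap is
for ids in enc["input_ids"]:
    print(tokenizer.decode(ids))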

Hope this helps.
