Fine-tuning a summarization model using long text data

Hello, I want to fine-tune the pszemraj/led-base-book-summary model on my custom data of Bank Regulatory Documents (15-20 pages), but the documents are well above the input token limit. I could truncate them, but I believe that would cause a lot of loss of information. Can anyone suggest the right way to fine-tune using long documents?

I had a similar problem. What I did was add an [EOS] token (or the appropriate token for your model) at the end of each document, and then when tokenizing the documents I used the following setup:

from transformers import AutoTokenizer

# Tokenizer for the model being fine-tuned (here the one from the question)
tokenizer = AutoTokenizer.from_pretrained("pszemraj/led-base-book-summary")

trx_text_encoded = tokenizer(
    text=trx_text,                          # list of documents, each ending with the [EOS] token
    max_length=tokenizer.model_max_length,  # cap each chunk at the model's input limit
    truncation=True,
    padding="max_length",                   # pad every chunk to the same length
    return_overflowing_tokens=True,         # keep tokens that don't fit as extra chunks
    stride=12,                              # overlap 12 tokens between consecutive chunks
    return_tensors="pt"
)

return_overflowing_tokens=True will “chunk” extra-long text into multiple sequences of max_length, and stride creates an overlap of that many tokens at the beginning of each new sequence.
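If it helps, here is a minimal sketch of how to inspect what the chunking produced (it assumes the trx_text_encoded from the snippet above and a fast tokenizer, which is what AutoTokenizer loads by default):

# One row per chunk, not per document: shape is (n_chunks, model_max_length)
print(trx_text_encoded["input_ids"].shape)

# With a fast tokenizer, overflow_to_sample_mapping tells you which original
# document each chunk came from, e.g. tensor([0, 0, 0, 1, 1, ...])
print(trx_text_encoded["overflow_to_sample_mapping"])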

Thank you so much for your detailed response! Your solution using the tokenizer setup is incredibly helpful. I’m planning to implement it.

I have one more question: could you please elaborate on the significance of adding the [EOS] token at the end of each document? I want to understand its role in the fine-tuning process.

The idea with the [EOS] token was to “tell the model” that there is more than one document in the input. So basically, it separates where one document ends and the next begins, and everything in between belongs to a single document. I thought this could be helpful since the documents are “disjoint” in essence.
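In code, that could look something like this (just a sketch; the document list here is hypothetical):

# Sketch: mark the end of each raw document with the tokenizer's EOS token
# before tokenizing, so the model sees an explicit boundary between documents.
raw_documents = [  # hypothetical example documents
    "First regulatory document text ...",
    "Second regulatory document text ...",
]
trx_text = [doc + tokenizer.eos_token for doc in raw_documents]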

Thank you, that was incredibly helpful and provided me with valuable insight.

Thanks @itacdonev, this helped me also.

I didn’t know about the additional kwargs like stride and return_overflowing_tokens. Could you please help me: how did you learn about this?

I am curating code snippets like the one you have shared in the Transformers Universe space, and I want to provide additional context there. Transformers Universe - a Hugging Face Space by Kamaljp

The two main sources that I consulted first:

  1. HF documentation
  2. HF Causal language modeling - preparing datasets (YT video)

Then I tried the code on a very simple example, like a couple of sentences with max_length=4, to see it in action and check whether it was behaving as expected.
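Something along these lines (a sketch of that kind of toy check, assuming the same tokenizer as above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("pszemraj/led-base-book-summary")

# A couple of short sentences with a tiny max_length so the chunking is easy to see
toy_text = ["The quick brown fox jumps over the lazy dog. It was a sunny day."]
enc = tokenizer(
    text=toy_text,
    max_length=4,                   # tiny limit, just for inspection
    truncation=True,
    stride=1,                       # 1-token overlap between consecutive chunks
    return_overflowing_tokens=True,
)

# Decode each chunk to see how the text was split and where the overlap is
for ids in enc["input_ids"]:
    print(tokenizer.decode(ids))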

Hope this helps.
