Dealing with Chunked Input Text and Summaries When Fine-Tuning a Summarization Model

"Hey everyone,

I’m fine-tuning a summarization model from Hugging Face on lengthy input texts from bank regulatory documents, paired with their comprehensive summaries. To work around the model’s input-length limit, I’ve chunked both the input texts and the summaries into smaller segments.

For instance, say a document ends up as 5 chunks of input text while its summary ends up as 3 chunks. What is the best way to feed this segmented data into the model for training?

Any advice on how to handle such segmented data effectively would be highly appreciated!

Thanks in advance!

    from transformers import AutoTokenizer

    # Checkpoint name is a placeholder for whichever model I'm fine-tuning.
    tokenizer = AutoTokenizer.from_pretrained("model-checkpoint")

    def chunk_examples(batch):
        # Split any document longer than max_length into overlapping chunks;
        # stride is the number of tokens consecutive chunks share.
        inputs = tokenizer(
            batch["document"],
            padding="max_length",
            truncation=True,
            return_overflowing_tokens=True,
            stride=12,
            max_length=1024,
        )
        # Chunk the reference summaries the same way, with a shorter limit.
        outputs = tokenizer(
            batch["summary"],
            padding="max_length",
            truncation=True,
            return_overflowing_tokens=True,
            stride=12,
            max_length=256,
        )
        return inputs, outputs

This is what I’m using for the chunking (the return_overflowing_tokens and stride options).

Your approach to chunking the input texts and summaries with Hugging Face’s tokenizer (the return_overflowing_tokens and stride options) is a good start. Here are some additional considerations for handling segmented data effectively when training a summarization model, each illustrated by a short sketch after the list:

  1. Alignment of Input and Summary Chunks: Make sure each input chunk is paired with the summary content that actually describes it. Chunking documents and summaries independently (e.g., 5 input chunks vs. 3 summary chunks) gives no natural one-to-one pairing, so you may need to adjust the chunking parameters or pre-process the data to keep them aligned. The overflow_to_sample_mapping field at least tells you which original example each chunk came from (sketch 1 below).

  2. Batching and Padding: Since chunking produces variable-length sequences for both inputs and summaries, batch the data appropriately and pad within each batch. Dynamic per-batch padding wastes less compute than padding everything to max_length, and it keeps the data efficiently processable in parallel during training (sketch 2 below).

  3. Batch Size and GPU Memory: Consider the batch size and available GPU memory. With 1024-token input chunks, smaller batch sizes may be necessary to fit in memory; gradient accumulation lets you keep a larger effective batch size anyway. Experiment to find a balance between training efficiency and memory usage (sketch 3 below).

  4. Loss Calculation: Decide how to compute the training loss over segmented examples. Since one document contributes several segments, you may need to aggregate the per-segment losses into an overall loss for the batch, for example by averaging them or weighting them by segment length (sketch 4 below).

  5. Evaluation and Metrics: Decide how you’ll evaluate the model during training and validation. Metrics such as ROUGE are usually more meaningful when computed on the full documents and summaries rather than on individual segments, so stitch chunk-level predictions back together before scoring, and keep this in mind when interpreting results (sketch 5 below).

  6. Fine-Tuning Strategy: Experiment with different strategies for the segmented data: training on each segment independently, training with overlapping segments, or inserting special tokens to mark segment boundaries. Compare strategies on model performance and convergence (sketch 6 below).
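Sketch 1, for chunk alignment: a minimal sketch assuming a fast tokenizer (only fast tokenizers return overflow_to_sample_mapping). The grouping helper and variable names are mine; inputs and outputs are the encodings returned by the chunk_examples function in the question.

    from collections import defaultdict

    def group_chunks_by_example(encoded):
        # overflow_to_sample_mapping[i] is the index of the original example
        # that chunk i was split from, so we can regroup chunks per document.
        chunks = defaultdict(list)
        for chunk_idx, example_idx in enumerate(encoded["overflow_to_sample_mapping"]):
            chunks[example_idx].append(encoded["input_ids"][chunk_idx])
        return chunks

    # inputs/outputs come from chunk_examples(batch) above.
    doc_chunks = group_chunks_by_example(inputs)       # e.g. {0: [5 chunks], ...}
    summary_chunks = group_chunks_by_example(outputs)  # e.g. {0: [3 chunks], ...}

Note that this only maps chunks back to their source document; with 5 input chunks and 3 summary chunks there is still no one-to-one pairing, so you need to decide on a pairing scheme (see points 4 and 6).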
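Sketch 2, for batching and padding: dynamic per-batch padding with DataCollatorForSeq2Seq. The checkpoint name is a placeholder, and tokenized_dataset is assumed to be your chunked dataset with its format set to torch.

    from torch.utils.data import DataLoader
    from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                              DataCollatorForSeq2Seq)

    checkpoint = "your-checkpoint"  # placeholder for whatever model you fine-tune
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

    # Pads each batch to its longest sequence and pads labels with -100,
    # which the loss function ignores.
    collator = DataCollatorForSeq2Seq(tokenizer, model=model)
    loader = DataLoader(tokenized_dataset, batch_size=4, collate_fn=collator)

If you go this route, drop padding="max_length" from the tokenizer calls and let the collator handle the padding.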
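Sketch 3, for batch size and memory: gradient accumulation keeps the effective batch size up while the per-device batch stays small. All values here are illustrative.

    from transformers import Seq2SeqTrainingArguments

    training_args = Seq2SeqTrainingArguments(
        output_dir="summarizer-chunked",   # placeholder
        per_device_train_batch_size=2,     # small batches for 1024-token chunks
        gradient_accumulation_steps=8,     # effective batch size of 2 * 8 = 16
        fp16=True,                         # mixed precision to reduce memory
        num_train_epochs=3,
    )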
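Sketch 4, for loss calculation: a hypothetical length-weighted average of per-segment losses. The per-segment losses and token counts are assumed to come from your own training loop, and this is one aggregation scheme among several.

    import torch

    def aggregate_segment_losses(segment_losses, segment_lengths):
        # segment_losses: mean loss per segment (list of scalar tensors)
        # segment_lengths: number of non-padding target tokens per segment
        losses = torch.stack(segment_losses)
        weights = torch.tensor(segment_lengths, dtype=losses.dtype,
                               device=losses.device)
        # Weight by length so short trailing segments don't dominate the mean.
        return (losses * weights).sum() / weights.sum()

    # e.g. loss = aggregate_segment_losses([l1, l2, l3], [256, 256, 97])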
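Sketch 5, for evaluation: stitch per-chunk predictions back into one summary per document before scoring with the evaluate library. Here chunk_to_doc is assumed to be the documents’ overflow_to_sample_mapping, and chunks are assumed to arrive in document order.

    import evaluate

    rouge = evaluate.load("rouge")

    def document_level_rouge(chunk_predictions, chunk_to_doc, reference_summaries):
        # Concatenate each document's chunk summaries in order, then score
        # against the full (unchunked) reference summaries.
        stitched = [""] * len(reference_summaries)
        for pred, doc_idx in zip(chunk_predictions, chunk_to_doc):
            stitched[doc_idx] = (stitched[doc_idx] + " " + pred).strip()
        return rouge.compute(predictions=stitched, references=reference_summaries)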
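Sketch 6, for segment-boundary tokens: if you try boundary markers, register them as special tokens and resize the embeddings. The <seg> string is made up; tokenizer and model are as loaded in sketch 2.

    # Hypothetical boundary marker between segments.
    tokenizer.add_special_tokens({"additional_special_tokens": ["<seg>"]})
    model.resize_token_embeddings(len(tokenizer))

    # Then join segments with the marker before tokenizing, e.g.:
    # text = " <seg> ".join(segment_texts)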
Good luck! :smiley: