Diverse Generations for pseudolabeling

sshleifer · September 18, 2020, 2:25pm

@yjernite @patrickvonplaten @valhalla:
What are good kwargs to get 10 diverse summaries from BART? The top 10 beams all score > 98 ROUGE against each other (i.e. they are barely different).

I am working a bit on pseudolabeling and have gotten huge gains from using the following strategy:

1. Generate the best summary with default BART parameters.
2. If rouge(generated, train label) > 0.25: add (src example, generated example) to the dataset.

I modified the strategy a bit to consider all top 10 generations (adding at most 1 for each example), and that also works well, though I haven’t run a controlled comparison to check whether it works better.

What I do know is that if you blindly add the pseudolabel to the dataset (i.e. drop the ROUGE check in bullet 2), it hurts performance.
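The filtering strategy described above can be sketched roughly as follows. This is a minimal illustration, not the code used in the experiments: the post does not say which ROUGE variant was used or how one of the top 10 candidates was chosen, so the whitespace-tokenized ROUGE-1 F1 and the "take the highest-scoring candidate" rule are assumptions, and all function names are hypothetical.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between two whitespace-tokenized strings (assumed metric)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def select_pseudolabel(candidates, train_label, threshold=0.25):
    """Return the candidate closest to the train label if it clears the threshold, else None."""
    if not candidates:
        return None
    best = max(candidates, key=lambda c: rouge1_f1(c, train_label))
    return best if rouge1_f1(best, train_label) > threshold else None


def build_pseudolabeled_pairs(examples, threshold=0.25):
    """examples: iterable of (src, train_label, candidates) triples."""
    pairs = []
    for src, train_label, candidates in examples:
        chosen = select_pseudolabel(candidates, train_label, threshold)
        if chosen is not None:
            pairs.append((src, chosen))  # at most one pseudolabel per example
    return pairs
```

Passing a single-element candidate list reproduces the original two-step strategy; passing the top 10 generations gives the modified one.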

Also if there is a systematic study on this stuff I would definitely read it!

sshleifer · September 18, 2020, 2:27pm

Yacine writes:

Haha actually I’ve worked on that a fair bit.
You probably want to be sampling rather than using beam search there.
It works better for semi-supervised training with back-translation (got some pretty OK results on Gigaword at least).

yjernite · September 18, 2020, 2:32pm

I misunderstood the use case a little bit

For back-translation, generating artificial examples with sampling works pretty well in my experience, especially if you’re iterating in both directions (summarization model and expansion model):

There’s also a great paper from the FAIR NY folks on self-training with beam generated outputs:

Iirc the model basically learns to reproduce the beam search output with greedy search, so it would improve greedy search performance

valhalla · September 18, 2020, 4:35pm

Not quite what you are looking for, but this is an interesting paper on unsupervised summarization.
It uses GPT-2 as a kind of pseudo-summarizer, then scores the summaries it generates using two other models for fluency and coverage, and trains the summarizer with reinforcement learning to maximize the fluency and coverage scores.

valhalla · September 18, 2020, 6:04pm

top_p=0.9, top_k=60, do_sample=True: this seems to produce distinct summaries, but the summaries look bad.
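For readers unfamiliar with these kwargs, here is a toy sketch of what top_k and top_p do to a single next-token distribution before sampling. It is a simplified pure-Python illustration operating on a dict of probabilities, not the library's implementation, and the function names are hypothetical.

```python
import random


def top_k_top_p_filter(probs, top_k=60, top_p=0.9):
    """Keep the top_k most likely tokens, then the smallest prefix of them
    whose cumulative probability reaches top_p, and renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


def sample_token(probs, rng, top_k=60, top_p=0.9):
    """Draw one token from the filtered, renormalized distribution."""
    filtered = top_k_top_p_filter(probs, top_k, top_p)
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]
```

Because every step draws from the truncated distribution instead of taking the most likely continuation, repeated calls diverge from each other, which is where the diversity (and some of the quality loss) comes from.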

sshleifer · September 19, 2020, 4:01pm

I read the FAIR self-training paper and it is very relevant, but I am struggling to understand the specifics of the approach.

Things I understood:

“use beam search decoding (beam size 5) to create the pseudo targets and to report BLEU on test set.” This outperforms sampling.
They train with dropout rate at 0.3. This helps.
They don’t do any cleaning or checking of the pseudolabels against the ground truth target.
All experiments are run on 8 GPUs with an effective batch size of 33K tokens.

Things I did not understand

In Figure 1 (page 3)

What is the difference between the light shade (pseudo-training) and dark shade (fine-tune) bars?

Section 3.2 tries to explain

(Below is excerpted, then converted to markdown.)
In Figure 1, we use green bars to show the result of applying self-training for three iterations. We include both

(1) pseudo-training (PT): the first step of self-training where we train a new model (from scratch) using only the pseudo parallel data generated by the current model, and
(2) fine-tuning (FT): the fine-tuned system using real parallel data based on the pretrained model from the PT step.
Note that in the fine-tuning step the system is re-initialized from scratch.
Surprisingly, we find that the pseudo-training step at the first iteration is able to improve BLEU even if the model is only trained on its own predictions, and fine-tuning further boosts the performance. The test BLEU keeps improving over the first three iterations, until convergence to outperform the initial baseline by 3 BLEU points.

So when they wrote: “the fine-tuned system using real parallel data based on the pretrained model from the PT step”,
I guess they mean: at each iteration, the fine-tuned system uses real parallel data and self-training data, based on the trained model from the last step.

Pseudocode of my understanding:

def fair_self_training_procedure(parallel_data, unlabeled_data, validation_data, mode='pseudo training'):
    # parallel_data: 100K (English, German) sentence pairs
    # unlabeled_data: 3M English sentences
    # validation_data: the original validation set
    # randomly_initialize, train and inject_noise are placeholders, not real library functions

    # baseline
    model = randomly_initialize('transformer')
    model = train(model, parallel_data, dropout=0.3)
    baseline_performance = model.evaluate(validation_data)  # 15.6

    pseudo_dataset = list(zip(unlabeled_data, model.generate(unlabeled_data, num_beams=5)))
    if mode == 'fine-tune':  # HELP
        pseudo_dataset = pseudo_dataset + parallel_data
    scores = []
    for iteration in range(3):  # iteration 1,2,3
        model = randomly_initialize('transformer')
        # Even in the fine-tuning step the system is re-initialized from scratch.
        inject_noise(pseudo_dataset)
        model = train(model, pseudo_dataset, dropout=0.3)
        scores.append(model.evaluate(validation_data))
        pseudo_dataset = list(zip(unlabeled_data, model.generate(unlabeled_data, num_beams=5)))
        if mode == 'fine-tune':  # HELP
            pseudo_dataset = pseudo_dataset + parallel_data
    return baseline_performance, scores

yjernite · September 24, 2020, 2:20pm

I read it the same way: at each iteration, a new model is initialized from scratch, then pseudo-trained (PT) on the artificial data, then fine-tuned (FT) on the real data.

That is consistent with how they present it in Section 2:
In our preliminary experiments, we find that the separate training strategy with the whole pseudo parallel dataset (i.e. S = {(x, fθ(x))|x ∈ U}) produces better or equal performance for neural sequence generation while being simpler. Therefore, in the remainder of this paper we use this simpler setting.
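The "fresh model, then PT, then FT" reading can be sketched as a small loop. This is only an illustration of that interpretation, not the paper's code: the model, training, generation and evaluation steps are passed in as hypothetical callables, and dropout and input noise are left out for brevity.

```python
def self_train(init_model, train, generate, evaluate,
               parallel_data, unlabeled_data, iterations=3):
    """Each iteration: fresh model -> pseudo-train (PT) -> fine-tune (FT)."""
    # Baseline: train on the real parallel data only.
    model = train(init_model(), parallel_data)
    scores = [("baseline", evaluate(model))]

    for i in range(1, iterations + 1):
        # Label the unlabeled sources with the current model.
        pseudo_data = [(x, generate(model, x)) for x in unlabeled_data]

        # PT: a new model, trained only on the pseudo-parallel data.
        model = train(init_model(), pseudo_data)
        scores.append((f"PT{i}", evaluate(model)))

        # FT: continue training that same model on the real data.
        model = train(model, parallel_data)
        scores.append((f"FT{i}", evaluate(model)))
    return model, scores
```

Under this reading, the two bars per iteration in Figure 1 would correspond to the PT and FT entries recorded in `scores`.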

Jung · December 10, 2020, 9:36am

Just for reference, this may be useful to the community on this topic:

github.com/huggingface/transformers

Diverse beam search (huggingface:master ← ayushtiku5:diverse_beam_search)

opened 05:42PM - 18 Nov 20 UTC by ayushtiku5, +590 -20

# What does this PR do?

Implementation of diverse beam search decoding as described in the paper: https://arxiv.org/pdf/1610.02424.pdf

Diversity function reference taken from: https://github.com/ashwinkalyan/dbs

## Implementation details

Consider a T5 summarization task.

`article="Justin Timberlake and Jessica Biel, welcome to parenthood. The celebrity couple announced the arrival of their son, Silas Randall Timberlake, in statements to People. "Silas was the middle name of Timberlake's maternal grandfather Bill Bomar, who died in 2012, while Randall is the musician's own middle name, as well as his father's first," People reports. The couple announced the pregnancy in January, with an Instagram post. It is the first baby for both."`

Generation using normal beam search can be done as:

`model.generate( input_ids=input_ids, num_beams=2, num_return_sequences=2 )`

This generates:

`['the couple announced the pregnancy in January. it is the first baby for both.', 'the couple announced the pregnancy in January. it is the first baby for both of them ']`

Generation using diverse beam search can be done as:

`model.generate( input_ids=input_ids, num_beams=2, num_return_sequences=2, beam_groups=2, diversity_penalty=1.5 )`

This generates:

`['the couple announced the pregnancy in January. it is the first baby for both.', 'Justin Timberlake and Jessica Biel have welcomed their son, Silas Randall ']`

This means that 2 beams will be divided into 2 groups of 1 beam each, ensuring diversity between each group.

NOTE: If `beam_groups=1`, then it will be same as the normal beam search as all the beams belong to the same group. Higher `diversity_penalty` will ensure more diversity between the groups of beams. When doing generation using diverse beam search, we need to ensure that `num_beams>=beam_groups` and also `num_beams` is divisible by `beam_groups`.

## Who can review?

@patrickvonplaten, @TevenLeScao
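To make the role of `diversity_penalty` concrete, here is a toy single-step sketch of the Hamming-style penalty idea: a token's score is lowered once for every earlier group that already picked it at the current step. It is a simplified illustration with hypothetical function names and one greedy pick per group, not the code from the PR. Note also that the PR description uses `beam_groups`; I believe released versions of the library name this argument `num_beam_groups`, so check the documentation for your version.

```python
from collections import Counter


def check_beam_groups(num_beams, beam_groups):
    """Validate the constraint stated in the PR and return beams per group."""
    if num_beams < beam_groups or num_beams % beam_groups != 0:
        raise ValueError("num_beams must be a multiple of beam_groups")
    return num_beams // beam_groups


def apply_diversity_penalty(log_probs, tokens_from_earlier_groups, diversity_penalty):
    """Lower each token's score by diversity_penalty times the number of
    earlier groups that already picked it at this decoding step."""
    counts = Counter(tokens_from_earlier_groups)
    return {tok: lp - diversity_penalty * counts[tok] for tok, lp in log_probs.items()}


def pick_tokens_per_group(log_probs, beam_groups, diversity_penalty):
    """Greedy one-step toy: each group picks its best token in turn."""
    chosen = []
    for _ in range(beam_groups):
        scores = apply_diversity_penalty(log_probs, chosen, diversity_penalty)
        chosen.append(max(scores, key=scores.get))
    return chosen
```

With a penalty of 0 every group picks the same top token, which mirrors the near-duplicate beams described at the start of the thread; a larger penalty pushes later groups toward different tokens.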
