Diverse Generations for pseudolabeling

@yjernite @patrickvonplaten @valhalla:
What are good kwargs to get 10 diverse summaries from BART? The top 10 beams all score > 98 ROUGE against each other (i.e., they barely differ).
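
One way to put a number on "barely different" is to average a similarity score over all pairs of candidates. The following is a minimal sketch, assuming a simplified ROUGE-1 F1 computed on lowercased whitespace tokens rather than the official ROUGE scorer. The names `rouge1_f1` and `mean_pairwise_similarity` are made up for illustration, and scores are on a 0 to 1 scale rather than 0 to 100.

```python
from collections import Counter
from itertools import combinations


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap on lowercased whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # Counter & Counter keeps the minimum counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def mean_pairwise_similarity(candidates: list[str]) -> float:
    """Average similarity over all unordered pairs (1.0 means all candidates match)."""
    pairs = list(combinations(candidates, 2))
    if not pairs:
        raise ValueError("need at least two candidates")
    return sum(rouge1_f1(a, b) for a, b in pairs) / len(pairs)


# Toy stand-ins for beam outputs; real candidates would come from model.generate.
beams = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "the cat sat on the mat .",
]
print(round(mean_pairwise_similarity(beams), 3))
```

A value close to 1.0 over the 10 candidates would correspond to the near-duplicate beams described above; a lower value means the decoding settings are producing more varied outputs.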

I am working a bit on pseudolabeling and have gotten huge gains from using the following strategy:

  • generate best summary with default bart parameters
  • if rouge(generated, train label) > 0.25: add (src example, generated example) to the dataset

I modified the strategy a bit to consider all top 10 generations (adding at most 1 for each example), and that also works well, though I haven’t run a controlled comparison to see whether it works better.

What I do know is that if you blindly add the pseudolabel to the dataset (i.e., skip the ROUGE check in bullet 2), it hurts performance.
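
The filtering strategy can be sketched roughly as follows. This is an illustration under assumptions, not the exact code used: `pick_pseudolabel` and `build_pseudolabeled_pairs` are hypothetical names, the scorer is passed in as `score_fn`, and picking the best-scoring candidate out of the top 10 is a guess at how "at most 1 for each example" was enforced. The demo uses a toy token-overlap scorer in place of ROUGE.

```python
from typing import Callable, Iterable, Optional


def pick_pseudolabel(
    candidates: Iterable[str],
    reference: str,
    score_fn: Callable[[str, str], float],
    threshold: float = 0.25,
) -> Optional[str]:
    """Return the best-scoring candidate if it clears the threshold, else None."""
    scored = [(score_fn(c, reference), c) for c in candidates]
    if not scored:
        return None
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best if best_score > threshold else None


def build_pseudolabeled_pairs(examples, generate_fn, score_fn, threshold=0.25):
    """examples: iterable of (source, train_label); generate_fn(source) -> candidates."""
    pairs = []
    for source, label in examples:
        chosen = pick_pseudolabel(generate_fn(source), label, score_fn, threshold)
        if chosen is not None:
            pairs.append((source, chosen))  # at most one pseudolabel per example
    return pairs


def token_jaccard(a: str, b: str) -> float:
    """Toy stand-in for ROUGE: Jaccard overlap of the two token sets."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


examples = [
    ("long article one", "cats sleep a lot"),
    ("long article two", "dogs bark loudly"),
]
# Canned "generations" so the sketch runs without a model.
fake_generations = {
    "long article one": ["cats sleep often", "birds fly"],
    "long article two": ["the weather is nice"],
}
pairs = build_pseudolabeled_pairs(examples, fake_generations.__getitem__, token_jaccard)
print(pairs)
```

Passing a single-element candidate list reproduces the original one-summary version of the strategy; setting `threshold` below zero corresponds to adding pseudolabels blindly.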

Also if there is a systematic study on this stuff I would definitely read it!


Yacine writes:

  • Haha actually I’ve worked on that a fair bit.
  • You probably want to be sampling rather than using beam search there
  • it works better for semi-supervised training with back-translation (got some pretty OK results on gigaword at least)

I misunderstood the use case a little bit :slight_smile:

For back-translation, generating artificial examples with sampling works pretty well in my experience, especially if you’re iterating in both directions (summarization model and expansion model):

There’s also a great paper from the FAIR NY folks on self-training with beam generated outputs:

IIRC, the model basically learns to reproduce the beam search output with greedy search, so it would improve greedy search performance.


Not quite what you are looking for, but this is an interesting paper on unsupervised summarization.
It uses GPT-2 as a kind of pseudo-summarizer, then scores the summaries it generates using two other models (one for fluency, one for coverage) and trains the summarizer with reinforcement learning to maximize the fluency and coverage scores :exploding_head:

top_p=0.9, top_k=60, do_sample=True seems to produce distinct summaries, but the summaries look bad.
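
For reference, here is roughly how those kwargs could be passed to `generate` in transformers. This is a sketch under assumptions: the checkpoint name `facebook/bart-large-cnn`, `num_return_sequences=10`, and `num_beams=1` are choices made for the example and are not stated above. `num_beams=1` is set explicitly because a checkpoint's default generation config may enable beam search, and the number of returned sequences cannot exceed the number of beams in that case.

```python
SAMPLING_KWARGS = dict(
    do_sample=True,           # sample instead of taking the highest-scoring beams
    top_p=0.9,                # nucleus sampling: keep the smallest token set with mass >= 0.9
    top_k=60,                 # also restrict to the 60 most likely tokens
    num_beams=1,              # assumption: pure sampling, overrides any beam-search default
    num_return_sequences=10,  # assumption: one call returns 10 candidates
)


def sample_summaries(text: str, model_name: str = "facebook/bart-large-cnn") -> list[str]:
    """Return sampled summaries for one document (downloads the checkpoint on first use)."""
    # Imported lazily so this file still loads where transformers is not installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    outputs = model.generate(**inputs, **SAMPLING_KWARGS)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

Lowering `top_p` or `top_k` trades diversity for fidelity, which may help with the quality problem, though that is untested here.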

I read the FAIR self-training paper and it is very relevant, but I am struggling to understand the specifics of the approach.

Things I understood:

  • “use beam search decoding (beam size 5) to create the pseudo targets and to report BLEU on test set.” This outperforms sampling.
  • They train with dropout rate at 0.3. This helps.
  • They don’t do any cleaning or checking of the pseudolabels against the ground truth target.
  • all experiments are run on 8 GPUs with an effective batch size of 33K tokens.

Things I did not understand

In Figure 1 (page 3)

What is the difference between the light shade (pseudo-training) and dark shade (fine-tune) bars?

Section 3.2 tries to explain

(Below is excerpted, then converted to markdown)
In Figure 1, we use green bars to show the result of applying self-training for three iterations. We include both

  • (1) pseudo-training (PT): the first step of self-training where we train a new model (from scratch) using only the pseudo parallel data generated by the current model, and
  • (2) fine-tuning (FT): the fine-tuned system using real parallel data based on the pretrained model from the PT step.
  • Note that in the fine-tuning step the system is re-initialized from scratch.
  • Surprisingly, we find that the pseudo-training step at the first iteration is able to improve BLEU even if the model is only trained on its own predictions, and fine-tuning further boosts the performance. The test BLEU keeps improving over the first three iterations, until convergence to outperform the initial baseline by 3 BLEU points.

So when they wrote “the fine-tuned system using real parallel data based on the pretrained model from the PT step”,
I guess they mean: at each iteration, the fine-tuned system uses real parallel data and self-training data, starting from the trained model from the last step.

Pseudocode of my understanding:

def fair_self_training_procedure(parallel_data, unlabeled_data, validation_data, mode='pseudo training'):
    # parallel_data: list of 100K (English, German) sentence pairs
    # unlabeled_data: list of 3M English sentences
    # randomly_initialize, train, model.generate and model.evaluate are placeholders, not a real API

    # baseline
    model = randomly_initialize('transformer')
    model = train(model, parallel_data, dropout=0.3)
    baseline_performance = model.evaluate(validation_data)  # 15.6

    # pseudo-label the unlabeled sentences with beam search
    pseudo_dataset = list(zip(unlabeled_data, model.generate(unlabeled_data, num_beams=5)))
    if mode == 'fine-tune':  # HELP
        pseudo_dataset = pseudo_dataset + parallel_data
    scores = []
    for iteration in range(3):  # iterations 1, 2, 3
        # Even in the fine-tuning step the system is re-initialized from scratch.
        model = randomly_initialize('transformer')
        model = train(model, pseudo_dataset, dropout=0.3)
        scores.append(model.evaluate(validation_data))
        pseudo_dataset = list(zip(unlabeled_data, model.generate(unlabeled_data, num_beams=5)))
        if mode == 'fine-tune':  # HELP
            pseudo_dataset = pseudo_dataset + parallel_data
    return baseline_performance, scores

I read it the same way: at each iteration, a new model is initialized from scratch, then pseudo-trained (PT) on the artificial data, then fine-tuned (FT) on the real data.

That is consistent with how they present it in Section 2:
In our preliminary experiments, we find that the separate training strategy with the whole pseudo parallel dataset (i.e. S = {(x, fθ(x))|x ∈ U}) produces better or equal performance for neural sequence generation while being simpler. Therefore, in the remainder of this paper we use this simpler setting.
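
Under that reading, the loop could be sketched as below. This is an interpretation of the paper as discussed in this thread, not a confirmed reproduction of the authors' code. The function name and the `init_model`, `train`, `generate`, and `evaluate` callables are hypothetical and supplied by the caller; training details such as dropout 0.3 and beam size 5 would live inside them.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]


def self_training(
    parallel_data: Sequence[Pair],
    unlabeled_data: Sequence[str],
    init_model: Callable[[], object],
    train: Callable[[object, Sequence[Pair]], object],
    generate: Callable[[object, Sequence[str]], List[str]],
    evaluate: Callable[[object], float],
    n_iterations: int = 3,
) -> List[Tuple[float, float]]:
    """Return one (PT score, FT score) tuple per iteration."""
    # Baseline: a fresh model trained on the real parallel data only.
    model = train(init_model(), parallel_data)
    scores = []
    for _ in range(n_iterations):
        # Label every unlabeled source with the current model.
        pseudo_data = list(zip(unlabeled_data, generate(model, unlabeled_data)))
        # PT: a new model, initialized from scratch, sees only pseudo pairs.
        model = train(init_model(), pseudo_data)
        pt_score = evaluate(model)
        # FT: the same model continues training on the real pairs.
        model = train(model, parallel_data)
        scores.append((pt_score, evaluate(model)))
    return scores


# Toy stubs: a "model" is just the list of dataset sizes it was trained on.
toy_scores = self_training(
    parallel_data=[("a", "A"), ("b", "B")],
    unlabeled_data=["c", "d", "e"],
    init_model=lambda: [],
    train=lambda model, data: model + [len(data)],
    generate=lambda model, sources: [s.upper() for s in sources],
    evaluate=lambda model: float(len(model)),
)
print(toy_scores)
```

The pseudo data and the real data are never mixed into one training set here; they are used in two separate stages, which is how I read the "separate training strategy" in the excerpt.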

Just for reference, this may be useful to the community on this topic :slight_smile: