@yjernite @patrickvonplaten @valhalla:
What are good kwargs to get 10 diverse summaries from BART? The top 10 beams are all >98 ROUGE against each other, i.e. barely different.
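For context, the kwargs I mean are things like transformers' diverse beam search parameters. A placeholder sketch of what I'm asking about (the checkpoint and the values for num_beams, num_beam_groups, and diversity_penalty are arbitrary, not recommendations, and this assumes a transformers version that supports num_beam_groups):

from transformers import BartForConditionalGeneration, BartTokenizer

tok = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

src_text = "<long source document>"
batch = tok([src_text], return_tensors="pt", truncation=True)

# Diverse beam search: 10 groups of 1 beam each; later groups are penalized
# for repeating tokens already chosen by earlier groups.
outputs = model.generate(
    **batch,
    num_beams=10,
    num_beam_groups=10,
    diversity_penalty=1.0,
    num_return_sequences=10,
)
summaries = [tok.decode(o, skip_special_tokens=True) for o in outputs]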
I am working a bit on pseudolabeling and have gotten huge gains from using the following strategy:
- generate the best summary with default bart parameters
- if rouge(generated, train label) > 0.25: add (src example, generated summary) to the dataset
I modified the strategy a bit to consider all top 10 generations (adding at most 1 for each example), and that also works well, though I haven't run a controlled comparison of whether it works better.
What I do know is that if you blindly add the pseudolabel to the dataset (remove bullet 2), it hurts performance.
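A rough sketch of the filtering strategy above (the ROUGE-2 F-measure variant, the helper names, and the use of the rouge_score package are my own choices here, not a fixed recipe):

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge2"], use_stemmer=True)

def make_pseudolabels(model, tokenizer, train_pairs, threshold=0.25):
    """train_pairs: list of (source_document, reference_summary) tuples."""
    pseudo = []
    for src, ref in train_pairs:
        batch = tokenizer([src], return_tensors="pt", truncation=True)
        out = model.generate(**batch)  # default bart generation parameters
        gen = tokenizer.decode(out[0], skip_special_tokens=True)
        # Keep the generated summary only if it is close enough to the real label
        if scorer.score(ref, gen)["rouge2"].fmeasure > threshold:
            pseudo.append((src, gen))
    return pseudo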
Also if there is a systematic study on this stuff I would definitely read it!
For back-translation, generating artificial examples with sampling works pretty well in my experience, especially if you're iterating in both directions (summarization model and expansion model):
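For illustration, the expansion direction with sampling could look roughly like this (the checkpoint name is a stand-in for a trained summary-to-document model, and the sampling values are placeholders):

import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tok = BartTokenizer.from_pretrained("facebook/bart-large")
expander = BartForConditionalGeneration.from_pretrained("facebook/bart-large")  # stand-in for a trained summary -> document model

def sample_expansions(summaries, num_samples=4):
    """Turn real summaries into (sampled document, summary) pairs for the summarizer."""
    pairs = []
    for summ in summaries:
        batch = tok([summ], return_tensors="pt", truncation=True)
        with torch.no_grad():
            outs = expander.generate(
                **batch,
                do_sample=True,  # sampling rather than beam search, for diversity
                top_p=0.95,
                num_return_sequences=num_samples,
            )
        for o in outs:
            pairs.append((tok.decode(o, skip_special_tokens=True), summ))
    return pairs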
There's also a great paper from the FAIR NY folks on self-training with beam-generated outputs:
Iirc the model basically learns to reproduce the beam search output with greedy search, so it would improve greedy search performance
Not quite what you are looking for, but this is an interesting paper on unsupervised summarization,
which uses GPT-2 as a kind of pseudo-summarizer, then scores the summaries it generates using two other models for fluency and coverage, and trains the summarizer with reinforcement learning to maximize the fluency and coverage score.
I read the FAIR self-training paper and it is very relevant, but I am struggling to understand the specifics of the approach.
Things I understood:
- "use beam search decoding (beam size 5) to create the pseudo targets and to report BLEU on test set." This outperforms sampling.
- They train with dropout rate at 0.3. This helps.
- They don't do any cleaning or checking of the pseudolabels against the ground-truth target.
- All experiments are run on 8 GPUs with an effective batch size of 33K tokens.
Things I did not understand:
- In Figure 1 (page 3): what is the difference between the light-shade (pseudo-training) and dark-shade (fine-tune) bars?
Section 3.2 tries to explain:
(Below is excerpted, then reformatted as markdown.)
In Figure 1, we use green bars to show the result of applying self-training for three iterations. We include both
(1) pseudo-training (PT): the first step of self-training where we train a new model (from scratch) using only the pseudo parallel data generated by the current model, and
(2) fine-tuning (FT): the fine-tuned system using real parallel data based on the pretrained model from the PT step.
Note that in the fine-tuning step the system is re-initialized from scratch.
Surprisingly, we find that the pseudo-training step at the first iteration is able to improve BLEU even if the model is only trained on its own predictions, and fine-tuning further boosts the performance. The test BLEU keeps improving over the first three iterations, until convergence to outperform the initial baseline by 3 BLEU points.
So when they wrote: "the fine-tuned system using real parallel data based on the pretrained model from the PT step",
I guess they mean: at each iteration, the fine-tuned system uses real parallel data and self-training data based on the trained model from the last step.
Pseudocode of my understanding:
def fair_self_training_procedure(parallel_data, unlabeled_data, validation_data, mode='pseudo-training'):
    # parallel_data: 100K pairs of (English, German) sentences
    # unlabeled_data: 3M English sentences

    # Baseline: train a transformer from scratch on the real parallel data only
    model = randomly_initialize('transformer')
    model = train(model, parallel_data, dropout=0.3)
    baseline_score = model.evaluate(validation_data)  # 15.6 BLEU in the paper

    # Pseudo targets: beam search with beam size 5, no filtering against ground truth
    pseudo_dataset = list(zip(unlabeled_data, model.generate(unlabeled_data, num_beams=5)))
    if mode == 'fine-tune':  # HELP: is this how the real data comes back in?
        pseudo_dataset = pseudo_dataset + parallel_data

    scores = []
    for iteration in range(3):  # iterations 1, 2, 3
        # Even in the fine-tuning step the system is re-initialized from scratch.
        model = randomly_initialize('transformer')
        pseudo_dataset = inject_noise(pseudo_dataset)
        model = train(model, pseudo_dataset, dropout=0.3)
        scores.append(model.evaluate(validation_data))

        # Regenerate the pseudo targets with the newly trained model
        pseudo_dataset = list(zip(unlabeled_data, model.generate(unlabeled_data, num_beams=5)))
        if mode == 'fine-tune':  # HELP: same question as above
            pseudo_dataset = pseudo_dataset + parallel_data

    return baseline_score, scores
I read it the same way: at each iteration, a new model is initialized from scratch, then pseudo-trained (PT) on the artificial data, then fine-tuned (FT) on the real data.
That is coherent with how they present it in Section 2: "In our preliminary experiments, we find that the separate training strategy with the whole pseudo parallel dataset (i.e. S = {(x, fθ(x)) | x ∈ U}) produces better or equal performance for neural sequence generation while being simpler. Therefore, in the remainder of this paper we use this simpler setting."