Hi, I would like to try the approach suggested in “Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks” (link) for BART. I have my own dataset, but two things are still unclear to me.
I believe I should start with BartForConditionalGeneration, since that is the BART model with the language-modeling head. Is that right?
Can anyone provide more details on the noising algorithm that was used to pretrain the model? The BART paper is pretty vague about it; these are the only details I found:
A number of text spans are sampled, with span lengths drawn from a Poisson distribution (λ = 3)
We mask 30% of tokens in each document, and permute all sentences.
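To make the question concrete, here is my rough reading of those two sentences as code. It is only a sketch under my own assumptions, not the authors' implementation: the function names (`sample_poisson`, `permute_sentences`, `text_infilling`, `noise_document`) are hypothetical, I permute sentences before masking, each sampled span is replaced by a single `<mask>` token, spans are placed uniformly at random without overlap, and zero-length spans are simply skipped (which may well differ from the original).

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) sample with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def permute_sentences(sentences, rng):
    """Return the sentences in a random order (the input is not modified)."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled


def text_infilling(tokens, mask_token="<mask>", mask_ratio=0.3, lam=3.0, rng=None):
    """Replace random spans with a single mask token each.

    Span lengths are drawn from Poisson(lam); sampling stops once about
    mask_ratio of the tokens are covered. Assumption: zero-length spans
    are skipped instead of inserting an extra mask token.
    """
    if rng is None:
        rng = random.Random()
    n = len(tokens)
    budget = int(round(n * mask_ratio))  # how many tokens we may cover
    covered = [False] * n
    span_at = {}  # span start index -> span length
    masked, attempts = 0, 0
    while masked < budget and attempts < 10 * n:  # cap retries to avoid spinning
        attempts += 1
        length = min(sample_poisson(lam, rng), budget - masked)
        if length == 0:
            continue
        start = rng.randrange(0, n - length + 1)
        if any(covered[start:start + length]):
            continue  # overlaps an earlier span, try again
        for i in range(start, start + length):
            covered[i] = True
        span_at[start] = length
        masked += length

    noised, i = [], 0
    while i < n:
        if i in span_at:
            noised.append(mask_token)  # the whole span collapses to one mask
            i += span_at[i]
        else:
            noised.append(tokens[i])
            i += 1
    return noised


def noise_document(sentences, rng=None):
    """Permute sentences (lists of tokens), flatten them, then infill."""
    if rng is None:
        rng = random.Random()
    flat = [tok for sent in permute_sentences(sentences, rng) for tok in sent]
    return text_infilling(flat, rng=rng)
```

For example, `noise_document([["The", "cat", "sat", "."], ["It", "purred", "."]], rng=random.Random(0))` returns a shuffled, partially masked token list, and the original unpermuted tokens would be the target for the decoder. If the real algorithm differs from this (for instance in how zero-length spans or whole words are handled), that is exactly the detail I am trying to find out.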