Hi, I would like to try the approach suggested in “Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks” (link) for BART. I have my own dataset, but there are two things that are still unclear to me:
- I believe I should start with BartForConditionalGeneration, as that is the LM model. Is that right? (That is what I assume in the sketch below.)
- Can anyone provide more details on the noising algorithm that was used to train the model? The paper is pretty vague about it; these are the only details I found:
  - “A number of text spans are sampled, with span lengths drawn from a Poisson distribution (λ = 3)”
  - “We mask 30% of tokens in each document, and permute all sentences.”
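To make the question more concrete, here is a rough sketch of what I currently have in mind, based only on those two sentences. Everything beyond them is my own assumption: facebook/bart-base as the starting checkpoint, BartForConditionalGeneration as the model class, a naive split on “. ” for sentence boundaries, my own loop for drawing spans until roughly 30% of the tokens are masked, and a made-up example document. Corrections very welcome.

```python
import random

import numpy as np
from transformers import BartForConditionalGeneration, BartTokenizerFast

# Assumption on my part: continue pretraining from facebook/bart-base
# (bart-large should work the same way).
tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

MASK_RATIO = 0.30    # "We mask 30% of tokens in each document"
POISSON_LAMBDA = 3   # "span lengths drawn from a Poisson distribution (λ = 3)"


def permute_sentences(text):
    """'Permute all sentences' -- here with a naive '. ' split (my assumption)."""
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
    random.shuffle(sentences)
    return ". ".join(s.rstrip(".") for s in sentences) + "."


def text_infilling(tokens, mask_token):
    """Replace sampled spans with a single mask token each.

    Spans are drawn until roughly MASK_RATIO of the tokens are covered;
    a span of length 0 just inserts a mask token. The original fairseq
    implementation is more careful than this (whole-word spans, no double
    masking); this only illustrates the idea as I understand it.
    """
    tokens = list(tokens)
    num_to_mask = int(round(len(tokens) * MASK_RATIO))
    masked = 0
    while masked < num_to_mask:
        span_len = int(np.random.poisson(POISSON_LAMBDA))
        if span_len == 0:
            tokens.insert(random.randrange(len(tokens) + 1), mask_token)
            continue
        start = random.randrange(len(tokens))
        end = min(start + span_len, len(tokens))
        tokens[start:end] = [mask_token]
        masked += end - start
    return tokens


def noise_document(text):
    """Apply sentence permutation + text infilling, return the noised string."""
    tokens = tokenizer.tokenize(permute_sentences(text))
    noised = text_infilling(tokens, tokenizer.mask_token)
    return tokenizer.convert_tokens_to_string(noised)


# One training pair: noised document as input, original document as target.
original = "My domain-specific document. It has a few sentences. BART should learn to reconstruct it."
noised = noise_document(original)

inputs = tokenizer(noised, return_tensors="pt")
labels = tokenizer(original, return_tensors="pt").input_ids
loss = model(**inputs, labels=labels).loss
print(noised)
print(float(loss))
```

If this is roughly right, I would then wrap the noising in a data collator and run a normal seq2seq training loop over my domain corpus.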