In the BERT paper, it says:
> We train with batch size of 256 sequences (256 sequences * 512 tokens = 128,000 tokens/batch) for 1,000,000 steps, which is approximately 40 epochs over the 3.3 billion word corpus.
How does this arithmetic work out?
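As far as I can tell, the numbers only line up if you treat "word" and "token" as interchangeable and allow for rounding (256 × 512 is actually 131,072, which the paper rounds down to 128,000). A quick sanity check:

```python
# Sanity-check the figures quoted from the BERT paper.
batch_size = 256          # sequences per batch
seq_len = 512             # tokens per sequence
steps = 1_000_000

tokens_per_batch = batch_size * seq_len       # 131,072 (rounded to 128,000 in the paper)
total_tokens = tokens_per_batch * steps       # ~1.31e11 tokens seen during training

corpus_words = 3_300_000_000                  # "3.3 billion word corpus"
epochs = total_tokens / corpus_words          # ~39.7, i.e. "approximately 40 epochs"
print(tokens_per_batch, total_tokens, epochs)
```

So the "approximately 40 epochs" claim seems to assume the corpus has roughly one token per word, which is exactly what my questions below are about.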
What is the unit "word" in "3.3 billion word corpus"? Is it the same as the output of the `wc -w` command on the entire text corpus? And if it is a raw whitespace-delimited token, is there any guarantee that the number of "words" in the corpus matches the number of WordPiece tokens in the dataset after data preparation with `create_pretraining_data.py` (assuming the duplication factor `dupe_factor` is set to 1)?
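My concern in concrete terms: a `wc -w`-style count splits on whitespace, while WordPiece splits rare words into multiple subword pieces, so the two counts should diverge. A minimal sketch with a toy vocabulary (not BERT's real 30k-entry vocab), using the greedy longest-match-first scheme I believe `tokenization.py` implements:

```python
# Toy illustration of whitespace "words" vs. WordPiece tokens.
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword split (WordPiece-style)."""
    tokens, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece   # continuation-piece marker
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            return ["[UNK]"]           # no match at all
        tokens.append(cur)
        start = end
    return tokens

# Hypothetical mini-vocabulary, chosen only to make the example work:
vocab = {"un", "##afford", "##able", "housing", "is", "expensive"}
text = "unaffordable housing is expensive"

words = text.split()                   # what `wc -w` would count: 4 "words"
tokens = [t for w in words for t in wordpiece_tokenize(w, vocab)]
print(len(words), len(tokens))         # 4 words -> 6 WordPiece tokens
```

If this is right, the WordPiece token count is strictly an overestimate of the `wc -w` word count, before truncation even enters the picture.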
According to this line of code, some WordPiece tokens in a training instance are dropped from the front or the back of the sequence whenever it exceeds the maximum sequence length. Is that loss taken into account in the epoch calculation?
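For context, the truncation I mean works roughly like this (paraphrased from memory of `truncate_seq_pair` in `create_pretraining_data.py`, so treat it as a sketch rather than the exact code):

```python
import random

def truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng):
    """Trim the longer of the two segments, randomly from the front or the
    back, until the pair fits. Dropped tokens are simply discarded, so the
    prepared dataset contains fewer tokens than the raw tokenized corpus."""
    while len(tokens_a) + len(tokens_b) > max_num_tokens:
        trunc = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
        if rng.random() < 0.5:
            del trunc[0]   # drop from the front
        else:
            trunc.pop()    # drop from the back

rng = random.Random(12345)
a, b = list(range(300)), list(range(250))
truncate_seq_pair(a, b, 509, rng)   # 512 minus room for [CLS]/[SEP]/[SEP]
print(len(a) + len(b))              # 509: 41 tokens were dropped outright
```

Those 41 tokens never appear in any training instance, which is why I wonder whether they are accounted for in the "40 epochs" figure.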
If I understand this function correctly, when a random next segment is chosen for the NSP task, the segment it displaces (the one that would actually have come next) is "put back" (here) and consumed later. Does this mean the prepared dataset contains more tokens in total than the corpus, because the randomly chosen segments are used twice?
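A hand-built toy example of what I mean (this is my reading of `create_instances_from_document`, not the real sampling code): when the random-next branch fires, segment B is borrowed from another document, while that document still consumes the same segment itself later.

```python
from collections import Counter

# Two toy documents split into segments.
doc_a = ["a0", "a1", "a2"]
doc_b = ["b0", "b1", "b2"]

instances = [
    ("a0", "b1"),   # random-next branch: "a1" is put back, "b1" is borrowed
    ("a1", "a2"),   # the put-back "a1" is still consumed as a first segment
    ("b0", "b1"),   # doc_b consumes "b1" again as its own actual-next segment
    ("b2",),        # leftover segment of doc_b
]

uses = Counter(seg for inst in instances for seg in inst)
print(uses["b1"], sum(uses.values()))   # "b1" is used twice: 7 uses > 6 segments
```

If this picture is accurate, every random-next instance inflates the total token count by the length of the borrowed segment, making "tokens per epoch" slightly larger than the corpus size.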
(I opened an issue on Google’s repository, but I wanted to ask this in this community as well.)