Any progress here? It would be so convenient to train a BERT from scratch using datasets and transformers. Has anyone achieved this with results comparable to the original BERT?
Hi @BramVanroy, is there an example of pretraining BERT on the NSP task with dataset.map? Thanks!
Hi @vblagoje, I found that the file_path param of TextDatasetForNextSentencePrediction takes only one file. Does that mean I need to merge all datasets into one file when splitting sentences? That file would be too big.
To chunk the articles you can check https://huggingface.co/docs/datasets/processing.html#augmenting-the-dataset
The new link is https://huggingface.co/docs/datasets/process#data-augmentation
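For anyone looking for a starting point, here is a rough sketch of how the chunking could be combined with NSP pair creation in a batched map function. This is not the original BERT pipeline: the regex sentence splitter, the helper names (split_sentences, make_nsp_pairs, build_nsp_dataset), the "text" column name, and the fixed seed are all assumptions made for illustration. The label convention (0 = sentence B really follows sentence A, 1 = random sentence) follows what BertForPreTraining expects for next_sentence_label.

```python
import random
import re

# Naive sentence boundary: whitespace that follows ., ! or ? (an assumption;
# a real pipeline would likely use a proper sentence tokenizer).
_SENT_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Split one article into a list of non-empty sentences."""
    return [s for s in _SENT_END.split(text.strip()) if s]


def make_nsp_pairs(batch, seed=0):
    """Batched map function: rows of articles in, rows of sentence pairs out.

    Expects batch["text"] to be a list of article strings. Negative
    examples are drawn from other articles in the same batch only.
    Note: the RNG is re-seeded on every call, so each batch sees the same
    random sequence; good enough for a sketch, not for real pretraining.
    """
    rng = random.Random(seed)
    docs = [split_sentences(t) for t in batch["text"]]
    non_empty = [i for i, doc in enumerate(docs) if doc]
    out = {"sentence_a": [], "sentence_b": [], "next_sentence_label": []}

    for i, doc in enumerate(docs):
        for first, second in zip(doc, doc[1:]):
            label = 0  # 0: `second` is the true continuation of `first`
            # Swap in a random sentence about half the time, provided the
            # batch contains at least one other non-empty article.
            if len(non_empty) > 1 and rng.random() < 0.5:
                j = i
                while j == i:
                    j = rng.choice(non_empty)
                second = rng.choice(docs[j])
                label = 1  # 1: `second` is a random sentence
            out["sentence_a"].append(first)
            out["sentence_b"].append(second)
            out["next_sentence_label"].append(label)
    return out


def build_nsp_dataset(dataset, batch_size=1000):
    """Apply make_nsp_pairs to a datasets.Dataset that has a "text" column.

    remove_columns is needed because the number of output rows differs
    from the number of input rows.
    """
    return dataset.map(
        make_nsp_pairs,
        batched=True,
        batch_size=batch_size,
        remove_columns=dataset.column_names,
    )
```

On the single-file question: as far as I know, load_dataset("text", data_files=[...]) accepts a list of files, so merging everything into one big file should not be necessary. Keep in mind that the text loader yields one row per line by default, so the sketch above assumes each line holds a whole article. The pairs would still need to be tokenized (for example tokenizer(sentence_a, sentence_b, truncation=True)) in another map call before training.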