Hello there
I am trying to run the run_mlm.py script to train a BERT model. The idea is to start from an existing Italian BERT model, continue its pretraining on a specific domain (biomedical texts), and later fine-tune it on a question-answering downstream task.
I was able to run the run_mlm.py script, both with and without the --line_by_line parameter. I have a couple of questions, if you could kindly answer them or point me to the relevant part of the documentation:
- The run with --line_by_line took about 10x longer than the one without. Why? I have full access to the complete dataset and can organize it however I want, so which format is best?
- Is there a way to feed the model multiple files, since my corpus is split across several of them?
- Does this script also train the model on the NSP task?
- If I evaluate the model, I get the perplexity score. Is there a way to get accuracy for the NSP task? (I think accuracy does not make sense for MLM, right?)
Many thanks for your patience
Ok so, regarding point 2 (loading more files at the same time), I tried tweaking the code a bit (around line 280 of the original run_mlm.py script).
ORIGINAL CODE:
else:
    data_files = {}
    if data_args.train_file is not None:
        data_files["train"] = data_args.train_file
        extension = data_args.train_file.split(".")[-1]
    if data_args.validation_file is not None:
        data_files["validation"] = data_args.validation_file
        extension = data_args.validation_file.split(".")[-1]
    if extension == "txt":
        extension = "text"
    raw_datasets = load_dataset(extension, data_files=data_files, cache_dir=model_args.cache_dir)
MY EDITS:
else:
    data_files = {}
    if data_args.train_file is not None:
        # Treat train_file as a directory and collect every file inside it
        files_full_path = []
        for filepath in os.listdir(data_args.train_file):
            files_full_path.append(os.path.join(data_args.train_file, filepath))
        # Pass the whole list of files as the training split
        data_files["train"] = files_full_path
        extension = files_full_path[0].split(".")[-1]
    if data_args.validation_file is not None:
        data_files["validation"] = data_args.validation_file
        extension = data_args.validation_file.split(".")[-1]
    if extension == "txt":
        extension = "text"
    raw_datasets = load_dataset(extension, data_files=data_files, cache_dir=model_args.cache_dir)
Do you think this approach makes sense? I am trying the code right now and it seems to work. My concern is that when I load the entire dataset (several GBs of text), the training will crash or take extremely long (several weeks).
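For reference, here is a quick standalone sketch I used to convince myself that load_dataset accepts a list of files per split (the my_corpus_dir and valid.txt paths are just placeholders I made up, not arguments of run_mlm.py):

# Minimal sketch: load several plain-text files as one training split.
# "my_corpus_dir" and "valid.txt" are placeholder paths, not part of the script.
import glob
import os

from datasets import load_dataset

corpus_dir = "my_corpus_dir"  # directory containing the split .txt files
train_files = sorted(glob.glob(os.path.join(corpus_dir, "*.txt")))

raw_datasets = load_dataset(
    "text",
    data_files={"train": train_files, "validation": "valid.txt"},
)

If load_dataset is happy with a list here, then passing the same list inside the script (as in my edit above) should behave the same way.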
Thanks everyone