Script run_mlm.py line by line

Hello there

I am trying to run the run_mlm.py script to train a BERT model. The idea is to start from an existing Italian BERT model, continue its pretraining on domain-specific text (biomedical texts), and later fine-tune it on a question-answering downstream task.

I was able to run the run_mlm.py script, both with and without the --line_by_line parameter. I have a few questions; if you could kindly answer them or point me to the relevant part of the documentation:

  1. The run with --line_by_line took roughly 10x longer than the run without it. Why is that? I have full access to the complete dataset and can organize it however I want, so which format is best? (I sketch my understanding of the two preprocessing modes right after this list.)

  2. Is there a way to feed the script multiple files, since my corpus is split across several of them?

  3. Does this script train the model for the NSP task as well?

  4. If I evaluate the model, I get the perplexity score. Is there a way to get accuracy for the NSP task? (I think that accuracy does not make sense for MLM, right?)
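
To make question 1 concrete, here is a minimal sketch (not the actual run_mlm.py code) of how I understand the two preprocessing modes; the checkpoint name is just a placeholder and the grouping part is a simplified stand-in for the script's group_texts helper:

    from transformers import AutoTokenizer

    # Placeholder checkpoint, only for illustration.
    tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-cased")
    max_seq_length = 128
    lines = ["Prima frase del corpus.", "Seconda frase, un po' più lunga."]

    # --line_by_line: every line becomes its own example, padded/truncated to
    # max_seq_length, so lots of short lines mean lots of wasted padding.
    line_by_line_examples = tokenizer(
        lines, padding="max_length", truncation=True, max_length=max_seq_length
    )

    # Default mode: tokenize everything, concatenate, and cut into full blocks
    # (a simplified version of what group_texts does in run_mlm.py).
    tokenized = tokenizer(lines)
    all_ids = [tok for ids in tokenized["input_ids"] for tok in ids]
    total_length = (len(all_ids) // max_seq_length) * max_seq_length
    grouped_examples = [
        all_ids[i : i + max_seq_length] for i in range(0, total_length, max_seq_length)
    ]
    # These two toy lines do not even fill one block; on a real corpus this
    # packs tokens densely instead of padding, which is presumably why it is faster.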

Many thanks for your patience

OK, so regarding point 2 (loading several files at once), I tried to tweak the code a bit (around line 280 of the original run_mlm.py script).

ORIGINAL CODE:

    else:
        data_files = {}
        if data_args.train_file is not None:
            data_files["train"] = data_args.train_file
            extension = data_args.train_file.split(".")[-1]
        if data_args.validation_file is not None:
            data_files["validation"] = data_args.validation_file
            extension = data_args.validation_file.split(".")[-1]
        if extension == "txt":
            extension = "text"
        raw_datasets = load_dataset(extension, data_files=data_files, cache_dir=model_args.cache_dir)

MY EDITS:

    else:
        data_files = {}
        if data_args.train_file is not None:
            # Treat --train_file as a directory and collect every file inside it;
            # os.path.join avoids relying on a trailing slash in the argument.
            files_full_path = [
                os.path.join(data_args.train_file, filepath)
                for filepath in os.listdir(data_args.train_file)
            ]
            data_files["train"] = files_full_path
            extension = files_full_path[0].split(".")[-1]
        if data_args.validation_file is not None:
            data_files["validation"] = data_args.validation_file
            extension = data_args.validation_file.split(".")[-1]
        if extension == "txt":
            extension = "text"
        raw_datasets = load_dataset(extension, data_files=data_files, cache_dir=model_args.cache_dir)

Do you think this approach makes sense? I am running the code right now and it seems to work. My concern is that when I load the entire dataset (several GB of text), the training will crash or take extremely long (several weeks).
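
For reference, here is a minimal sketch of the simpler alternative I am also considering, assuming (as the datasets documentation states) that load_dataset accepts a list of file paths per split in data_files. The directory path below is a placeholder:

    import glob
    import os

    from datasets import load_dataset

    # Placeholder directory containing the corpus, split across many .txt files.
    corpus_dir = "/path/to/biomedical_corpus"
    train_files = sorted(glob.glob(os.path.join(corpus_dir, "*.txt")))

    # data_files takes a list of paths per split, so my edit above is really
    # just building this list from the directory passed as --train_file.
    raw_datasets = load_dataset("text", data_files={"train": train_files})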

Thanks everyone