Script run_mlm.py line by line

Hello there

I am trying to run the run_mlm.py script to train a BERT model. The idea is to start from an existing Italian BERT model, continue its pretraining on domain-specific text (biomedical texts), and later fine-tune it on a question-answering downstream task.

I was able to run the run_mlm.py script, both with and without the --line_by_line parameter. I have a few questions; if you could kindly answer them or point me to the relevant part of the documentation:

  1. The run with --line_by_line took roughly 10x longer than the run without it. Why is that? I have full access to the complete dataset and can organize it however I want, so which format is best? (I sketch my understanding of the two preprocessing modes right after this list.)

  2. Is there a way to feed the script multiple files, since my corpus is split across several of them?

  3. Does this script train the model for the NSP task as well?

  4. If I evaluate the model, I get the perplexity score. Is there a way to get accuracy for the NSP task? (I think that accuracy does not make sense for MLM, right?)
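
To make question 1 concrete, here is a minimal sketch (not the actual run_mlm.py code) of how I understand the two preprocessing modes; the checkpoint name is just a placeholder and the grouping part is a simplified stand-in for the script's group_texts helper:

    from transformers import AutoTokenizer

    # Placeholder checkpoint, only for illustration.
    tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-cased")
    max_seq_length = 128
    lines = ["Prima frase del corpus.", "Seconda frase, un po' più lunga."]

    # --line_by_line: every line becomes its own example, padded/truncated to
    # max_seq_length, so lots of short lines mean lots of wasted padding.
    line_by_line_examples = tokenizer(
        lines, padding="max_length", truncation=True, max_length=max_seq_length
    )

    # Default mode: tokenize everything, concatenate, and cut into full blocks
    # (a simplified version of what group_texts does in run_mlm.py).
    tokenized = tokenizer(lines)
    all_ids = [tok for ids in tokenized["input_ids"] for tok in ids]
    total_length = (len(all_ids) // max_seq_length) * max_seq_length
    grouped_examples = [
        all_ids[i : i + max_seq_length] for i in range(0, total_length, max_seq_length)
    ]
    # These two toy lines do not even fill one block; on a real corpus this
    # packs tokens densely instead of padding, which is presumably why it is faster.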

Many thanks for your patience

OK, so regarding point 2 (loading several files at once), I tried to tweak the code a bit (around line 280 of the original run_mlm.py script).

ORIGINAL CODE:

    else:
        data_files = {}
        if data_args.train_file is not None:
            data_files["train"] = data_args.train_file
            extension = data_args.train_file.split(".")[-1]
        if data_args.validation_file is not None:
            data_files["validation"] = data_args.validation_file
            extension = data_args.validation_file.split(".")[-1]
        if extension == "txt":
            extension = "text"
        raw_datasets = load_dataset(extension, data_files=data_files, cache_dir=model_args.cache_dir)

MY EDITS:

    else:
        data_files = {}
        if data_args.train_file is not None:
            # Treat --train_file as a directory and collect every file inside it;
            # os.path.join avoids relying on a trailing slash in the argument.
            files_full_path = [
                os.path.join(data_args.train_file, filepath)
                for filepath in os.listdir(data_args.train_file)
            ]
            data_files["train"] = files_full_path
            extension = files_full_path[0].split(".")[-1]
        if data_args.validation_file is not None:
            data_files["validation"] = data_args.validation_file
            extension = data_args.validation_file.split(".")[-1]
        if extension == "txt":
            extension = "text"
        raw_datasets = load_dataset(extension, data_files=data_files, cache_dir=model_args.cache_dir)

Do you think this approach makes sense? I am running the code right now and it seems to work. My concern is that when I load the entire dataset (several GB of text), the training will crash or take extremely long (several weeks).
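
For reference, here is a minimal sketch of the simpler alternative I am also considering, assuming (as the datasets documentation states) that load_dataset accepts a list of file paths per split in data_files. The directory path below is a placeholder:

    import glob
    import os

    from datasets import load_dataset

    # Placeholder directory containing the corpus, split across many .txt files.
    corpus_dir = "/path/to/biomedical_corpus"
    train_files = sorted(glob.glob(os.path.join(corpus_dir, "*.txt")))

    # data_files takes a list of paths per split, so my edit above is really
    # just building this list from the directory passed as --train_file.
    raw_datasets = load_dataset("text", data_files={"train": train_files})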

Thanks everyone