Sorry for the trouble, but I’ve had some problems trying to get BertTokenizerFast to work with my CSV data. I’ll summarize as best I can.
End-Goal: A fine-tuned BERT model that can take in the assembly of an executable (GCC, x86, Intel syntax in this specific case, so C/C++-based software) and categorize it by what potential common exploits exist.
Data: Megavul, processed through several scripts to extract both the vulnerable and patched assembly instructions for those specific repos. The files containing the vulnerabilities will be used for training the model, but I also have object files for every other built file in those repos, and I figured those could serve as the corpus. I have two corpora, one ~100 GB and the other ~10 GB, stored as single-column .csv files. Both produce the same issue.
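For context, here’s roughly how I peek at one of these files without loading it fully; streaming keeps the ~100 GB file out of RAM (the file name below is just a placeholder, but func_text is the real column name my script reads):

import itertools
from datasets import load_dataset

# streaming=True reads rows lazily, so peeking at a huge CSV is cheap
corpus = load_dataset("csv", data_files="corpus.csv", streaming=True)["train"]
for row in itertools.islice(corpus, 3):
    print(repr(row["func_text"])[:120])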
Issue: While training the BertTokenizerFast, I’ve been having issues during the pre-processing sequence. I can’t find much information about what might be causing this, and the few similar reports I’ve seen involve people with much more powerful hardware than mine. Below is my script.
import os

from datasets import load_dataset
from transformers import BertTokenizerFast

# json_operations and the TOKENIZER_DIR / TOKENIZED_FILE / BERT_TOKENIZER
# constants are defined elsewhere in my project.
def train_tokenizer():
    dest_path = json_operations.get_file_path(TOKENIZER_DIR)
    csv_path = os.path.join(dest_path, TOKENIZED_FILE)
    dataset = load_dataset("csv", data_files=csv_path)

    # Yield the corpus lazily in batches of 100 rows so the trainer
    # is never handed one giant list of strings.
    def get_training_corpus():
        return (
            dataset["train"][i : i + 100]["func_text"]
            for i in range(0, len(dataset["train"]), 100)
        )

    training_corpus = get_training_corpus()

    # Start from the pretrained BERT tokenizer and retrain its vocabulary
    # on the assembly corpus, keeping the original vocab size.
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased", do_lower_case=False)
    tokenizer = tokenizer.train_new_from_iterator(
        training_corpus,
        vocab_size=30522,
    )

    bert_tokenizer_path = os.path.join(dest_path, BERT_TOKENIZER)
    tokenizer.save_pretrained(bert_tokenizer_path)
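For what it’s worth, here is a variant I’ve been meaning to try that streams the CSV so the whole file is never indexed in memory. I can’t say whether the trainer’s own pre-processing buffers would still blow up, so treat it as a sketch of the direction rather than a known fix (csv_path and batch_size are parameters I made up for the sketch):

from datasets import load_dataset
from transformers import BertTokenizerFast

def train_tokenizer_streaming(csv_path, batch_size=100):
    # streaming=True returns an IterableDataset: rows are read lazily
    # instead of the entire CSV being loaded and indexed up front.
    rows = load_dataset("csv", data_files=csv_path, streaming=True)["train"]

    def batched_corpus():
        batch = []
        for row in rows:
            if row["func_text"]:  # skip empty/None cells
                batch.append(row["func_text"])
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch

    base = BertTokenizerFast.from_pretrained("bert-base-uncased", do_lower_case=False)
    return base.train_new_from_iterator(batched_corpus(), vocab_size=30522)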
Below is the output/error message (the failed allocation of 17179869184 bytes is exactly 16 GiB, half my total RAM):
[00:04:31] Pre-processing sequences ████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 38531 / 0memory allocation of 17179869184 bytes failed
Hardware: I only have this one computer, and do not have access to my university’s machines right now.
CPU: AMD Ryzen 7 7700X 8-Core Processor, 4501 MHz, 8 cores, 16 logical processors
GPU: Nvidia GeForce RTX 4070 (I thought it was the Super variant, but it might not be)
RAM: 32 GB
I’ll be honest, the issue may just be that my hardware isn’t up to snuff for this process, and I may need to skip this step and take the plunge with the data I do have. I do not have much time left.