Sorry for the trouble, but I’ve had some problems trying to get BertTokenizerFast to work with my CSV data. I’ll summarize as best I can.
End-Goal: A fine-tuned BERT model that can take in the assembly of an executable (GCC, x86, Intel syntax in this specific case, so C/C++-based software) and categorize it by what potential common exploits exist.
Data: Megavul, processed through several scripts to extract both the vulnerable and patched assembly instructions for those specific repos. The files containing the vulnerabilities will be used for training the model, but I also have object files for every other built file in those repos, and I figured those could serve as the corpus. I have two corpora, one ~100 GB and the other ~10 GB, stored as single-column .csv files. Both produce the same issue.
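For context, here’s roughly how I peek at one of these files without loading it fully; streaming keeps the ~100 GB file out of RAM (the file name below is just a placeholder, but func_text is the real column name my script reads):

import itertools
from datasets import load_dataset

# streaming=True reads rows lazily, so peeking at a huge CSV is cheap
corpus = load_dataset("csv", data_files="corpus.csv", streaming=True)["train"]
for row in itertools.islice(corpus, 3):
    print(repr(row["func_text"])[:120])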
Issue: While training the BertTokenizerFast, I’ve been having issues during the pre-processing sequence. I can’t find much information about what might be causing this, and the few similar reports I’ve seen involve people with much more powerful hardware than mine. Below is my script.
import os

from datasets import load_dataset
from transformers import BertTokenizerFast

# json_operations and the TOKENIZER_DIR / TOKENIZED_FILE / BERT_TOKENIZER
# constants are defined elsewhere in my project.
def train_tokenizer():
    dest_path = json_operations.get_file_path(TOKENIZER_DIR)
    csv_path = os.path.join(dest_path, TOKENIZED_FILE)
    dataset = load_dataset("csv", data_files=csv_path)

    # Yield the corpus lazily in batches of 100 rows so the trainer
    # is never handed one giant list of strings.
    def get_training_corpus():
        return (
            dataset["train"][i : i + 100]["func_text"]
            for i in range(0, len(dataset["train"]), 100)
        )

    training_corpus = get_training_corpus()

    # Start from the pretrained BERT tokenizer and retrain its vocabulary
    # on the assembly corpus, keeping the original vocab size.
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased", do_lower_case=False)
    tokenizer = tokenizer.train_new_from_iterator(
        training_corpus,
        vocab_size=30522,
    )

    bert_tokenizer_path = os.path.join(dest_path, BERT_TOKENIZER)
    tokenizer.save_pretrained(bert_tokenizer_path)
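For what it’s worth, here is a variant I’ve been meaning to try that streams the CSV so the whole file is never indexed in memory. I can’t say whether the trainer’s own pre-processing buffers would still blow up, so treat it as a sketch of the direction rather than a known fix (csv_path and batch_size are parameters I made up for the sketch):

from datasets import load_dataset
from transformers import BertTokenizerFast

def train_tokenizer_streaming(csv_path, batch_size=100):
    # streaming=True returns an IterableDataset: rows are read lazily
    # instead of the entire CSV being loaded and indexed up front.
    rows = load_dataset("csv", data_files=csv_path, streaming=True)["train"]

    def batched_corpus():
        batch = []
        for row in rows:
            if row["func_text"]:  # skip empty/None cells
                batch.append(row["func_text"])
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch

    base = BertTokenizerFast.from_pretrained("bert-base-uncased", do_lower_case=False)
    return base.train_new_from_iterator(batched_corpus(), vocab_size=30522)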
Below is the output/error message (the failed allocation of 17179869184 bytes is exactly 16 GiB, half my total RAM):
[00:04:31] Pre-processing sequences ████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ 38531 / 0memory allocation of 17179869184 bytes failed
Hardware: I only have this one computer, and do not have access to my university’s machines right now.
CPU: AMD Ryzen 7 7700X 8-Core Processor, 4501 MHz, 8 cores, 16 logical processors
GPU: Nvidia GeForce RTX 4070 (I thought it was the Super variant, but it might not be)
RAM: 32 GB
I’ll be honest, the issue may just be that my hardware isn’t up to snuff for this process, and I may need to skip this step and take the plunge with the data I do have. I do not have much time left.