I am attempting to fine-tune llama-7b with LoRA on C4's realnewslike subset.
Following the tutorial, I load the C4 dataset with load_dataset and then use dataset.map to tokenize the text. The code looks like this:
However, I found that processing C4 this way takes up to 3 hours!
I was wondering if there is a faster way I can do this…
You can speed up tokenization by returning NumPy arrays instead of Python lists.
The map() function converts whatever the callback returns into a PyArrow-supported format. Returning the tensors as NumPy arrays is faster because NumPy is natively supported by PyArrow, so the per-element conversion from Python lists is skipped. Set
return_tensors="np" when you tokenize your text:
dataset = dataset.map(lambda examples: tokenizer(examples["text"], return_tensors="np"), batched=True)
source: Process text data