Smarter way to load C4 dataset

I am attempting to fine-tune llama-7b with LoRA on c4’s realnewslike subset.
Following the tutorial, I load the dataset with load_dataset and then use dataset.map to tokenize the text. The code looks roughly like this (the checkpoint name below is a placeholder):
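
from datasets import load_dataset
from transformers import AutoTokenizer

# placeholder checkpoint; substitute the LLaMA-7B weights you are actually using
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")

dataset = load_dataset("allenai/c4", "realnewslike", split="train")
dataset = dataset.map(lambda examples: tokenizer(examples["text"]), batched=True)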

However, I found that tokenizing c4 this way takes up to 3 hours!
I was wondering if there’s a faster way to do this…

You can speed up tokenization by returning NumPy arrays instead of Python lists

The map() function converts whatever your function returns into a PyArrow-supported format. Explicitly returning the tokens as NumPy arrays is faster than returning Python lists, because NumPy is a natively supported PyArrow format, so the conversion step is much cheaper. Set return_tensors="np" when you tokenize your text:

dataset = dataset.map(lambda examples: tokenizer(examples["text"], return_tensors="np"), batched=True)
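
One caveat: NumPy arrays are rectangular, so if a batch contains sequences of different lengths the tokenizer cannot stack them into a single array. A minimal sketch that handles this with padding and truncation (the max_length value and the num_proc count are assumptions; tune them for your model and machine):

def tokenize(examples):
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=512,       # assumption: pick a length that fits your model's context
        return_tensors="np",  # NumPy output avoids the Python-list-to-Arrow conversion
    )

dataset = dataset.map(tokenize, batched=True, num_proc=8)  # num_proc spreads tokenization across CPU cores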

source: Process text data