I am attempting to fine-tune llama-7b with LoRA on C4's realnewslike subset.
Following the tutorial, I load the C4 dataset with load_dataset and then use dataset.map to tokenize the text. The code looks like this:
However, I found that processing C4 this way takes up to 3 hours!
I was wondering if there is a faster way I can do this…
You can speed up tokenization by returning NumPy arrays instead of Python lists.
The map() function converts whatever the callback returns into a PyArrow-supported format. Returning the tensors as NumPy arrays is faster because NumPy is natively supported by PyArrow, so the per-element conversion from Python lists is skipped. Set
return_tensors="np" when you tokenize your text:
dataset = dataset.map(lambda examples: tokenizer(examples["text"], return_tensors="np"), batched=True)
source: Process text data