Smarter way to load C4 dataset

hoangd96 · November 5, 2023, 12:09am

I am attempting to fine-tune llama-7b with LoRA on c4’s realnewslike dataset.
Following the tutorial, after loading the c4 dataset with load_dataset, I’m using dataset.map to process the data into tokens. The code looks like this:

However, I found that processing c4 will take up to 3 hours!
I was wondering if there is a faster way I can do this…

lhoestq · November 6, 2023, 10:56am

You can speed up tokenization by returning NumPy arrays instead of Python lists.

The map() function converts the returned values to a PyArrow-supported format. But explicitly returning the tensors as NumPy arrays is faster because it is a natively supported PyArrow format. Set return_tensors="np" when you tokenize your text:
dataset = dataset.map(lambda examples: tokenizer(examples["text"], return_tensors="np"), batched=True)

source: Process text data
