How to handle big data?

I am trying to use datasets.load_dataset to load multiple big files from disk. I noticed that it can merge the content of the files into a single dataset. I then call datasets.map to tokenize the dataset.
My question is: if the total content of the files is much bigger than RAM, what will happen? Will the program crash or take a very long time? Can the map be executed lazily during training?

Thank you very much for your reply!

> If the total content of the files is much bigger than RAM, what will happen? Will the program crash or take a very long time?

It shouldn’t crash. map processes and writes data to disk in batches to support transforming datasets bigger than RAM.
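
For example, something like the following sketch (the file names, text column, and checkpoint are placeholders, not from this thread) keeps only roughly one batch of rows in memory at a time, since processed batches are written to the Arrow cache on disk:

from datasets import load_dataset
from transformers import AutoTokenizer

# Hypothetical inputs, just for illustration
raw_datasets = load_dataset("csv", data_files={"train": ["file1.csv", "file2.csv"]})
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=128)

# Processes 1000 rows at a time; each processed chunk is flushed to an
# on-disk Arrow cache file, so peak RAM stays around one batch.
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True, batch_size=1000)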

> Can the map be executed lazily during training?

You can use set_transform instead of map for lazy execution.
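
A minimal sketch of how that looks (the dataset, tokenizer, and "text" column here are placeholder assumptions); nothing is tokenized up front, the transform runs each time rows are read:

def encode(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

# Applied on the fly whenever rows are accessed, e.g. dataset[0]
# or inside the DataLoader during training.
dataset.set_transform(encode)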

Thank you for your reply!
I tried set_transform, but the resulting dataset does not work with Trainer.train. It seems the dataloader cannot load data from the dataset because of an index issue. The exception is shown below:
File "D:\Environment\anaconda\envs\virsyn\lib\site-packages\torch\utils\data\dataloader.py", line 521, in __next__
    data = self._next_data()
File "D:\Environment\anaconda\envs\virsyn\lib\site-packages\torch\utils\data\dataloader.py", line 1203, in _next_data
    return self._process_data(data)
File "D:\Environment\anaconda\envs\virsyn\lib\site-packages\torch\utils\data\dataloader.py", line 1229, in _process_data
    data.reraise()
File "D:\Environment\anaconda\envs\virsyn\lib\site-packages\torch\_utils.py", line 434, in reraise
    raise exception
IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "D:\Environment\anaconda\envs\virsyn\lib\site-packages\torch\utils\data\_utils\worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "D:\Environment\anaconda\envs\virsyn\lib\site-packages\torch\utils\data\_utils\fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "D:\Environment\anaconda\envs\virsyn\lib\site-packages\torch\utils\data\_utils\fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "D:\Environment\anaconda\envs\virsyn\lib\site-packages\datasets\arrow_dataset.py", line 2155, in __getitem__
    key,
  File "D:\Environment\anaconda\envs\virsyn\lib\site-packages\datasets\arrow_dataset.py", line 2138, in _getitem
    pa_subtable = query_table(self._data, key, indices=self._indices if self._indices is not None else None)
  File "D:\Environment\anaconda\envs\virsyn\lib\site-packages\datasets\formatting\formatting.py", line 486, in query_table
    _check_valid_index_key(key, size)
  File "D:\Environment\anaconda\envs\virsyn\lib\site-packages\datasets\formatting\formatting.py", line 429, in _check_valid_index_key
    raise IndexError(f"Invalid key: {key} is out of bounds for size {size}")
IndexError: Invalid key: 293 is out of bounds for size 0

Debugging an error like this is hard without a reproducible example. Can you share the code that instantiates the dataset and the trainer? Feel free to share a dummy dataset (with the same structure) to keep the data private.

To provide a lazy tokenize function, I extracted the tokenization logic into a class:

> class LazyTokenize:
>     def __init__(self, tokenizer, text_column_name, padding, max_seq_length):
>         self.tokenizer = tokenizer
>         self.text_column_name = text_column_name
>         self.padding = padding
>         self.max_seq_length = max_seq_length
> 
>     def tokenize_function(self, examples):
>         # Remove empty lines
>         examples[self.text_column_name] = [
>             line for line in examples[self.text_column_name] if len(line) > 0 and not line.isspace()
>         ]
>         return self.tokenizer(
>             examples[self.text_column_name],
>             padding=self.padding,
>             truncation=True,
>             max_length=self.max_seq_length,
>             return_special_tokens_mask=True,
>         )

Then I load the dataset like below:

> data_files={'train': ['../data/pretrain/train\\file1.csv', '../data/pretrain/train\\file2.csv', '../data/pretrain/train\\file3.csv'], 'validation': ['../data/pretrain/eval\\file1.csv', '../data/pretrain/eval\\file2.csv', '../data/pretrain/eval\\file3.csv']}
> raw_datasets = load_dataset(extension, data_files=data_files, cache_dir=model_args.cache_dir)
> lazy_tokenizer = LazyTokenize(tokenizer, text_column_name, padding, max_seq_length)
> tokenized_datasets = raw_datasets.with_transform(lazy_tokenizer.tokenize_function)

Then I use the Trainer from transformers; the call is like below:

> train_dataset = tokenized_datasets["train"]
> eval_dataset = tokenized_datasets["validation"]
> trainer = Trainer(
>         model=model,
>         args=training_args,
>         train_dataset=train_dataset if training_args.do_train else None,
>         eval_dataset=eval_dataset if training_args.do_eval else None,
>         tokenizer=tokenizer,
>         data_collator=data_collator,
>         callbacks=[early_stopping_callback]
>     )
> 
> trainer.train(resume_from_checkpoint=checkpoint)

Before switching to with_transform, the datasets.map function worked well with this tokenize function.

Thank you very much for your help!

The number of output examples should match the number of input examples in the transform function, so instead of removing empty lines with:

examples[self.text_column_name] = [
    line for line in examples[self.text_column_name] if len(line) > 0 and not line.isspace()
]

run a filter before with_transform like so:

raw_datasets = raw_datasets.filter(lambda ex: len(ex[text_column_name]) > 0 and not ex[text_column_name].isspace())
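
For completeness, a rough sketch of the revised transform once the empty-line removal is moved out (reusing the names from the snippets above); it returns exactly one output example per input example, which is what with_transform expects:

def tokenize_function(self, examples):
    # No filtering here: with_transform must keep a 1:1 mapping
    # between input and output examples.
    return self.tokenizer(
        examples[self.text_column_name],
        padding=self.padding,
        truncation=True,
        max_length=self.max_seq_length,
        return_special_tokens_mask=True,
    )

tokenized_datasets = raw_datasets.with_transform(lazy_tokenizer.tokenize_function)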

Or, even better, use map instead of with_transform if there is enough disk space for the cache file (tokenization is the same across the epochs, so it’s more efficient to do it once than lazily each epoch):

def tokenize_function(self, examples):
    # Remove empty lines
    examples[self.text_column_name] = [
        line for line in examples[self.text_column_name] if len(line) > 0 and not line.isspace()
    ]
    return self.tokenizer(
        examples[self.text_column_name],
        padding=self.padding,
        truncation=True,
        max_length=self.max_seq_length,
        return_special_tokens_mask=True,
        return_tensors="np",  # makes converting to Arrow faster (needed for caching)
    )

# map with `batched=True` can return less/more examples than there are in the input
tokenized_datasets = raw_datasets.map(lazy_tokenizer.tokenize_function, batched=True, remove_columns=raw_datasets["train"].column_names)

It may be related to IndexError: Invalid key: 16 is out of bounds for size 0 - #3 by Isma

Thank you all for your help!
I decided not to use a transform to implement lazy tokenization.
But now I am facing another problem: the data is too big, and the tokenization and DataCollator steps before training take too much time!
Does Hugging Face Transformers support distributing this data processing across many machines to speed it up?
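
For reference, a rough and unofficial sketch of the kind of setup this question is about: datasets can at least parallelize map across local cores with num_proc, and Dataset.shard can split the data so that each machine processes only its own piece (the shard count, paths, checkpoint, and script argument below are hypothetical):

import sys
from datasets import load_dataset
from transformers import AutoTokenizer

machine_index = int(sys.argv[1])  # e.g. 0..3 when run on four machines (hypothetical)
num_machines = 4

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint
raw = load_dataset("csv", data_files={"train": ["file1.csv"]})["train"]  # placeholder file

def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=128)

# Each machine tokenizes only its own shard, using several local processes.
shard = raw.shard(num_shards=num_machines, index=machine_index)
tokenized_shard = shard.map(tokenize_function, batched=True, num_proc=8)

# Save the processed shard; shards can later be reloaded with load_from_disk
# and joined with concatenate_datasets on one machine before training.
tokenized_shard.save_to_disk(f"tokenized_shard_{machine_index}")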