Understanding DataCollation

[Cross-posting from the PyTorch forum]

I’m interested in the CaseHOLD dataset, a multiple-choice task à la SWAG in which each example
includes five separate possible answers. I use a preprocessing function like this:

def preprocess_function(examples,mapped=True):
        
    # Repeat the first sentence NMultChoice times to pair with the NMultChoice possible second sentences.
    first_sentences = [[context] * NMultChoice for context in examples[Context_name]]
    
    # Grab all second sentences possible for each context.
    second_sentences = [ [ f"{examples[str(i+1)][0]}" for i in range(NMultChoice) ] for context in examples[Context_name]]
    
    # Flatten out
    first_sentences = sum(first_sentences, [])
    second_sentences = sum(second_sentences, [])
    
    # After flattening, the tokenizer returns a flat list over all (example, choice) pairs
    # for each key; each entry is a list of input IDs of varying length
    # (since we did not apply any padding).
    tokenized_examples = tokenizer(first_sentences, second_sentences, truncation=True)
    
    # Un-flatten so that each example has NMultChoice input ids, attentions masks, etc.
    tokDict = {k: [v[i : i + NMultChoice] for i in range(0, len(v), NMultChoice)] for k, v in tokenized_examples.items()}
        
    tokDict['label'] = examples[Label_name]
    tokDict['idx'] = examples[Idx_name]
    # capture casehold scores
    tokDict['scores'] = {f'{i}': examples[f'{i+6}'] for i in range(NMultChoice)}
        
    return tokDict
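
For reference, NMultChoice, Context_name, Label_name, and Idx_name are module-level constants defined earlier in my script. The values below are only an illustrative sketch of my CSV layout (NMultChoice = 5 is real; the column names are hypothetical stand-ins):

# Constants assumed by preprocess_function (column names here are illustrative only).
NMultChoice = 5            # CaseHOLD offers five candidate holdings per citing context
Context_name = '0'         # CSV column holding the citing context (first sentence) -- illustrative
Label_name = '11'          # CSV column holding the index of the correct holding -- illustrative
Idx_name = 'Unnamed: 0'    # row-index column written out with the CSV -- illustrative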

and generate a processed dataset like this:

ds = datasets.load_dataset(csvDir,data_files=dsFiles, download_mode="force_redownload")
ds2 = ds.map(preprocess_function)
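
For comparison, the SWAG tutorial this is modeled on applies its preprocessing in batched mode; I’m not certain whether the non-batched call above yields the same per-example structure:

# How the SWAG tutorial maps the same style of preprocessing (batched mode).
ds2 = ds.map(preprocess_function, batched=True)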

The tokenized dataset has features that look like this:

	Dataset({
	features: ['Unnamed: 0', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', 'input_ids', 'token_type_ids', 'attention_mask', 'label', 'idx', 'scores'],
	num_rows: 5314
	})
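
A side note: the mapped dataset keeps all of the raw CSV columns alongside the tokenizer output. If that turns out to matter, I believe map can drop them via remove_columns, roughly:

# Drop the raw CSV columns so only the tokenized fields, label, idx, and scores remain.
raw_columns = ['Unnamed: 0'] + [str(i) for i in range(12)]
ds2 = ds.map(preprocess_function, remove_columns=raw_columns)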

E.g., focusing on the input_ids generated by the tokenizer, it is a list of lists of varying lengths:

	iid = eval_dataset['input_ids']
	iid00 = iid[0][0]
	for l in iid00: print (len(l))

	> 224
	> 225
	> 229
	> 228
	> 224
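
To make the structure explicit, here is a small inspection helper (hypothetical, not part of the pipeline) that reports how many list levels wrap the innermost token IDs; for a multiple-choice collator I would expect each example’s input_ids to be two levels deep, [num_choices][seq_len]:

def nesting_depth(x):
    """Count how many list levels wrap the innermost elements (inspection only)."""
    depth = 0
    while isinstance(x, list) and x:
        depth += 1
        x = x[0]
    return depth

# Expected: 2 ([num_choices][seq_len]); the lengths printed above suggest an extra level.
print(nesting_depth(eval_dataset[0]['input_ids']))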

Also, because of the multiple-choice task, I’m using a DataCollator modeled after the SWAG example:

from dataclasses import dataclass
from itertools import chain
from typing import Optional, Union

import torch
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy


@dataclass
class DataCollatorForMultipleChoice:

    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features):
        label_name = "label" if "label" in features[0].keys() else Label_name  # "labels"
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]["input_ids"])

        # Build NMultChoice feature dicts per example, then flatten across examples.
        flattened_features = []
        for feature in features:
            mclist = []
            for i in range(NMultChoice):
                mcdict = {}
                mcdict['input_ids'] = feature['input_ids']
                mcdict['token_type_ids'] = feature['token_type_ids']
                mcdict['attention_mask'] = feature['attention_mask']
                mclist.append(mcdict)
            flattened_features.append(mclist)

        flattened_features = list(chain(*flattened_features))

        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )

        # Un-flatten
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        # Add back labels
        batch["labels"] = torch.tensor(labels, dtype=torch.int64)
        return batch
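
To take the Trainer out of the picture, I can also call the collator directly on a couple of mapped examples (a minimal sketch, assuming eval_dataset is the mapped eval split shown above):

# Minimal standalone check: feed two mapped examples straight to the collator.
collator = DataCollatorForMultipleChoice(tokenizer=tokenizer, pad_to_multiple_of=8)
features = [eval_dataset[i] for i in range(2)]
batch = collator(features)
print({k: tuple(v.shape) for k, v in batch.items()})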

This DataCollator is then passed when constructing the Trainer, specifically with pad_to_multiple_of=8, and the Trainer's eval dataloader is then used to create batches:

    data_collator = DataCollatorForMultipleChoice(tokenizer=tokenizer, pad_to_multiple_of=8),

    eval_loader = trainer.get_eval_dataloader(eval_dataset)
    for batch in eval_loader:
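
For completeness, the collator is wired into the Trainer roughly like this (a sketch; model, training_args, and the train/eval datasets are defined elsewhere in my script):

    # Rough shape of the Trainer setup (model and training_args defined elsewhere).
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,
        data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer, pad_to_multiple_of=8),
    )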

Now I am getting this exception:

	File "/Users/rik/Code/eclipse/ai4law/src/casehold_demo-v2.py", line 444, in main
	for batch in eval_loader:
	File ".../lib/python3.11/site-packages/accelerate/data_loader.py", line 452, in __iter__
	current_batch = next(dataloader_iter)
	^^^^^^^^^^^^^^^^^^^^^
	File ".../lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
	data = self._next_data()
	^^^^^^^^^^^^^^^^^
	File ".../lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 674, in _next_data
	data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	File ".../lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 57, in fetch
	return self.collate_fn(data)
	^^^^^^^^^^^^^^^^^^^^^
	File "/Users/rik/Code/eclipse/ai4law/src/casehold_demo-v2.py", line 157, in __call__
	batch = self.tokenizer.pad(
	^^^^^^^^^^^^^^^^^^^
	File ".../lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3380, in pad
	return BatchEncoding(batch_outputs, tensor_type=return_tensors)
	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	File ".../lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 224, in __init__
	self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)
	File ".../lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 775, in convert_to_tensors
	raise ValueError(
	ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`input_ids` in this case) have excessive nesting (inputs type `list` where type `int` is expected).

I’m doing truncation as part of tokenization and padding the batch examples with pad_to_multiple_of=8. Would you know why I am getting this error?