Tokenizer to Dataset to DataCollator

Hi, I'm having trouble finding out how to get from the output of
inputs = tokenizer([text], return_tensors="pt")
to a Hugging Face dataset. Every time a dataset is mentioned, it's loaded with load_dataset. I want to tokenize with the line above and then create the dataset, instead of using .map after calling load_dataset on Parquet files. Can I do that? Alternatively, if you could point me to how to prepare the dataset for MLM when it's to be used by a DataCollator, that would help too (i.e., how to properly make labels from input_ids, whether I'm right to leave input_ids unmasked if they're going into the DataCollator, whether I need to remove any columns, etc.). There's a sketch of what I mean just below.
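To illustrate, this is roughly what I have in mind (just a sketch; bert-base-uncased and the example texts are placeholders, and I'm guessing I should drop return_tensors so the Dataset stores plain lists):

```python
from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder model
texts = ["a short example", "a somewhat longer example sentence"]

# Tokenize without padding and without return_tensors; the Dataset stores
# plain Python lists, and padding would happen later, per batch.
encodings = tokenizer(texts, truncation=True)

# BatchEncoding is dict-like (input_ids, attention_mask, ...), so it can
# be turned into a Dataset directly.
dataset = Dataset.from_dict(dict(encodings))
print(dataset)
```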
I'm also not clear on how to use a DataCollator with MLM. I don't want to pad to a max_length; I want padding to be handled separately for each batch, to the longest sequence in that batch. How do I do that? When I set pad_to_multiple_of to 8, I still get an error when trying to run the trainer:
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.
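For context, this is roughly how I'm setting up the collator (the model name is a placeholder); I expected pad_to_multiple_of=8 to pad each batch to its longest length, rounded up to a multiple of 8:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder

# As I understand it, the collator handles both the MLM masking and the
# per-batch padding; pad_to_multiple_of rounds the padded length up to 8.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
    pad_to_multiple_of=8,
)
```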
Thanks a lot!
PS Is there an English version of this notebook? notebooks/training.ipynb at main · huggingface/notebooks · GitHub


Hi!

This tutorial explains how to prepare input data for MLM. Additionally, we have this code packaged in scripts here (run_mlm.py and run_mlm_no_trainer.py). A condensed sketch of that flow is below.
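In short, the flow looks something like this (a condensed sketch of what the tutorial and run_mlm.py do; the model and column names are placeholders). You leave input_ids unmasked, keep only the tokenizer's output columns, and let DataCollatorForLanguageModeling pad each batch to its longest sequence and create the labels by masking:

```python
from datasets import Dataset
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder

raw = Dataset.from_dict({"text": ["a short line", "a somewhat longer line of text"]})

# Tokenize without padding and without return_tensors; remove_columns drops
# the raw string column, which would otherwise break tensor creation in the
# Trainer (the ValueError above).
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True),
    batched=True,
    remove_columns=["text"],
)

# The collator pads every batch to its longest sequence (optionally rounded
# up to a multiple of 8) and builds the MLM labels from input_ids, so you
# never construct labels yourself.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15, pad_to_multiple_of=8
)

batch = collator([tokenized[i] for i in range(len(tokenized))])
print(batch["input_ids"].shape, batch["labels"].shape)  # same padded length
```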

> PS Is there an English version of this notebook? notebooks/training.ipynb at main · huggingface/notebooks · GitHub

This should be fixed now.