Tokenizer to Dataset to DataCollator

Hi, I'm having trouble finding out how to get from the output of
inputs = tokenizer([text], return_tensors="pt")
to a Hugging Face dataset. Every time a dataset is mentioned, it's loaded with load_dataset. I want to tokenize with the line above and then create the dataset, instead of using .map after calling load_dataset on Parquet files. Can I do that? Alternatively, if you could point me to how to prepare the dataset for MLM when it's to be used by a DataCollator, that would help too (i.e., how to properly make labels from input_ids, whether I'm right to leave input_ids unmasked if they're going into the DataCollator, whether I need to remove any columns, etc.). There's a sketch of what I mean just below.
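To illustrate, this is roughly what I have in mind (just a sketch; bert-base-uncased and the example texts are placeholders, and I'm guessing I should drop return_tensors so the Dataset stores plain lists):

```python
from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder model
texts = ["a short example", "a somewhat longer example sentence"]

# Tokenize without padding and without return_tensors; the Dataset stores
# plain Python lists, and padding would happen later, per batch.
encodings = tokenizer(texts, truncation=True)

# BatchEncoding is dict-like (input_ids, attention_mask, ...), so it can
# be turned into a Dataset directly.
dataset = Dataset.from_dict(dict(encodings))
print(dataset)
```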
I'm also not clear on how to use a DataCollator with MLM. I don't want to pad to a max_length; I want padding to be handled separately for each batch, to the longest sequence in that batch. How do I do that? When I set pad_to_multiple_of to 8, I still get an error when trying to run the trainer:
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.
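For context, this is roughly how I'm setting up the collator (the model name is a placeholder); I expected pad_to_multiple_of=8 to pad each batch to its longest length, rounded up to a multiple of 8:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder

# As I understand it, the collator handles both the MLM masking and the
# per-batch padding; pad_to_multiple_of rounds the padded length up to 8.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
    pad_to_multiple_of=8,
)
```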
Thanks a lot!
PS Is there an English version of this notebook? notebooks/training.ipynb at main · huggingface/notebooks · GitHub


Hi!

This tutorial explains how to prepare input data for MLM. Additionally, we have this code packaged in scripts here (run_mlm.py and run_mlm_no_trainer.py). A condensed sketch of that flow is below.
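In short, the flow looks something like this (a condensed sketch of what the tutorial and run_mlm.py do; the model and column names are placeholders). You leave input_ids unmasked, keep only the tokenizer's output columns, and let DataCollatorForLanguageModeling pad each batch to its longest sequence and create the labels by masking:

```python
from datasets import Dataset
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder

raw = Dataset.from_dict({"text": ["a short line", "a somewhat longer line of text"]})

# Tokenize without padding and without return_tensors; remove_columns drops
# the raw string column, which would otherwise break tensor creation in the
# Trainer (the ValueError above).
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True),
    batched=True,
    remove_columns=["text"],
)

# The collator pads every batch to its longest sequence (optionally rounded
# up to a multiple of 8) and builds the MLM labels from input_ids, so you
# never construct labels yourself.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15, pad_to_multiple_of=8
)

batch = collator([tokenized[i] for i in range(len(tokenized))])
print(batch["input_ids"].shape, batch["labels"].shape)  # same padded length
```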

> PS Is there an English version of this notebook? notebooks/training.ipynb at main · huggingface/notebooks · GitHub

This should be fixed now.