Hello,
For context: I am working on a sequence classification task, using a RoBERTa-derived model, which I load like this:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_Mol_seq2seq = AutoModelForSequenceClassification.from_pretrained("model/name", num_labels=8, deterministic_eval=True, trust_remote_code=True)
tokenizer_3 = AutoTokenizer.from_pretrained("model/name", trust_remote_code=True)
Furthermore, I have a large dataset with two columns, text and label, and I need to tokenise the values in the text column.
The tokeniser is wrapped like this:
def tokenize_function(examples, col='text'):
    # pad/truncate every sequence in the text column to a fixed length of 768 tokens
    return tokenizer_3(examples[col], truncation=True, padding='max_length', max_length=768)
If I use the map method to apply the tokeniser to the text column, the size of the dataset explodes to several hundred GB, which I cannot find storage for. Therefore, I want to use .set_transform(tokenize_function) on the dataset instead, so that tokenisation happens on the fly.
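For completeness, this is how I attach the transform (a minimal sketch; the one-row dataset here is a hypothetical stand-in for the real data):

from datasets import Dataset

# hypothetical one-row stand-in with the same columns as the real dataset
dataset = Dataset.from_dict({"text": ["an example sequence"], "label": [3]})

# tokenise lazily on access instead of materialising the tokenised columns
dataset.set_transform(tokenize_function)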
When using the map method with tokenize_function on the dataset (i.e. dataset = dataset.map(tokenize_function)), I get a dataset with the following columns:
Dataset({
    features: ['text', 'label', 'input_ids', 'attention_mask'],
    num_rows: 1
})
which is what I expect.
However, if I use dataset.set_transform(tokenize_function), dataset yields

Dataset({
    features: ['text', 'label'],
    num_rows: 1
})

while dataset[0] = {'input_ids': [0, 4, 9, …], 'attention_mask': [1, 1, 1, …]}, and there is no entry for label.
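To illustrate (a minimal check using the objects from above):

print(dataset)     # schema still lists features ['text', 'label'] — the transform is lazy
print(dataset[0])  # only 'input_ids' and 'attention_mask' come back; 'label' is gone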
When using the dataset with set_transform(…) applied as input to Trainer, I get the following error:
ValueError: The model did not return a loss from the inputs, only the following keys: logits. For reference, the inputs it received are input_ids,attention_mask.
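For reference, this is roughly how the Trainer is set up (a minimal sketch; the TrainingArguments values are placeholders, and model_Mol_seq2seq and dataset are the objects defined above):

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(output_dir="out")  # placeholder output directory

trainer = Trainer(
    model=model_Mol_seq2seq,
    args=training_args,
    train_dataset=dataset,  # dataset with set_transform(tokenize_function) applied
)
trainer.train()  # raises the ValueError quoted above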
What am I doing wrong?
Many thanks in advance.
PS: I know similar questions have been asked, but I couldn't find a recent one that addresses the above specifically.