How to use set_transform when map becomes unfeasible?

Hello,

For context: I am working on a sequence classification task, using a RoBERTa-derived model:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_Mol_seq2seq = AutoModelForSequenceClassification.from_pretrained("model/name", num_labels=8, deterministic_eval=True, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("model/name", trust_remote_code=True)

Furthermore, I have a large dataset with two columns, text and label, and I need to tokenise the values in the text column.
The tokeniser is wrapped like this:

def tokenize_function(examples, col='text'):
    return tokenizer(examples[col], truncation=True, padding='max_length', max_length=768)

If I use the map method to apply the tokeniser to the text column, the size of the dataset explodes to several hundred GB, which I cannot find storage for. Therefore, I want to use .set_transform(tokenize_function) on the dataset instead.
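In other words, what I am aiming for is roughly this (a sketch; dataset is the Dataset object with the text and label columns described above):

dataset.set_transform(tokenize_function)
# the transform now runs on the fly whenever rows are accessed,
# so the tokenised columns are never written to disk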

When using the map method with tokenize_function on the dataset (i.e. dataset = dataset.map(tokenize_function)), I get a dataset with the following columns:

Dataset({
    features: ['text', 'label', 'input_ids', 'attention_mask'],
    num_rows: 1
})

which is what I expect.

However, if I use dataset.set_transform(tokenize_function), dataset yields

Dataset({
    features: ['text', 'label'],
    num_rows: 1
})

and dataset[0] = {'input_ids': [0, 4, 9, …], 'attention_mask': [1, 1, 1, …]}; there is no entry for label. When I pass the transformed dataset to Trainer, I get the following error:

ValueError: The model did not return a loss from the inputs, only the following keys: logits. For reference, the inputs it received are input_ids,attention_mask.

What am I doing wrong?

Many thanks in advance.

PS! I know similar questions have been asked, but I couldn't find a recent one that addresses the above specifically.

For what it is worth, Copilot suggested the following (and now rather obvious) solution: modify tokenize_function as follows:

def tokenize_function(examples, col='text', max_length=768):
    tokenized_inputs = tokenizer(examples[col], truncation=True, padding='max_length', max_length=max_length)
    # Include the labels so Trainer can compute the loss
    tokenized_inputs["labels"] = examples["label"]

    return tokenized_inputs

which allowed me to progress.
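As far as I understand it, the reason is that the dict returned by the transform completely replaces what dataset[idx] yields, so the original label column is invisible to Trainer unless the transform returns it explicitly under the key labels. For completeness, this is roughly how the pieces now fit together (output_dir and the training arguments below are just placeholder values):

from transformers import Trainer, TrainingArguments

dataset.set_transform(tokenize_function)
# dataset[0] now contains input_ids, attention_mask and labels,
# so Trainer can compute a loss

training_args = TrainingArguments(output_dir="out", per_device_train_batch_size=8)
trainer = Trainer(
    model=model_Mol_seq2seq,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()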
