Training a reformer from scratch

Hello,

I want to train a Reformer for a sequence classification task. The sequences are protein sequences, so I trained a new SentencePiece tokenizer and then loaded it as a Reformer tokenizer, as defined below.

import sentencepiece as spm
from transformers import ReformerTokenizer

spm.SentencePieceTrainer.train(input='./sequences_scope.txt', model_prefix='REFORM', max_sentence_length=2000, vocab_size=25)

tokenizer = ReformerTokenizer("REFORM.model", padding=True)
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
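
As a quick sanity check (the sequence string here is just an illustration), an encoded sequence should contain input_ids and an attention_mask:

enc = tokenizer("MKVLAAGIVQQ")  # arbitrary protein fragment, for illustration only
print(enc.keys())  # expect input_ids and attention_mask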

The dataset was created as below -

with open("sequences_class_scope.csv", "w") as fp:
  fp.write("idx,sequence,label\n")
  for n,i in enumerate(sequences):
    fp.write(str(n)+","+i+","+str(labels[n])+"\n")

from datasets import load_dataset

dataset = load_dataset('csv', data_files='sequences_class_scope.csv', split='train[:60%]')

And the model was then defined as follows -

from transformers import ReformerConfig, ReformerForSequenceClassification

config = ReformerConfig(
    vocab_size=25,
    max_position_embeddings=2000,
    num_attention_heads=12,
    num_hidden_layers=6,
)
model = ReformerForSequenceClassification(config=config)

from transformers import DataCollatorForTokenClassification, Trainer, TrainingArguments

data_collator = DataCollatorForTokenClassification(
    tokenizer=tokenizer, max_length = 2000
)
training_args = TrainingArguments(
    output_dir="./pREFORMo",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_gpu_train_batch_size=64,
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

trainer.train()

I am currently getting the error ValueError: You should supply an encoding or a list of encodings to this method that includes input_ids, but you provided ['label']. If I map my dataset to include the input_ids and attention_mask, I get the following error instead:

/usr/local/lib/python3.7/dist-packages/transformers/data/data_collator.py in <listcomp>(.0)
    186         padding_side = self.tokenizer.padding_side
    187         if padding_side == "right":
--> 188             batch["labels"] = [label + [self.label_pad_token_id] * (sequence_length - len(label)) for label in labels]
    189         else:
    190             batch["labels"] = [[self.label_pad_token_id] * (sequence_length - len(label)) + label for label in labels]

TypeError: object of type 'int' has no len()
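
For reference, the mapping step was roughly along these lines (a sketch; my exact call may have differed slightly):

def tokenize(batch):
    # pad/truncate every protein sequence to the model's maximum length
    return tokenizer(batch["sequence"], padding="max_length", truncation=True, max_length=2000)

dataset = dataset.map(tokenize, batched=True)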

Also, my end goal is to train the Reformer and then use it to generate embeddings rather than for classification. Is this the correct approach to do so?

hey @choke what does an example in your tokenized inputs look like?

from the error, i think you could try renaming the target column to labels which is what the Trainer expects by default
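
with the datasets library that would be something like:

dataset = dataset.rename_column("label", "labels")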

@lewtun here is an example of a sequence after encoding.

{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'input_ids': [23, 20, 5, 17, 8, 16, 15, 4, 14, 18, 7, 4, 12, 5, 7, 7, 3, 3, 8, 17, 3, 6, 10, 11, 16, 18, 7, 13, 6, 4, 8, 19, 14, 3, 3, 9, 14, 12, 16, 14, 8, 10, 3, 11, 7, 11, 4, 13, 9, 17, 9, 17, 16, 3, 11, 17, 18, 3, 5, 5, 14, 14, 3, 18, 11, 7, 7, 19, 5, 19, 14, 20, 3, 13, 4, 13, 19, 3, 14, 16, 14, 12, 11, 15, 7, 13, 4, 10, 4, 21, 3, 8, 22, 20, 9, 10, 4, 20, 10, 19, 6, 5, 3, 7, 5, 7, 12, 13, 7, 16, 3, 16, 5, 13, 3, 7, 3, 11, 4, 13, 19, 20, 6, 15, 17, 11, 7, 4, 7, 10, 13, 8, 8], 'label': 0}

The renaming has no effect, as can be seen below, because the collator treats both names as equivalent. The labels are meant to be integers from 0 to n-1, right, where n is the number of classes?

label_name = "label" if "label" in features[0].keys() else "labels"
labels = [feature[label_name] for feature in features] if label_name in features[0].keys() else None

So I tried one-hot encoding the labels and also wrapping each label in a list ([label]) so that len(label) works. This causes a mismatch in the target batch size and gives errors like ValueError: Expected input batch_size (64) to match target batch_size (262144).

sorry i only just saw this - if your goal is to extract embeddings, then my suggestion would be to train / fine-tune the reformer as a language model on your corpus (no labels needed!). you can find tutorials on how to do this for both cases (from scratch vs fine-tune) here: 🤗 Transformers Notebooks
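
as a rough sketch (reusing your config, tokenizer and training args from above - the names and settings here are just illustrative), the causal language modelling setup could look like this:

from transformers import DataCollatorForLanguageModeling, ReformerModelWithLMHead

# no labels are needed for language modelling, so drop the classification column
lm_dataset = dataset.remove_columns("label")

# reuse the ReformerConfig from before; the LM-head variant expects a decoder-style config
config.is_decoder = True
model = ReformerModelWithLMHead(config=config)

# mlm=False means causal LM: the collator copies input_ids into labels for you
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=lm_dataset,
)
trainer.train()

once it's trained, you can load the base ReformerModel from the checkpoint and use the last hidden states as your embeddings.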

regarding your error, i now see you used DataCollatorForTokenClassification which is not the correct choice for sequence classification tasks. for these tasks, you can just use the default DataCollatorWithPadding which can be activated by passing the tokenizer to the Trainer:

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer
)
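
equivalently, you can construct the collator explicitly if you prefer (a sketch, with the padding settings as placeholders):

from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding="max_length", max_length=2000)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)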

@lewtun hey, I got the code to work somehow. I am using Colab Pro to train, but since the runtime is limited, I am only able to train for about 10 epochs. I wanted to know what the typical training time/number of epochs is for such models, and whether there is any way to train these models on a much smaller GPU?

Also, would training a token classifier directly give the same results as training a language model first and then fine-tuning it with a token classification head?