Training a reformer from scratch

Hello,

I want to train a Reformer for a sequence classification task. The sequences are protein sequences, so I trained a new SentencePiece tokenizer and then loaded it as a Reformer tokenizer, as defined below.

import sentencepiece as spm
from transformers import ReformerTokenizer

spm.SentencePieceTrainer.train(input='./sequences_scope.txt', model_prefix='REFORM', max_sentence_length=2000, vocab_size=25)

tokenizer = ReformerTokenizer("REFORM.model", padding=True)
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
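
As a quick sanity check (the sequence string here is just an illustration), an encoded sequence should contain input_ids and an attention_mask:

enc = tokenizer("MKVLAAGIVQQ")  # arbitrary protein fragment, for illustration only
print(enc.keys())  # expect input_ids and attention_mask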

The dataset was created as below -

with open("sequences_class_scope.csv", "w") as fp:
  fp.write("idx,sequence,label\n")
  for n,i in enumerate(sequences):
    fp.write(str(n)+","+i+","+str(labels[n])+"\n")

from datasets import load_dataset

dataset = load_dataset('csv', data_files='sequences_class_scope.csv', split='train[:60%]')

And the model was then defined as follows -

from transformers import ReformerConfig, ReformerForSequenceClassification

config = ReformerConfig(
    vocab_size=25,
    max_position_embeddings=2000,
    num_attention_heads=12,
    num_hidden_layers=6,
)
model = ReformerForSequenceClassification(config=config)

from transformers import DataCollatorForTokenClassification, Trainer, TrainingArguments

data_collator = DataCollatorForTokenClassification(
    tokenizer=tokenizer, max_length = 2000
)
training_args = TrainingArguments(
    output_dir="./pREFORMo",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_gpu_train_batch_size=64,
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

trainer.train()

I am currently getting the error ValueError: You should supply an encoding or a list of encodings to this method that includes input_ids, but you provided ['label']. If I map my dataset to include the input_ids and attention_mask, I get the following error instead:

/usr/local/lib/python3.7/dist-packages/transformers/data/data_collator.py in <listcomp>(.0)
    186         padding_side = self.tokenizer.padding_side
    187         if padding_side == "right":
--> 188             batch["labels"] = [label + [self.label_pad_token_id] * (sequence_length - len(label)) for label in labels]
    189         else:
    190             batch["labels"] = [[self.label_pad_token_id] * (sequence_length - len(label)) + label for label in labels]

TypeError: object of type 'int' has no len()
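
For reference, the mapping step was roughly along these lines (a sketch; my exact call may have differed slightly):

def tokenize(batch):
    # pad/truncate every protein sequence to the model's maximum length
    return tokenizer(batch["sequence"], padding="max_length", truncation=True, max_length=2000)

dataset = dataset.map(tokenize, batched=True)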

Also, my end goal is to train the Reformer and then use it to generate embeddings rather than for classification. Is this the correct approach to do so?

hey @choke what does an example in your tokenized inputs look like?

from the error, i think you could try renaming the target column to labels which is what the Trainer expects by default
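
with the datasets library that would be something like:

dataset = dataset.rename_column("label", "labels")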

@lewtun here is an example of a sequence after encoding.

{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'input_ids': [23, 20, 5, 17, 8, 16, 15, 4, 14, 18, 7, 4, 12, 5, 7, 7, 3, 3, 8, 17, 3, 6, 10, 11, 16, 18, 7, 13, 6, 4, 8, 19, 14, 3, 3, 9, 14, 12, 16, 14, 8, 10, 3, 11, 7, 11, 4, 13, 9, 17, 9, 17, 16, 3, 11, 17, 18, 3, 5, 5, 14, 14, 3, 18, 11, 7, 7, 19, 5, 19, 14, 20, 3, 13, 4, 13, 19, 3, 14, 16, 14, 12, 11, 15, 7, 13, 4, 10, 4, 21, 3, 8, 22, 20, 9, 10, 4, 20, 10, 19, 6, 5, 3, 7, 5, 7, 12, 13, 7, 16, 3, 16, 5, 13, 3, 7, 3, 11, 4, 13, 19, 20, 6, 15, 17, 11, 7, 4, 7, 10, 13, 8, 8], 'label': 0}

The renaming has no effect, as can be seen below, because the collator treats both names as equivalent. The labels are meant to be integers from 0 to n-1, right, where n is the number of classes?

label_name = "label" if "label" in features[0].keys() else "labels"
labels = [feature[label_name] for feature in features] if label_name in features[0].keys() else None

So I tried one-hot encoding the labels and also wrapping each label in a list ([label]) so that len(label) works. This causes a mismatch in the target batch size and gives errors like ValueError: Expected input batch_size (64) to match target batch_size (262144).

sorry i only just saw this - if your goal is to extract embeddings, then my suggestion would be to train / fine-tune the reformer as a language model on your corpus (no labels needed!). you can find tutorials on how to do this for both cases (from scratch vs fine-tune) here: 🤗 Transformers Notebooks
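
as a rough sketch (reusing your config, tokenizer and training args from above - the names and settings here are just illustrative), the causal language modelling setup could look like this:

from transformers import DataCollatorForLanguageModeling, ReformerModelWithLMHead

# no labels are needed for language modelling, so drop the classification column
lm_dataset = dataset.remove_columns("label")

# reuse the ReformerConfig from before; the LM-head variant expects a decoder-style config
config.is_decoder = True
model = ReformerModelWithLMHead(config=config)

# mlm=False means causal LM: the collator copies input_ids into labels for you
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=lm_dataset,
)
trainer.train()

once it's trained, you can load the base ReformerModel from the checkpoint and use the last hidden states as your embeddings.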

regarding your error, i now see you used DataCollatorForTokenClassification which is not the correct choice for sequence classification tasks. for these tasks, you can just use the default DataCollatorWithPadding which can be activated by passing the tokenizer to the Trainer:

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer
)
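
equivalently, you can construct the collator explicitly if you prefer (a sketch, with the padding settings as placeholders):

from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding="max_length", max_length=2000)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)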

@lewtun hey, I got the code to work somehow. I am using Colab Pro to train, but since the runtime is limited, I am only able to train for about 10 epochs. I wanted to know what the typical training time/number of epochs is for such models, and whether there is any way to train these models on a much smaller GPU?

Also, would training a token classifier directly give the same results as training a language model first and then fine-tuning it with a token classification head?