Hello,
I want to train a Reformer for a sequence classification task. The inputs are protein sequences, so I trained a new SentencePiece tokenizer and then loaded it as a ReformerTokenizer, as shown below.
import sentencepiece as spm
from transformers import ReformerTokenizer

spm.SentencePieceTrainer.train(input='./sequences_scope.txt', model_prefix='REFORM', max_sentence_length=2000, vocab_size=25)
tokenizer = ReformerTokenizer("REFORM.model", padding=True)
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
The dataset was created as follows:
from datasets import load_dataset

with open("sequences_class_scope.csv", "w") as fp:
    fp.write("idx,sequence,label\n")
    for n, i in enumerate(sequences):
        fp.write(str(n) + "," + i + "," + str(labels[n]) + "\n")
dataset = load_dataset('csv', data_files='sequences_class_scope.csv', split='train[:60%]')
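For reference, the same file layout can also be written with the standard csv module. This is only a minimal sketch: the three sequences and labels are made-up stand-ins for my real data, and the file goes to a temporary directory. It mainly shows that each row carries a single label value per sequence.

```python
import csv
import os
import tempfile

# Toy stand-ins for the real data (made up for illustration).
sequences = ["MKTAYIAK", "GSHMLEDP", "MVLSPADK"]
labels = [0, 1, 0]

# Write to a temporary directory so the sketch does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "sequences_class_scope.csv")

# csv.writer takes care of quoting and line endings.
with open(path, "w", newline="") as fp:
    writer = csv.writer(fp)
    writer.writerow(["idx", "sequence", "label"])
    for n, (seq, label) in enumerate(zip(sequences, labels)):
        writer.writerow([n, seq, label])

# Read the file back to check the columns: one label value per sequence.
with open(path, newline="") as fp:
    rows = list(csv.DictReader(fp))

print(rows[0])
```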
The model, data collator, and trainer were then defined as follows:
from transformers import (
    DataCollatorForTokenClassification,
    ReformerConfig,
    ReformerForSequenceClassification,
    Trainer,
    TrainingArguments,
)

config = ReformerConfig(
    vocab_size=25,
    max_position_embeddings=2000,
    num_attention_heads=12,
    num_hidden_layers=6,
)
model = ReformerForSequenceClassification(config=config)
data_collator = DataCollatorForTokenClassification(
    tokenizer=tokenizer, max_length=2000
)
training_args = TrainingArguments(
    output_dir="./pREFORMo",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_gpu_train_batch_size=64,
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
)
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)
trainer.train()
I am currently getting the following error:
ValueError: You should supply an encoding or a list of encodings to this method that includes input_ids, but you provided ['label']
If I map my dataset to include the input_ids and attention_masks, I get this error instead:
/usr/local/lib/python3.7/dist-packages/transformers/data/data_collator.py in <listcomp>(.0)
    186         padding_side = self.tokenizer.padding_side
    187         if padding_side == "right":
--> 188             batch["labels"] = [label + [self.label_pad_token_id] * (sequence_length - len(label)) for label in labels]
    189         else:
    190             batch["labels"] = [[self.label_pad_token_id] * (sequence_length - len(label)) + label for label in labels]
TypeError: object of type 'int' has no len()
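The failing expression seems to be reproducible without the library. Below is a minimal sketch, assuming my labels are plain integers (one per sequence); the helper name pad_labels and the toy label values are made up for illustration, and only the padding expression is copied from the traceback.

```python
def pad_labels(labels, sequence_length, label_pad_token_id=-100):
    """Right-pad each label list, mirroring the expression in the traceback."""
    return [
        label + [label_pad_token_id] * (sequence_length - len(label))
        for label in labels
    ]

# Per-token labels (one list per sequence) pad without problems.
token_labels = [[1, 0, 2], [0, 1]]
padded = pad_labels(token_labels, sequence_length=4)

# One integer per sequence, like the label column in my CSV, fails in len().
sequence_labels = [0, 1]
try:
    pad_labels(sequence_labels, sequence_length=4)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(padded)
print(error_message)
```

So it looks to me as if this collator expects a list of labels per token rather than a single integer label per sequence, but I am not sure whether that is the actual cause.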
Also, my end goal is to train the Reformer and then use it to generate embeddings rather than to classify sequences. Is this the correct approach for that?