Obfuscated text classification error when using CANINE Transformers

Environment info

  • transformers version: 4.8.2
  • Platform: macOS-10.16-x86_64-i386-64bit
  • Python version: 3.8.3
  • PyTorch version (GPU?): 1.9.0 (False)
  • Tensorflow version (GPU?): 2.2.0-rc3 (False)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Models:

  • CANINE Transformer

The model I am using is CANINE.

I have a set of obfuscated documents consisting of around 30,000 sentences, each with a label (12 labels in total), so this is a multi-class classification problem.
(The data has been obfuscated, but the patterns in it are preserved.)

A single record looks like this:

satwamuluhqgulamlrmvezuhqvkrpmletwulcitwskuhlemvtwamuluhiwiwenuhlrvimvqvkruhulenamuluhqgqvtwvimviwuhtwamuluhulqvkrenamcitwuhvipmpmqvuhskiwkrpmdfuhlrvimvskvikrpmqvuhskmvgzenleuhqvmvamuluhulenamuluhqvletwtwvipmpmgzleenamuhtwamuluhtwletwdfuhiwkrxeleentwxeuhpmqvuhtwiwmvamdfuhpkeztwamuluhvimvuhqvtwmkpmpmlelruhgztwtwskuhtwlrkrpmlruhpmuluhqvenuhtwyplepmxeuhenuhamypkrqvuhamulmvdfuhqvskentwamletwlrlrpmiwuhtwamul

So I decided to try CANINE, since it works directly on characters. However, I am running into an issue. The code and the exception are below.

import torch
from sklearn.model_selection import train_test_split
from transformers import (
    CanineForSequenceClassification,
    CanineTokenizer,
    Trainer,
    TrainingArguments,
)

# Line i of ytrain.txt is the label for line i of xtrain_obfuscated.txt
with open('xtrain_obfuscated.txt') as f:
    x = f.read().splitlines()
with open('ytrain.txt') as f:
    y = f.read().splitlines()

x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.2)

tokenizer = CanineTokenizer(model_max_length=512)

tokens_train = tokenizer(x_train, padding='longest', return_tensors='pt')
tokens_val = tokenizer(x_val, padding='longest', return_tensors='pt')

class NovelClassificationDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        # Labels are read from the file as strings, so convert to an integer class id
        item['labels'] = torch.tensor(int(self.labels[idx]))
        return item

    def __len__(self):
        return len(self.labels)


train_dataset = NovelClassificationDataset(tokens_train, y_train)
val_dataset = NovelClassificationDataset(tokens_val, y_val)
model = CanineForSequenceClassification.from_pretrained("google/canine-s", num_labels=12, problem_type="multi_label_classification")


training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=10,              # total number of training epochs
    per_device_train_batch_size=13,  # batch size per device during training
    per_device_eval_batch_size=32,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset             # evaluation dataset
)

trainer.train()

The exception is:

~/opt/anaconda3/envs/task/lib/python3.8/site-packages/torch/nn/functional.py in binary_cross_entropy_with_logits(input, target, weight, size_average, reduce, reduction, pos_weight)
   2578 
   2579     if not (target.size() == input.size()):
-> 2580         raise ValueError("Target size ({}) must be the same as input size ({})".format(target.size(), input.size()))
   2581 
   2582     return torch.binary_cross_entropy_with_logits(input, target, weight, pos_weight, reduction_enum)

ValueError: Target size (torch.Size([13])) must be the same as input size (torch.Size([13, 12]))
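
For reference, the shape mismatch in the traceback can be reproduced without CANINE or the real data. The following is a minimal sketch, assuming that `problem_type="multi_label_classification"` makes the model use a binary cross-entropy loss (which is what the traceback suggests) and that each record has exactly one of the 12 labels. The tensors are random stand-ins and the variable names are only illustrative.

```python
import torch
import torch.nn.functional as F

batch_size, num_labels = 13, 12  # sizes taken from the error message

# Stand-in for the model output: one score per class for each example
logits = torch.randn(batch_size, num_labels)
# Stand-in for the dataset labels: one integer class id per example
labels = torch.randint(0, num_labels, (batch_size,))

# Multi-class (one label per example): cross-entropy takes class ids directly
ce_loss = F.cross_entropy(logits, labels)

# Multi-label: BCE-with-logits requires a target with the same shape as the logits,
# so a [13] target against [13, 12] logits raises the ValueError seen above
try:
    F.binary_cross_entropy_with_logits(logits, labels.float())
    bce_error = None
except ValueError as err:
    bce_error = str(err)

# If the task really were multi-label, the targets would have to be
# float vectors of shape [batch_size, num_labels], e.g. one-hot encoded
multi_hot = F.one_hot(labels, num_classes=num_labels).float()
bce_loss = F.binary_cross_entropy_with_logits(logits, multi_hot)

print(bce_error)
```

If that assumption holds, the integer labels produced by the dataset above match the multi-class case, so passing `problem_type="single_label_classification"` (or leaving `problem_type` unset) to `from_pretrained` would be the setting to try. This has not been checked against the full dataset here.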