Environment info
- transformers version: 4.8.2
- Platform: macOS-10.16-x86_64-i386-64bit
- Python version: 3.8.3
- PyTorch version (GPU?): 1.9.0 (False)
- Tensorflow version (GPU?): 2.2.0-rc3 (False)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Models:
- CANINE Transformer
The model I am using is CANINE.
I have a set of obfuscated documents consisting of around 30,000 sentences, each with a label (12 labels in total), so this is a multi-class classification problem.
(The data has been obfuscated, however the patterns in them are preserved)
A single record looks like this:
satwamuluhqgulamlrmvezuhqvkrpmletwulcitwskuhlemvtwamuluhiwiwenuhlrvimvqvkruhulenamuluhqgqvtwvimviwuhtwamuluhulqvkrenamcitwuhvipmpmqvuhskiwkrpmdfuhlrvimvskvikrpmqvuhskmvgzenleuhqvmvamuluhulenamuluhqvletwtwvipmpmgzleenamuhtwamuluhtwletwdfuhiwkrxeleentwxeuhpmqvuhtwiwmvamdfuhpkeztwamuluhvimvuhqvtwmkpmpmlelruhgztwtwskuhtwlrkrpmlruhpmuluhqvenuhtwyplepmxeuhenuhamypkrqvuhamulmvdfuhqvskentwamletwlrlrpmiwuhtwamul
So I decided to try CANINE, since it works directly on character encodings. But I am facing an issue; the code and the exception are below.
with open('xtrain_obfuscated.txt') as f:
    x = f.read().splitlines()
with open('ytrain.txt') as f:
    y = f.read().splitlines()

import torch
from transformers import CanineConfig, CanineForSequenceClassification, CanineForMultipleChoice, CanineForTokenClassification
from sklearn.model_selection import train_test_split

x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.2)

from transformers import CanineTokenizer, CanineModel
from transformers import Trainer, TrainingArguments, CanineForMultipleChoice

tokenizer = CanineTokenizer(model_max_length=512)
tokens_train = tokenizer(x_train, padding='longest', return_tensors='pt')
tokens_val = tokenizer(x_val, padding='longest', return_tensors='pt')

class NovelClassificationDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(int(self.labels[idx]))
        return item

    def __len__(self):
        #print(len(self.labels))
        return len(self.labels)

train_dataset = NovelClassificationDataset(tokens_train, y_train)
val_dataset = NovelClassificationDataset(tokens_val, y_val)

model = CanineForSequenceClassification.from_pretrained("google/canine-s", num_labels=12, problem_type="multi_label_classification")

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=10,             # total number of training epochs
    per_device_train_batch_size=13,  # batch size per device during training
    per_device_eval_batch_size=32,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

trainer = Trainer(
    model=model,                     # the instantiated 🤗 Transformers model to be trained
    args=training_args,              # training arguments, defined above
    train_dataset=train_dataset,     # training dataset
    eval_dataset=val_dataset         # evaluation dataset
)
trainer.train()
The exception is:
~/opt/anaconda3/envs/task/lib/python3.8/site-packages/torch/nn/functional.py in binary_cross_entropy_with_logits(input, target, weight, size_average, reduce, reduction, pos_weight)
2578
2579 if not (target.size() == input.size()):
-> 2580 raise ValueError("Target size ({}) must be the same as input size ({})".format(target.size(), input.size()))
2581
2582 return torch.binary_cross_entropy_with_logits(input, target, weight, pos_weight, reduction_enum)
ValueError: Target size (torch.Size([13])) must be the same as input size (torch.Size([13, 12]))
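The shape mismatch in the traceback can be reproduced without CANINE or the Trainer. The following is a minimal sketch, assuming the labels are integer class ids (one per sentence), as in the dataset class above. The tensor names and the random values are made up for illustration. It shows that `binary_cross_entropy_with_logits`, the function named in the traceback, wants a target with the same shape as the logits, while `cross_entropy` accepts one class index per example.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

batch_size, num_labels = 13, 12  # same sizes as reported in the traceback

# Hypothetical stand-ins for the model output and the integer labels.
logits = torch.randn(batch_size, num_labels)          # shape [13, 12]
labels = torch.randint(0, num_labels, (batch_size,))  # shape [13]

# Multi-label style loss: the target must have the same shape as the logits,
# so a [13] target against [13, 12] logits is rejected.
try:
    F.binary_cross_entropy_with_logits(logits, labels.float())
    bce_error = None
except ValueError as err:
    bce_error = str(err)
print(bce_error)

# Single-label (multi-class) loss: one class index per example is accepted.
ce_loss = F.cross_entropy(logits, labels)
print(ce_loss.shape)

# If a multi-label loss were really intended, the targets would need to be
# float tensors of shape [13, 12], for example one-hot rows.
one_hot = F.one_hot(labels, num_classes=num_labels).float()
bce_loss = F.binary_cross_entropy_with_logits(logits, one_hot)
print(bce_loss.shape)
```

If each sentence has exactly one of the 12 labels, this suggests the setting to look at is `problem_type`: as far as I understand the sequence classification heads in transformers, `"multi_label_classification"` selects the binary cross-entropy loss, while `"single_label_classification"` (or leaving `problem_type` unset with integer labels and `num_labels > 1`) selects cross-entropy.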