Hey @lewtun
I think I was able to train the model.
Here’s my code (just a few edits from the one you provided):
# Import libraries
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
datasets = load_dataset("super_glue", "boolq") # Use boolq as dataset for preliminary test
# Load model checkpoint
model_ckpt = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
# Tokenize (question, passage) pairs; truncate only the passage if the pair is too long
boolq_enc = datasets.map(lambda x: tokenizer(x["question"], x["passage"], truncation="only_second"), batched=True)
# Sequence-classification head; num_labels defaults to 2, which matches BoolQ's yes/no labels
model = AutoModelForSequenceClassification.from_pretrained(model_ckpt)
# Define training parameters
batch_size = 8
args = TrainingArguments(
    "test-squad",  # output directory
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=1,  # set to 1 to speed up this preliminary test
    weight_decay=0.01,
)
from transformers import default_data_collator
data_collator = default_data_collator
trainer = Trainer(
    model,
    args,
    train_dataset=boolq_enc["train"],
    eval_dataset=boolq_enc["validation"],
    # data_collator=data_collator,  # I had to comment this out because it was not working
    tokenizer=tokenizer,
)
trainer.train()
The training was OK (I reduced the number of epochs to 1 to speed things up, since this is just a preliminary test). The only thing that bothers me is that I had to comment out the data_collator parameter because it was throwing an error; with it commented out, training runs fine. My guess is that default_data_collator can't stack the variable-length sequences, since I didn't pad in the map call, whereas with data_collator left unset the Trainer falls back to DataCollatorWithPadding (because I passed the tokenizer), which pads each batch dynamically.
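If that guess is right, passing a padding collator explicitly should also work. A minimal sketch of what I mean (untested, using transformers' DataCollatorWithPadding):

from transformers import DataCollatorWithPadding

# Pads every batch to the longest sequence in that batch, which is
# the same collator the Trainer falls back to when a tokenizer is passed
data_collator = DataCollatorWithPadding(tokenizer)

trainer = Trainer(
    model,
    args,
    train_dataset=boolq_enc["train"],
    eval_dataset=boolq_enc["validation"],
    data_collator=data_collator,  # now explicit instead of commented out
    tokenizer=tokenizer,
)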
Now I would like to test the model on a new passage and question:
text = r"""Persian (/ˈpɜːrʒən, -ʃən/), also known by its endonym Farsi (فارسی fārsi
(fɒːɾˈsiː) ( listen)), is one of the Western Iranian languages within the Indo-Iranian
branch of the Indo-European language family. It is primarily spoken in Iran, Afghanistan
(officially known as Dari since 1958), and Tajikistan (officially known as Tajiki
since the Soviet era), and some other regions which historically were Persianate
societies and considered part of Greater Iran. It is written in the Persian alphabet,
a modified variant of the Arabic script, which itself evolved from the Aramaic alphabet."""
questions = [
    "do iran and afghanistan speak the same language",
]
for question in questions:
    inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]
    text_tokens = tokenizer.convert_ids_to_tokens(input_ids)  # just to inspect the tokenization
    outputs = model(**inputs)
My assumption was that the two logits in outputs represent yes and no, so that converting them to probabilities would give me the answer to my question. What is not clear to me is this: if I ask the opposite question on the same text (e.g. "do iran and afghanistan speak a different language"), I would have expected the probabilities to flip, but they don't. What am I doing wrong?
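For reference, this is how I'm turning the logits into probabilities (a minimal sketch; I'm assuming index 0 corresponds to BoolQ's False label and index 1 to True, which may well be where I'm going wrong):

import torch

# Softmax over the two logits to get class probabilities;
# super_glue/boolq encodes its labels as 0 = False, 1 = True
probs = torch.softmax(outputs.logits, dim=-1)[0]
print(f"P(no)  = {probs[0]:.3f}")
print(f"P(yes) = {probs[1]:.3f}")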
Speaking of your model, it is not clear to me how it works. I quickly read the paper you pointed me to, and from it I understand that the inputs for this task are a passage and a question, but I don't see them in your model's deployment. Also, what do LABEL_0 and LABEL_1 represent?
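My guess is that LABEL_0 and LABEL_1 are just the default names a classification head gets when no id2label mapping is set in the config. If so, I suppose something like this would give readable labels (just a sketch of my assumption, not how your model was actually configured):

model = AutoModelForSequenceClassification.from_pretrained(
    model_ckpt,
    num_labels=2,
    id2label={0: "no", 1: "yes"},   # assuming 0 = False, 1 = True as in BoolQ
    label2id={"no": 0, "yes": 1},
)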
Thanks a lot and sorry for such a long post 