Question answering bot: yes/no answers

Hello everybody

Following this guide I was able to fine-tune an already fine-tuned model for Question Answering (this one).

Now I am wondering if it is possible for a QA bot to produce “yes”/“no” as an answer and, if so, how this would be done. The guide clearly states that this kind of model cannot return an answer that is not explicitly in the text (because that would mean generating text).

So my questions are:

  • Can I adjust the code provided in the notebook to train a model that is able to produce a yes/no answer?
  • Or should I use another model, or the same model with different training?
  • If I have to use another model, is it possible to combine the two in some way?
  • How should data for training yes/no answers be provided? Could you point to some docs/examples?

Thanks a lot!

Hey @Neuroinformatica, if you already have labelled data then my suggestion would be to frame the problem as an entailment one, i.e. given a (question, passage) pair, predict a boolean yes/no value.

This is the approach taken in the BoolQ paper, and fine-tuning models for it is pretty straightforward, e.g.

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# BoolQ is distributed as part of the SuperGLUE benchmark
boolq = load_dataset("super_glue", "boolq")
model_ckpt = ...
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
# tokenize the (question, passage) pairs, truncating only the passage
boolq_enc = boolq.map(lambda x: tokenizer(x['question'], x['passage'], truncation="only_second"), batched=True)
model = AutoModelForSequenceClassification.from_pretrained(model_ckpt)
# fine-tune with Trainer or whatever method ...

I’ve fine-tuned a few BERT models this way on the Hub, e.g. here: lewtun/bert-large-uncased-wwm-finetuned-boolq · Hugging Face

Many thanks @lewtun for your quick and on-point answer, as always!
I will try that and let you know :slight_smile:

Hey @lewtun

I think I was able to train the model.
Here’s my code (just a few edits from the one you provided):


# Import libraries
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments

datasets = load_dataset("super_glue", "boolq") # Use boolq as dataset for preliminary test

# Load model checkpoint
model_ckpt = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
boolq_enc = datasets.map(lambda x : tokenizer(x['question'], x['passage'], truncation="only_second"), batched=True)

model = AutoModelForSequenceClassification.from_pretrained(model_ckpt)

# Define training parameters
batch_size = 8
args = TrainingArguments(
    f"test-squad",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=1, # Set num_train_epochs to 1 as test
    weight_decay=0.01,
)

from transformers import default_data_collator
data_collator = default_data_collator

trainer = Trainer(
    model,
    args,
    train_dataset=boolq_enc["train"],
    eval_dataset=boolq_enc["validation"],
    # data_collator=data_collator, # I had to comment this because it was not working with the default collator
    tokenizer=tokenizer,
)

trainer.train()

The training went fine (I reduced the number of epochs to 1 to speed things up, since this is just a preliminary test). The only thing that bothers me is that I had to comment out the data_collator parameter because it was throwing an error; with it commented out, training runs without problems.

Now I would like to:

  • Evaluate the model to get the accuracy score: how can I do that?

  • Use the model to get answers to single questions. I tried code similar to what is used for a normal extractive QA bot:

text = r"""Persian (/ˈpɜːrʒən, -ʃən/), also known by its endonym Farsi (فارسی fārsi
(fɒːɾˈsiː) ( listen)), is one of the Western Iranian languages within the Indo-Iranian
branch of the Indo-European language family. It is primarily spoken in Iran, Afghanistan
(officially known as Dari since 1958), and Tajikistan (officially known as Tajiki
since the Soviet era), and some other regions which historically were Persianate
societies and considered part of Greater Iran. It is written in the Persian alphabet,
a modified variant of the Arabic script, which itself evolved from the Aramaic alphabet."""

questions = [
    "do iran and afghanistan speak the same language"
]

import torch

for question in questions:
    inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]
    # print(inputs["input_ids"].tolist())
    text_tokens = tokenizer.convert_ids_to_tokens(input_ids)
    outputs = model(**inputs)
    # turn the two logits into probabilities over the yes/no classes
    probs = torch.softmax(outputs.logits, dim=-1)

My assumption was that the two logits in the output represent “yes” and “no”, so that “translating” them to probabilities would give me the answer to my question. What is not clear to me is that when I ask the opposite question on the same text (e.g. "do iran and afghanistan speak a different language"), I would have expected the resulting probabilities to flip as well. What am I doing wrong?


Speaking of your model, it is not clear to me how it works. I quickly read the paper you pointed me to, and from it I see that the inputs for this task are a passage and a question, but I don’t see them in your model’s deployment on the Hub. Moreover, what do LABEL_0 and LABEL_1 represent?

Thanks a lot and sorry for such a long post :sweat_smile:

Hey @Neuroinformatica,

The only thing that bothers me is that I had to comment out the data_collator parameter because it was throwing an error; with it commented out, training runs without problems.

Ah, that’s because for sequence classification tasks you need to either pad the examples during the tokenization step or pad on the fly in the Trainer. By default the Trainer uses DataCollatorWithPadding (since you pass it a tokenizer), which is why commenting out data_collator works in your case but training fails with default_data_collator (which does no padding). You can find some extra info here: Using data collators for training and error analysis | Lewis Tunstall’s Blog
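
Concretely, either of these should work (a rough sketch reusing the names from your snippet):

# Option 1: pad everything up front when tokenizing
boolq_enc = datasets.map(lambda x: tokenizer(x['question'], x['passage'], truncation="only_second", padding="max_length"), batched=True)

# Option 2: pad each batch on the fly with an explicit collator
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model,
    args,
    train_dataset=boolq_enc["train"],
    eval_dataset=boolq_enc["validation"],
    data_collator=data_collator, # dynamic padding, same as the Trainer default when a tokenizer is passed
    tokenizer=tokenizer,
)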

Evaluate the model to get the accuracy score: how can I do that?

You can pass the accuracy metric to the Trainer via the compute_metrics function:

from datasets import load_metric
import numpy as np

accuracy_score = load_metric('accuracy')

def compute_metrics(pred):
    predictions, labels = pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy_score.compute(predictions=predictions, references=labels)

This will compute the accuracy during the evaluation step of training.
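
To hook it up, just pass the function to the Trainer when you create it, e.g. reusing the arguments from your snippet (a sketch):

trainer = Trainer(
    model,
    args,
    train_dataset=boolq_enc["train"],
    eval_dataset=boolq_enc["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics, # report accuracy at every evaluation
)

Calling trainer.evaluate() after training will also return the accuracy on the validation set.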

My assumption was that the two logits in the output represent “yes” and “no”, so that “translating” them to probabilities would give me the answer to my question

Yes, this is correct and your code makes sense. A slightly more elegant approach would be to wrap the preprocessing / inference in a custom Pipeline class, so that you can pass string inputs and get class probabilities in one go.
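
In fact, the built-in TextClassificationPipeline already accepts (text, text_pair) inputs, so something along these lines should work with the model, tokenizer and passage text from your snippet (just a sketch):

from transformers import TextClassificationPipeline

pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer, return_all_scores=True)
# pass the question as `text` and the passage as `text_pair`
preds = pipe({"text": "do iran and afghanistan speak the same language", "text_pair": text})
print(preds) # scores for both labels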

What is not clear to me is that when I ask the opposite question on the same text (e.g. "do iran and afghanistan speak a different language"), I would have expected the resulting probabilities to flip as well. What am I doing wrong?

I think your example might be a bit tricky because the answer is ambiguous based on the text. If you use negation like “do iran and afghanistan not speak the same language”, then you get a bigger weighting on the “no” class.

Speaking of your model, it is not clear to me how it works. I quickly read the paper you pointed me to, and from it I see that the inputs for this task are a passage and a question, but I don’t see them in your model’s deployment on the Hub. Moreover, what do LABEL_0 and LABEL_1 represent?

Yes, this seems to be a limitation of the Hub inference API, which does not seem to be able to handle NLI-style (text pair) tasks as far as I can tell. So right now it treats the (question, passage) inputs as independent examples that it assigns a yes/no label to. LABEL_0 and LABEL_1 are the default label names used by the TextClassificationPipeline, so I really should have specified the mapping no → 0 and yes → 1 in the model config (see the sketch below). I’ll update the models when I get a bit of free time :slight_smile:
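
For reference, baking the label names into the config at fine-tuning time looks roughly like this (a sketch, assuming no → 0 and yes → 1):

from transformers import AutoConfig, AutoModelForSequenceClassification

config = AutoConfig.from_pretrained(
    model_ckpt,
    id2label={0: "no", 1: "yes"},
    label2id={"no": 0, "yes": 1},
)
model = AutoModelForSequenceClassification.from_pretrained(model_ckpt, config=config)

With that in place, the pipeline (and the Hub widget) will show “yes”/“no” instead of the generic labels.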

Hi @lewtun, I don’t understand why you load the ‘super_glue’ dataset along with ‘boolq’.

Hi @lewtun,

Thanks for the interesting discussion and answers.
I have a similar question here:

How can I evaluate an existing model trained on the boolq dataset, WITHOUT retraining it?
I’m trying HF’s “evaluate” package with the question-answering evaluator, but I’m getting some errors.

Here’s my main code:

from transformers import AutoModelWithHeads
import torch

from datasets import load_dataset
import evaluate
from evaluate import evaluator

model = AutoModelWithHeads.from_pretrained("roberta-base")
adapter_name = model.load_adapter("AdapterHub/roberta-base-pf-boolq", source="hf")
model.active_adapters = adapter_name

from transformers import RobertaTokenizer, RobertaModel
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

eval = evaluator("question-answering")
results = eval.compute(model_or_pipeline=model, data="boolq", metric="accuracy",
                       question_column="question", context_column="passage",
                       id_column=None, label_column="answer")

It gave me this error:

/opt/anaconda3/envs/hugging-face/lib/python3.7/site-packages/evaluate/evaluator/question_answering.py in compute(self, model_or_pipeline, data, subset, split, metric, tokenizer, strategy, confidence_level, n_resamples, device, random_state, question_column, context_column, id_column, label_column, squad_v2_format)
    189             context_column=context_column,
    190             id_column=id_column,
--> 191             label_column=label_column,
    192         )
    193 

/opt/anaconda3/envs/hugging-face/lib/python3.7/site-packages/evaluate/evaluator/question_answering.py in prepare_data(self, data, question_column, context_column, id_column, label_column)
    104                 "context_column": context_column,
    105                 "id_column": id_column,
--> 106                 "label_column": label_column,
    107             },
    108         )

/opt/anaconda3/envs/hugging-face/lib/python3.7/site-packages/evaluate/evaluator/base.py in check_required_columns(self, data, columns_names)
    301             if column_name not in data.column_names:
    302                 raise ValueError(
--> 303                     f"Invalid `{input_name}` {column_name} specified. The dataset contains the following columns: {data.column_names}."
    304                 )
    305 

ValueError: Invalid `id_column` None specified. The dataset contains the following columns: ['question', 'answer', 'passage'].

I’m not sure I’m going in the right direction for evaluating a boolq model.
Please advise. Thanks a lot!