How can I evaluate my fine-tuned model on SQuAD?


I have loaded the already fine-tuned SQuAD model 'twmkn9/bert-base-uncased-squad2'.
I would now like to evaluate it on the SQuAD2 dataset. How would I do that?
This is my code currently:

from transformers import AutoTokenizer, AutoModelForQuestionAnswering, AutoConfig

model_name = 'twmkn9/bert-base-uncased-squad2'

config = AutoConfig.from_pretrained(model_name, num_hidden_layers=10)
tokenizer = AutoTokenizer.from_pretrained(model_name)  
model = AutoModelForQuestionAnswering.from_config(config)

Now I am unsure what to do next.


I am trying to follow the instructions from here, but I am unsure how to use my own model. This is what I have:

# Grab the script

!curl -L -O

!curl -L -O

!curl -L -O 

!python  \
    --model_type bert   \
    --model_name_or_path model  \
    --output_dir models/distilbert/twmkn9_distilbert-base-uncased-squad2 \
    --data_dir data/squad   \
    --predict_file dev-v2.0.json   \
    --do_eval   \
    --version_2_with_negative \
    --do_lower_case  \
    --per_gpu_eval_batch_size 12   \
    --max_seq_length 384   \
    --doc_stride 128

But I am getting an error:

2021-05-04 11:14:04.086537: I tensorflow/stream_executor/platform/default/] Successfully opened dynamic library
Traceback (most recent call last):
  File "", line 613, in <module>
  File "", line 208, in main
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "/usr/local/lib/python3.7/dist-packages/transformers/", line 187, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 20, in __init__
  File "", line 184, in __post_init__
    raise ValueError("Need either a dataset name or a training/validation file/test_file.")
ValueError: Need either a dataset name or a training/validation file/test_file.

I imagine this is because --model_name_or_path model is no good, but then how do I point the script at my own configured model?

Thank you

The error comes before your model is even loaded (you will get another one for the model afterwards :wink:). First, replace --predict_file with --test_file in your command. Then, if your model is saved in models/distilbert/twmkn9_distilbert-base-uncased-squad2, that is the path you should pass as --model_name_or_path (the --model_type argument is unnecessary; the type is inferred from the model files).

Thank you!
Will passing models/distilbert/twmkn9_distilbert-base-uncased-squad2 load the model that I configured? I haven’t saved it anywhere; I only created it in my notebook (as in my code above).

Then you need to pass twmkn9/bert-base-uncased-squad2

Cool! Just to gain an understanding, how does this work? That is, how does this command know to load my configured model (which had layers cut off)?


This seemed to call the original model, not my configured one.

This is what I’m getting at initialisation:

[INFO|] 2021-05-04 13:19:23,488 >> Model config BertConfig {
  "architectures": [
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.6.0.dev0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522

But the “num_hidden_layers” should be 10 according to my configuration

I’m really confused. How do you want to call your modified model if you did not save it anywhere?

Fair enough, that’s what I was unsure about.
Would saving it in Colab be enough? I wouldn’t want to upload these to Hugging Face…

Alternatively, I was thinking there might be a way to change the model’s configuration from within the script.
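For reference, one way to make a notebook-only model visible to the script is to write it to a local directory with save_pretrained and pass that directory as --model_name_or_path. The sketch below is only an illustration, not the exact setup from this thread: it builds a tiny, randomly initialised BERT (an assumption, so that it runs without downloading anything) and uses a hypothetical directory name, my_10_layer_model.

```python
import os
import tempfile

from transformers import AutoModelForQuestionAnswering, BertConfig, BertForQuestionAnswering

# Tiny stand-in config (assumption) so the sketch runs offline; the real model
# in the thread is 'twmkn9/bert-base-uncased-squad2'.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=10,
    num_attention_heads=2,
    intermediate_size=64,
    max_position_embeddings=64,
)
model = BertForQuestionAnswering(config)

# Hypothetical local directory; in Colab this could be any writable path.
save_dir = os.path.join(tempfile.mkdtemp(), "my_10_layer_model")
model.save_pretrained(save_dir)  # writes config.json plus the weights

# The directory can now be loaded like any model name.
reloaded = AutoModelForQuestionAnswering.from_pretrained(save_dir)
saved_files = sorted(os.listdir(save_dir))
print(saved_files)
print(reloaded.config.num_hidden_layers)
```

In a real run the tokenizer would most likely need to be saved to the same directory as well (tokenizer.save_pretrained(save_dir)), since the script loads the tokenizer from the model path unless told otherwise.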

@sgugger thank you so much, I got it all working… Now I have an issue with training time, though. I froze all layers except the head and am trying to train using the script. It’s working, but it gives me an estimated time of 8 hours! Is that normal?

(To do this, I configured my model to have 10 layers, added

for param in model.base_model.parameters():
    param.requires_grad = False

to freeze the layers, and then saved the model. I then loaded this model in the script.)

No, you need to add the freezing code to the script itself. Reloading the model discards the freezing, because requires_grad is not saved with the weights.
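A small sketch of why freezing does not survive a save and reload: requires_grad is a property of the in-memory parameters, not of the saved checkpoint. The tiny config and temporary directory below are assumptions made so the example runs offline.

```python
import tempfile

from transformers import BertConfig, BertForQuestionAnswering

# Tiny stand-in model (assumption), randomly initialised, no download needed.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    max_position_embeddings=64,
)
model = BertForQuestionAnswering(config)

# Freeze the encoder, exactly as in the snippet quoted above.
for param in model.base_model.parameters():
    param.requires_grad = False
frozen_before = sum(not p.requires_grad for p in model.parameters())

# Save and reload: only the weights and the config are written to disk.
save_dir = tempfile.mkdtemp()
model.save_pretrained(save_dir)
reloaded = BertForQuestionAnswering.from_pretrained(save_dir)
frozen_after = sum(not p.requires_grad for p in reloaded.parameters())

print(frozen_before, frozen_after)
```

After the reload no parameter is frozen any more, which is why the loop has to run inside the training script, after the model has been loaded there.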

Wow! What can’t this script do?

Sorry to bother you, and thank you for being so helpful, but how would I do that?
I looked through the script and couldn’t find a requires_grad parameter to change.

No, you need to add the lines of code inside the example script. It’s not baked inside.

Hi! So I added those lines in the file just before the trainer is called.
I trained (which took 1.5 hours) but then got terrible F1 results (5.9) :sob:
I hypothesised that using 10 layers instead of 12 would give a worse result, but not this much worse.
Is it possible that even the qa_head was frozen and nothing was training at all?
Was wondering if you have any ideas…
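One way to check whether the QA head was frozen too is to list the parameters that still have requires_grad set after the freezing loop. This is a sketch on a tiny, randomly initialised stand-in model (an assumption, so that it runs offline); the same check could be placed in the script right after the freezing lines.

```python
from transformers import BertConfig, BertForQuestionAnswering

# Tiny stand-in model (assumption); the parameter names follow the BERT QA class.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    max_position_embeddings=64,
)
model = BertForQuestionAnswering(config)

# Freeze everything under base_model (the encoder), as in the thread.
for param in model.base_model.parameters():
    param.requires_grad = False

# Collect whatever is still trainable and count the trainable weights.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(trainable)
print(n_trainable)
```

If the head's parameters (qa_outputs.*) show up in the list, the head itself is still being trained and the poor score has to come from somewhere else.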

By reducing the number of layers you are no longer using a pretrained model, so why would you freeze parameters? Freezing only makes sense when fine-tuning a pretrained model, not when training from scratch.
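To illustrate the point about no longer having a pretrained model: from_config builds the architecture with freshly initialised weights, whereas from_pretrained with a shortened config loads the checkpoint weights for the layers that remain and skips the rest. The sketch below assumes a tiny 4-layer stand-in checkpoint saved to a temporary directory, so it runs offline; it is not the model from this thread.

```python
import tempfile

import torch
from transformers import (
    AutoConfig,
    AutoModelForQuestionAnswering,
    BertConfig,
    BertForQuestionAnswering,
)

# Stand-in "pretrained" checkpoint with 4 layers (assumption).
full = BertForQuestionAnswering(
    BertConfig(
        vocab_size=100,
        hidden_size=32,
        num_hidden_layers=4,
        num_attention_heads=2,
        intermediate_size=64,
        max_position_embeddings=64,
    )
)
ckpt_dir = tempfile.mkdtemp()
full.save_pretrained(ckpt_dir)

# Same pattern as in the question: a config with fewer layers.
config = AutoConfig.from_pretrained(ckpt_dir, num_hidden_layers=2)

# from_config: architecture only, weights are randomly initialised.
from_config = AutoModelForQuestionAnswering.from_config(config)

# from_pretrained with that config: weights of the kept layers come from the checkpoint.
truncated = AutoModelForQuestionAnswering.from_pretrained(ckpt_dir, config=config)

key = "bert.encoder.layer.0.attention.self.query.weight"
reference = full.state_dict()[key]
same_from_config = torch.equal(from_config.state_dict()[key], reference)
same_truncated = torch.equal(truncated.state_dict()[key], reference)

print(same_from_config, same_truncated)
```

With the second variant the remaining layers and the QA head keep their trained values, so freezing the encoder and fine-tuning only the head becomes meaningful again.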

Hi, basically I would like to see what BERT learns across the different layers (how much it already has figured out by the nth layer).
So I was thinking of taking a model fine-tuned for QA and cutting, let’s say, the last 2 layers. I was assuming that the remaining layers keep their old weights.
It would then have 10 layers plus the qa_head. I would freeze all the layers and fine-tune so that the final layer ‘connects’ to the qa_head, which lets me probe how much that layer knows.
Does that make sense for such an experiment?