Which of BigScience’s T0, BLOOM, and BLOOMZ Is Best at Zero-Shot/Few-Shot Question Answering in English?

I am interested in using an LLM like BLOOM for English-only question answering tasks in a zero-shot or few-shot learning setting. However, I have noticed in the paper “BLOOM: A 176B-Parameter Open-Access Multilingual Language Model” that BLOOMZ outperformed BLOOM at zero-shot task generalization (on Natural Language Inference, Coreference Resolution, and Sentence Completion benchmarks), and that T0 outperformed BLOOM on the majority of the SuperGLUE (general language understanding) tasks in both the 0-shot and 1-shot settings.
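For concreteness, here is roughly the kind of setup I have in mind, as a minimal zero-shot sketch using the Hugging Face `transformers` pipeline API (the model choice, prompt wording, and generation parameters below are placeholders, not recommendations):

```python
# Minimal zero-shot QA sketch: one prompt, no support examples, greedy
# decoding. The model name and generation parameters are illustrative only.
from transformers import pipeline

# BLOOM and BLOOMZ are decoder-only models, so they use "text-generation";
# T0 is an encoder-decoder (T5-based) model and would use the
# "text2text-generation" pipeline instead.
generator = pipeline("text-generation", model="bigscience/bloom")

prompt = (
    "Answer the following question.\n"
    "Question: What is the capital of France?\n"
    "Answer:"
)

output = generator(prompt, max_new_tokens=20, do_sample=False)
print(output[0]["generated_text"])
```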

I have 3 questions:

  1. I’m not too familiar with the benchmarks these models were tested on, but none of them seem to be specifically about question answering. Is it fair to say that, at this point, it is unknown which of these three models would perform best on English question answering tasks? Or is one of them likely to outperform the others? For example, even though T0 and BLOOMZ are not fine-tuned specifically for question answering, does being fine-tuned on a variety of tasks still make them more likely to generalize to, and perform better on, question answering than BLOOM? If so, would T0 or BLOOMZ likely be better, and why?

  2. I am unclear on why fine-tuned models like T0 and BLOOMZ generally outperform their non-fine-tuned LLM counterparts. My understanding is that BLOOM, like GPT-3, relies on in-context learning/meta-learning (see the prompt-construction sketch after this list), and is therefore, probably in part because of the huge number of parameters and training tokens, well suited to a wide variety of tasks. If so, why would multitask fine-tuning further increase its ability to generalize to an even larger variety of tasks?

  3. Does the fact that BLOOM was trained on many more languages than LLMs like GPT-3 mean that, if BLOOM’s task descriptions, support-set examples, and prompts are all in English and its responses are in English, it would likely not perform as well as GPT-3? For example, only 30.04% of the tokens BLOOM was trained on were English. Naively scaling BLOOM’s 176 billion parameters by that share would suggest only about 52.9 billion parameters’ worth of capacity is “devoted” to English, though I realize parameters are shared across languages, so the more meaningful comparison is probably how much English training data each model actually saw (see the token-count sketch after this list). In contrast, GPT-3 is a 175-billion-parameter model, and the vast majority of its training tokens were English.
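To make question 2 concrete, this is what I mean by in-context learning: the “support set” lives entirely in the prompt and no weights are updated. A small self-contained sketch (the example questions and answers are made up for illustration):

```python
# Few-shot in-context learning sketch (re: question 2): the support set
# is packed into the prompt itself; the model's weights are never updated.
support_set = [
    ("Who wrote Hamlet?", "William Shakespeare"),
    ("What is the chemical symbol for gold?", "Au"),
]
query = "Which planet is closest to the Sun?"

# Concatenate the worked examples, then append the unanswered query.
prompt = "".join(f"Question: {q}\nAnswer: {a}\n\n" for q, a in support_set)
prompt += f"Question: {query}\nAnswer:"
print(prompt)
```

My question is essentially why multitask fine-tuning (as in T0 and BLOOMZ) helps beyond what this kind of prompting already provides.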
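And to make question 3 concrete, here is the back-of-the-envelope comparison restated in terms of training tokens rather than parameters. The token totals are assumptions on my part, taken from the figures I have seen cited for each paper (roughly 366B pretraining tokens for BLOOM, and roughly 300B, predominantly English, for GPT-3):

```python
# Rough comparison of English data exposure (re: question 3). The token
# totals are assumed from the papers' reported figures, not exact.
bloom_total_tokens = 366e9      # BLOOM pretraining tokens (assumed)
bloom_english_share = 0.3004    # English share of BLOOM's training data
gpt3_total_tokens = 300e9       # GPT-3 pretraining tokens, mostly English

bloom_english_tokens = bloom_total_tokens * bloom_english_share
print(f"BLOOM English tokens: ~{bloom_english_tokens / 1e9:.0f}B")  # ~110B
print(f"GPT-3 tokens (mostly English): ~{gpt3_total_tokens / 1e9:.0f}B")
```

In other words, BLOOM saw roughly a third as much English text as GPT-3, which is the gap I am asking about.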