Continuing the discussion from ACL 2020 highlights – Joe:
This kind of question fascinates me. If intermediate training on Task A lets you train target Task B more successfully, or if A and B as target tasks are affected in similar ways by each of several intermediate tasks, I’d strongly suspect that some of the same information is relevant to both A and B. The link between their respective successes (or failures) would then be the later layers of the encoder learning (or not learning) to fish that information out of the many other combinations of input features in the middle layers of the model.
I see it as complementary to probing experiments. When you determine that a word’s encoding predicts some linguistic or psycholinguistic property – its lexical semantics, its position in a parse tree, its reading time, the probability that a human reader will notice that its agreement morphology is wrong – you’re giving an exact description of one kind of information that can be found in a text when you know its language. Transfer learning experiments are (at least initially) working with far more opaque descriptions: “information sufficient to (mostly) recreate human-like readings of anaphors, whatever that might be.” But I’m fascinated by the potential for the two approaches to meet in the middle: using the probes as tasks that (theoretically) isolate one particular kind of knowledge, dissecting the “whatever that might be,” and finding that humanlike anaphora resolution depends heavily on information of kind X, lightly on kind Y, moderately on kind Z, with a residue we haven’t explained yet – though we can see which other tasks that residue is relevant to and take a guess.
A Transformer is, of course, not “wetware in disguise.” Not even structurally, let alone experientially. Finding the particular information that lets it imitate humans in some task is no guarantee that humans rely on the same information. If you want to uncover the cognitive particulars – how we do the task at the algorithmic level – BERT won’t tell you. But it can show us the shadow that the human algorithm casts onto the computational level, educating our guesses and helping us prioritize our hypotheses. We’ll still have to figure out whether X helps to predict human performance because we use X, or because X reflects a quirk of our processing that also affects our task performance, or something else entirely. But studying how this “hyperintelligent octopus” of ours gets around the atoll could at least indicate some of the currents that we too swim in.
(Sincere apologies to Bender and Koller for abusing their metaphor.)
On techniques for studying transfer learning, I’ve had some discussions lately about the possibility of adversarial/amnesic intermediate tasks – using the training process to burn certain information out of the representations. The hard part is making sure that that actually happens, as opposed to just building a defiantly contrary task head, or a clueless one, or making the encoder all-around worse by flattening the representations. I have a bit of discussion about some of that in a feature request over on GitHub, and if you’ve read this far you’ll probably have some good ideas about it, so consider yourself invited! It’s at
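For concreteness: one standard way to get the “burning out” behavior, rather than a defiantly contrary head, is a gradient-reversal layer in the style of Ganin & Lempitsky’s domain-adversarial training, placed between the encoder and the probe head. The head is trained normally, so it stays as good as possible at extracting the unwanted information, while the encoder receives the negated gradient and so learns to remove it. A minimal framework-agnostic sketch (class and parameter names are illustrative, not from any particular library):

```python
class GradientReversal:
    """Identity on the forward pass; flips (and scales) the gradient on backward.

    Placed between encoder and probe head, this trains the head to *extract*
    some information while training the encoder to *remove* it.
    """

    def __init__(self, lam=1.0):
        # lam scales the reversed gradient; often ramped up over training.
        self.lam = lam

    def forward(self, x):
        # Representations pass through unchanged to the probe head.
        return x

    def backward(self, grad_output):
        # The probe head upstream is updated with the ordinary gradient of its
        # loss. The encoder downstream of this layer sees the *negated* gradient,
        # so a step that helps the head hurts the encoder's ability to encode
        # the probed information.
        return -self.lam * grad_output


# Illustration: the encoder-side gradient is the head's gradient, negated and scaled.
grl = GradientReversal(lam=0.5)
print(grl.forward(3.0))   # representation unchanged: 3.0
print(grl.backward(2.0))  # encoder receives -0.5 * 2.0 = -1.0
```

The λ coefficient is usually warmed up from zero so the encoder isn’t destabilized early in training; that scheduling, and checking that the representations haven’t simply been flattened across the board, is exactly the part that needs care.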