Transfer learning to explore tasks' information requirements?

eritain · February 5, 2021, 12:19am

Continuing the discussion from ACL 2020 highlights – Joe:

This kind of question fascinates me. If intermediate training on Task A allows you to train target Task B more successfully, or if A and B as target tasks are affected in similar ways by each of several intermediate tasks, I’d strongly suspect that some of the same information is relevant to both A and B, and that the link between their respective successes (or failures) is the later layers of the encoder learning (or not learning) to fish that information out of the many other combinations of input features in the middle layers of the model.

I see it as complementary to probing experiments. When you determine that a word’s encoding predicts some linguistic or psycholinguistic object – its lexical semantics, its position in a parse tree, its reading time, the probability that a human reader will notice that its agreement morphology is wrong – you’re giving an exact description of one kind of information that can be found in a text when you know its language. Transfer learning experiments are (at least initially) working with far more opaque descriptions: “Information sufficient to (mostly) recreate human-like readings of anaphors, whatever that might be.” But I’m fascinated by the potential for the two approaches to meet in the middle: Using the probes as tasks that (theoretically) isolate one particular kind of knowledge, to dissect the “whatever that might be” and find that humanlike anaphora resolution depends heavily on X kind of information, lightly on Y kind, moderately on Z, and there’s this residue we haven’t explained yet, but we can see what other tasks it’s relevant to and take a guess.

A Transformer is, of course, not “wetware in disguise.” Not even structurally, let alone experientially. Finding the particular information that lets it imitate humans in some task is no guarantee that humans rely on the same information. If you want to uncover the cognitive particulars, how we do the task on an algorithmic level, BERT won’t tell you. But it can show us the shadow that the human algorithm casts onto the computational level, educate our guesses, help us prioritize our hypotheses. We’ll have to figure out whether X helps to predict human performance because we use X, or because X reflects a quirk of our processing that also affects our task performance, or what. But studying how this “hyperintelligent octopus” of ours gets around the atoll could at least indicate some of the currents that we too swim in.

(Sincere apologies to Bender and Koller for abusing their metaphor.)

On the techniques for studying transfer learning, I’ve had some discussions lately about the possibility of adversarial/amnesic intermediate tasks: using the training process to burn certain information out of the representations. The hard part is making sure that that actually happens, as opposed to just building a defiantly contrary task head, or a clueless one, or making the encoder all-around worse by flattening the representations. I have a bit of discussion about some of that in a feature request over on GitHub, and if you’ve read this far you’ll probably have some good ideas about it, so consider yourself invited! It’s at

github.com/huggingface/transformers

Adversarial/amnesic heads (feature request opened by eritain, 02:34AM - 04 Feb 21 UTC)

# 🚀 Feature request

Task heads that backpropagate deliberately reversed gradients to the encoder. A flag requesting this behavior when constructing a task head.

## Motivation

Transfer learning experiments lend themselves to questions about the extent to which two tasks rely on the same information about a word/sentence, and to experiments probing whether and how word encodings contain/correspond to syntax trees, lemmas, frequencies, and other objects of linguistic/psycholinguistic study.

A difficulty is that a pretrained model, without fine-tuning, may already encode certain information too thoroughly and accessibly for intermediate training to make much of a difference. For example, BERT's masked language modeling objective produces word encodings in which syntax information is readily accessible. Intermediate training on a syntax task requires training a task head to extract this information, of course, but it will result in very little reorganization of the encoder itself.

Adversarial training, such as the amnesic probing of Elazar et al. 2020, can avoid this pitfall. Intermediate training can aim to burn particular information *out* of the encodings, and measure how much this impairs trainability of the target task.

Strictly reversing the sense of the training data won't do it though; getting all the answers exactly wrong requires just as much domain knowledge as getting them all right does. And randomizing the labels on training data may just result in a feckless task head, one that discards useful information passed to it from the encoder, rather than affecting the encoder itself. Ideally, then, the task head would be trained toward correctly reproducing gold-standard labels, but would flip all its gradients before backpropagating them to the shared encoder, thus training it not to produce precisely the signals that the task head found most informative.

The following work by Cory Shain illustrates flipping gradients in this way (although it's not applied to shared-encoder transfer learning, but rather to development of encoders that disentangle semantics from syntax).

https://docs.google.com/presentation/d/1E89yZ8jXXeSARDLmlksOCJo83QZdNbd7phBrR_dRogg/edit#slide=id.g79452223cd_0_19
https://github.com/coryshain/synsemnet

## Your contribution

I am deeply unfamiliar with pytorch, unfortunately, and utterly ignorant of tensorflow. I can't offer much.
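For anyone who wants something concrete to poke at, the gradient flipping described in the feature request can be sketched in PyTorch as a custom autograd function, a pattern usually called a gradient reversal layer. This is only a minimal sketch under assumptions: `GradReverse`, `AdversarialHead`, and the `scale` parameter are hypothetical names for illustration, the encoder is a toy linear layer standing in for a real shared encoder, and nothing here is an existing flag or class in transformers.

```python
import torch
from torch import nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates and scales gradients in the backward pass."""

    @staticmethod
    def forward(ctx, x, scale):
        ctx.scale = scale
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # One return value per forward input; `scale` is a plain float, so it gets None.
        return -ctx.scale * grad_output, None


class AdversarialHead(nn.Module):
    """Hypothetical task head: fits the gold labels itself, but sends flipped gradients upstream."""

    def __init__(self, hidden_size, num_labels, scale=1.0):
        super().__init__()
        self.scale = scale
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, encoded):
        # The classifier sees the encodings unchanged; only the backward signal is reversed.
        return self.classifier(GradReverse.apply(encoded, self.scale))


torch.manual_seed(0)
encoder = nn.Linear(8, 4)            # toy stand-in for a shared encoder (assumption)
head = AdversarialHead(hidden_size=4, num_labels=3)
inputs = torch.randn(5, 8)           # random placeholder batch, not real data
labels = torch.randint(0, 3, (5,))

loss = nn.functional.cross_entropy(head(encoder(inputs)), labels)
loss.backward()
# head.classifier now holds ordinary gradients (it learns the task), while
# encoder holds the negated gradients (it is pushed away from what the head found useful).
```

With an ordinary optimizer step over both modules, the head keeps getting better at reading the labels off the encodings while the encoder is updated to make that harder. Whether this actually removes the information, rather than merely hiding it from this one head, is exactly the open question raised above; `scale` is one knob for how hard the encoder is pushed.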
