Some clarification on Conv1D

As far as I can tell, Conv1D (from transformers.pytorch_utils, used in the GPT-2 code a lot, for example) is just a linear layer. For example:

from transformers.pytorch_utils import Conv1D
import torch
import torch.nn.functional as F

in_features = 4
out_features = 8
batch_shape = (1, 12, 64)

conv = Conv1D(out_features, in_features)
W = torch.rand(conv.weight.shape, dtype=torch.float32)
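b = torch.rand(conv.bias.shape, dtype=torch.float32)
# load the random parameters into the layer so both code paths use the same W and b
conv.weight.data = W
conv.bias.data = b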

x = torch.rand(*batch_shape, 5, in_features, dtype=torch.float32)

y_torch = F.linear(x, W.T, b)
y_hf = conv(x)

assert torch.allclose(y_hf, y_torch)

This script runs without errors. In other words, Conv1D gives the same output as PyTorch’s F.linear; the only difference is that the weight matrix is stored transposed, with shape (in_features, out_features) rather than (out_features, in_features). The docstring for Conv1D does say exactly this:

1D-convolutional layer as defined by Radford et al. for OpenAI GPT (and also used in GPT-2).

Basically works like a linear layer but the weights are transposed.
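To make the "weights are transposed" convention concrete, here is a minimal sketch of copying a Conv1D into a plain nn.Linear. The helper name conv1d_to_linear is made up for this post and is not part of transformers; it only assumes that Conv1D stores its weight as (in_features, out_features), which is what the script above relies on.

```python
import torch
from torch import nn
from transformers.pytorch_utils import Conv1D


def conv1d_to_linear(conv: Conv1D) -> nn.Linear:
    """Hypothetical helper: build an nn.Linear computing the same map as `conv`."""
    in_features, out_features = conv.weight.shape  # Conv1D layout: (in, out)
    linear = nn.Linear(in_features, out_features)
    with torch.no_grad():
        linear.weight.copy_(conv.weight.T)  # nn.Linear layout: (out, in)
        linear.bias.copy_(conv.bias)
    return linear


torch.manual_seed(0)
conv = Conv1D(8, 4)  # arguments are (out_features, in_features)
with torch.no_grad():
    conv.bias.uniform_(-1.0, 1.0)  # use a non-zero bias so the check is meaningful

linear = conv1d_to_linear(conv)
x = torch.rand(2, 5, 4)

# Largest element-wise difference between the two layers' outputs
max_diff = (conv(x) - linear(x)).abs().max().item()
```

If the two layers really are the same function, max_diff should be zero up to floating-point noise.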

But still, this just amounts to a linear layer with a different convention for specifying the weights. So I have a few questions:

  1. Why not just use the PyTorch version, but transpose your weight matrix first?

  2. If the answer is that it’s a convenience wrapper, why not just call it LinearTransposed or even _linear or something?

  3. In any case, why call it a “convolutional” anything? There’s no convolution being calculated here, right? I mean, there are no kernels anywhere. PyTorch’s own torch.nn.Conv1d is even a different thing from this one.

  4. What’s with the reference to Radford et al. (presumably this means the GPT-1 paper)? This object isn’t defined in either that paper or the GPT-2 paper as far as I can tell (obviously, a paper from 2018 wouldn’t be defining a linear layer for the first time).
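On question 3, the only link to an actual convolution I can come up with is that a convolution with kernel size 1 is itself a per-position linear map. Below is a small sketch of that, using plain torch only; the shapes and variable names are made up for illustration, and I am not claiming this is how the transformers class is implemented.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
in_features, out_features, seq_len = 4, 8, 5

W = torch.rand(in_features, out_features)  # Conv1D-style layout: (in, out)
b = torch.rand(out_features)
x = torch.rand(2, seq_len, in_features)    # (batch, seq, features)

# Linear view: the same matrix is applied independently at every position.
y_linear = x @ W + b

# Convolution view: F.conv1d wants input (batch, channels, length)
# and weight (out_channels, in_channels, kernel_size), here with kernel_size = 1.
kernel = W.T.unsqueeze(-1).contiguous()    # (out, in, 1)
y_conv = F.conv1d(x.transpose(1, 2), kernel, b).transpose(1, 2)
```

Both y_linear and y_conv have shape (batch, seq, out_features), and they should agree up to floating-point error.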

Maybe old code whose implementation changed but the class never got renamed?