Some clarification on Conv1D

As far as I can tell, Conv1D (from transformers.pytorch_utils, used all over the GPT-2 code, for instance) is just a linear layer. For example:

from transformers.pytorch_utils import Conv1D
import torch
import torch.nn.functional as F

in_features = 4
out_features = 8
batch_shape = (1, 12, 64)

conv = Conv1D(out_features, in_features)  # Conv1D's signature is (nf, nx), i.e. (out, in)
W = torch.rand(conv.weight.shape, dtype=torch.float32)
b = torch.rand(conv.bias.shape, dtype=torch.float32)
conv.weight.data = W
conv.bias.data = b

# Extra leading dimensions are fine: Conv1D flattens everything but the last.
x = torch.rand(*batch_shape, 5, in_features, dtype=torch.float32)

# F.linear computes x @ W.T + b, so passing W.T here gives x @ W + b.
y_torch = F.linear(x, W.T, b)
y_hf = conv(x)

assert torch.allclose(y_hf, y_torch)

This script runs without errors; in other words, Conv1D gives the same output as PyTorch's linear function, just with the weights transposed. Now, the docstring for Conv1D does mention that this is exactly the case…

1D-convolutional layer as defined by Radford et al. for OpenAI GPT (and also used in GPT-2).

Basically works like a linear layer but the weights are transposed.
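For reference, the forward pass really is just a single affine map, with the weight stored as (in, out) instead of nn.Linear's (out, in). Here's a minimal sketch of the idea (paraphrased from memory rather than copied from pytorch_utils, so check the source for the exact version):

import torch
import torch.nn as nn

class Conv1DSketch(nn.Module):
    """A linear layer whose weight is stored as (in_features, out_features)."""

    def __init__(self, nf, nx):  # nf = number of output features, nx = input features
        super().__init__()
        self.nf = nf
        self.weight = nn.Parameter(torch.empty(nx, nf))  # transposed vs. nn.Linear
        self.bias = nn.Parameter(torch.zeros(nf))
        nn.init.normal_(self.weight, std=0.02)

    def forward(self, x):
        size_out = x.size()[:-1] + (self.nf,)
        # addmm(bias, M1, M2) = bias + M1 @ M2: a plain matrix multiply, no kernels.
        x = torch.addmm(self.bias, x.view(-1, x.size(-1)), self.weight)
        return x.view(size_out)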

But still, this just amounts to a linear layer with a different convention for specifying the weights. So I have a few questions:

  1. Why not just use the PyTorch version, but transpose your weight matrix first?

  2. If the answer is that it’s a convenience wrapper, why not just call it LinearTransposed or even _linear or something?

  3. In any case, why call it a “convolutional” anything? There’s no convolution being computed here, right? I mean, there are no kernels anywhere. PyTorch’s own Conv1d is even a different thing from this one (see the sketch after these questions).

  4. What’s with the reference to Radford et al. (presumably this means the GPT-1 paper)? This object isn’t defined in either that paper or the GPT-2 paper as far as I can tell (obviously, a paper from 2018 wouldn’t have been defining a linear layer for the first time).

Maybe old code whose implementation changed but the class never got renamed?
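On question 3, one guess: a position-wise linear layer is mathematically the same operation as a 1D convolution with kernel size 1, which may be where the name comes from (that's a guess, not something confirmed by the GPT code). A quick check against PyTorch's own nn.Conv1d:

import torch
import torch.nn as nn

in_features, out_features = 4, 8
x = torch.rand(2, 10, in_features)  # (batch, seq, features)

linear = nn.Linear(in_features, out_features)

# nn.Conv1d expects (batch, channels, seq) and stores its kernel as
# (out_channels, in_channels, kernel_size); copy the linear weights across.
conv1d = nn.Conv1d(in_features, out_features, kernel_size=1)
conv1d.weight.data = linear.weight.data.unsqueeze(-1)
conv1d.bias.data = linear.bias.data

y_linear = linear(x)
y_conv = conv1d(x.transpose(1, 2)).transpose(1, 2)

assert torch.allclose(y_linear, y_conv, atol=1e-6)

So the two coincide at kernel size 1; the transformers Conv1D just skips the convolution machinery (and the channels-first layout) entirely.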


I’m also intrigued by this Conv1D implementation. Just out of curiosity, I decided to check whether performance differs between the versions provided by @j3m. So I just added nn.Linear to the mix:

from transformers.pytorch_utils import Conv1D
import torch
import torch.nn as nn  # needed for nn.Linear below
import torch.nn.functional as F

in_features = 4
out_features = 8
batch_shape = (1, 12, 64)

conv = Conv1D(out_features, in_features)
W = torch.rand(conv.weight.shape, dtype=torch.float32)
b = torch.rand(conv.bias.shape, dtype=torch.float32)
conv.weight.data = W
conv.bias.data = b

x = torch.rand(*batch_shape, in_features, dtype=torch.float32)

y_torch = F.linear(x, W.T, b)
y_hf = conv(x)

assert torch.allclose(y_hf, y_torch)


linear = nn.Linear(in_features, out_features)
linear.weight.data = W.T
linear.bias.data = b

y_linear = linear(x) 

assert torch.allclose(y_linear, y_torch), "nn.Linear and F.linear outputs differ"

Then I timed the three of them with %%timeit. All tests were done on a single HGX H100 GPU, with Python 3.10 and PyTorch 2.2.2:

First for CPU:

#... same code as in the original question
W = W.T
#--------
%%timeit -n 1000
# Linear transformation using torch.nn.functional.linear
x_random = torch.rand(*batch_shape, 5, in_features, dtype=torch.float32)
y_torch = F.linear(x_random, W, b)
>>> 108 µs ± 661 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

#---------
%%timeit -n 1000
# Convolution transformation using transformers Conv1D
x_random = torch.rand(*batch_shape, 5, in_features, dtype=torch.float32)
y_hf = conv(x_random)
>>> 115 µs ± 1.02 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

#---------
%%timeit -n 1000
# Linear transformation using torch.nn.Linear
x_random = torch.rand(*batch_shape, 5, in_features, dtype=torch.float32)

y_torch = linear(x_random)
>>> 114 µs ± 1.38 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Then for GPU:

#... same code as in the original question

# Ensure the tensors and model are on GPU
x_gpu = x.cuda()
W_gpu = W.T.cuda()
b_gpu = b.cuda()
#---------
%%timeit -n 1000
x_random = torch.rand(*batch_shape, 5, in_features, dtype=torch.float32, device='cuda')
# Linear transformation using torch.nn.functional.linear on GPU
y_torch_gpu = F.linear(x_random, W_gpu, b_gpu)
>>> 22.4 µs ± 956 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

#---------
conv.cuda()
#---------
%%timeit -n 1000 
# Generate random input on each iteration
x_random = torch.rand(*batch_shape, 5, in_features, dtype=torch.float32, device='cuda')
# Convolution transformation using transformers Conv1D on GPU
y_hf_gpu = conv(x_random)
>>> 27.7 µs ± 651 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

#--------
linear.cuda()
#---------
%%timeit -n 1000
# Linear transformation using torch.nn.Linear
x_random = torch.rand(*batch_shape, 5, in_features, dtype=torch.float32, device='cuda')

y_torch = linear(x_random)
>>> 27.8 µs ± 943 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

So: identical performance for Conv1D and nn.Linear, while F.linear is faster, as expected.
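One caveat on the GPU numbers: %%timeit doesn't synchronize CUDA, so for ops this small the timings are likely dominated by launch overhead. For anyone who wants tighter numbers, torch.utils.benchmark handles warmup and synchronization. A sketch, reusing conv, linear, W_gpu and b_gpu from above (all already on the GPU):

import torch
import torch.utils.benchmark as benchmark

x_random = torch.rand(1, 12, 64, 5, in_features, device='cuda')

# Timer synchronizes CUDA around each measurement, unlike %%timeit.
for label, stmt, env in [
    ('F.linear',  'F.linear(x, W, b)', {'F': F, 'x': x_random, 'W': W_gpu, 'b': b_gpu}),
    ('Conv1D',    'conv(x)',           {'conv': conv, 'x': x_random}),
    ('nn.Linear', 'linear(x)',         {'linear': linear, 'x': x_random}),
]:
    print(benchmark.Timer(stmt=stmt, globals=env, label=label).timeit(1000))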

I also checked the GPU RAM used, restarting the kernel each time and taking measurements at the start and end of the %%timeit cell with:

# GPU memory currently allocated by PyTorch tensors, in bytes.
memory_allocated = torch.cuda.memory_allocated()
# Print it in kilobytes (bytes / 1024).
print(memory_allocated / 1024)
# GPU memory currently reserved by PyTorch's caching allocator, in bytes.
memory_reserved = torch.cuda.memory_reserved()
# Print it in kilobytes.
print(memory_reserved / 1024)

And in both the Conv1D and nn.Linear cases the usage was identical down to the byte (allocated/reserved): 32769.0 KB / 34816.0 KB.
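Start/end snapshots can also miss transient allocations inside the loop; torch.cuda tracks the peak too. A small sketch, reusing conv and x_gpu from above:

import torch

torch.cuda.reset_peak_memory_stats()
y = conv(x_gpu)  # or linear(x_gpu), F.linear(...), etc.
torch.cuda.synchronize()
# Peak bytes allocated by tensors since the reset, printed in KB to match above.
print(torch.cuda.max_memory_allocated() / 1024)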

Anyway, I was just curious. I hope this helps someone, and that someone else can provide more details by answering @j3m’s questions.