I have an RTX 4070 Ti Super, and I want to embed about 315k rows of text locally. When I run the code below on my CPU it works fine, but when I set the device to the GPU, I keep getting this CUDA error:

CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)

This happens even though my GPU's VRAM is not fully used (I checked in Task Manager). I tried reinstalling everything and downgrading my GPU driver to the one bundled with CUDA 12.4, but still no luck. Lowering the batch size and the sentence length only lets it run a few more iterations before the error occurs. What am I doing wrong here? Is my VRAM not being released after each iteration or something?
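For reference, one thing I have not tried yet is forcing synchronous kernel launches so the traceback points at the op that actually fails (CUDA errors are otherwise reported asynchronously, at a later call). A minimal sketch:

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before the first CUDA call

The script itself: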
import pickle

import torch
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer
from tqdm import tqdm

inc = 64                # batch size
iteration = 1
matryoshka_dim = 512    # truncate embeddings to this many dimensions

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# device = torch.device("cpu")  # the CPU path works fine

for i in tqdm(range(0, len(rows), inc)):
    end = min(i + inc, len(rows))
    sentences = rows[i:end]
    embeddings = model.encode(sentences, convert_to_tensor=True, device=device)
    # layer-norm, truncate to the Matryoshka dimension, then L2-normalize
    embeddings = F.layer_norm(embeddings, normalized_shape=(embeddings.shape[1],))
    embeddings = embeddings[:, :matryoshka_dim]
    embeddings = F.normalize(embeddings, p=2, dim=1)
    # write each batch to its own pickle in fk_ro_v
    with open("./fk_ro_v/ro_" + str(iteration) + ".pkl", "wb") as f:
        pickle.dump(embeddings, f)
    torch.cuda.empty_cache()
    iteration += 1
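One thing I wondered about: pickle.dump on a CUDA tensor serializes it with its device, so it would also load back onto the GPU later. A variant of the loop tail that copies each batch to the host before serializing (same names as above, not yet verified on my setup):

    embeddings = F.normalize(embeddings, p=2, dim=1)
    embeddings = embeddings.cpu()  # rebinding drops the last reference to the GPU tensor
    with open("./fk_ro_v/ro_" + str(iteration) + ".pkl", "wb") as f:
        pickle.dump(embeddings, f)  # the pickle now holds a CPU tensor
    torch.cuda.empty_cache()        # the freed block can actually be returned here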
[Screenshot: VRAM usage in Task Manager]

Installed versions:
torch 2.5.0.dev20240715+cu124
torchaudio 2.4.0.dev20240715+cu124
torchvision 0.20.0.dev20240715+cu124
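As a sanity check on the install, the CUDA build and device visibility can be queried like this (the comments show what I expect to see, not captured output):

import torch
print(torch.__version__)              # expected: 2.5.0.dev20240715+cu124
print(torch.version.cuda)             # expected: 12.4
print(torch.cuda.is_available())      # expected: True
print(torch.cuda.get_device_name(0))  # expected: the 4070 Ti Super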