Facing Tensor Size issue in Transformer tool

J-JEMINA · September 11, 2023, 6:12pm

I am using PhageRBP detection, which uses the bioembeddings tool. I am trying to create embeddings for the genome of an E. coli phage of size 108485kb, but I get this error immediately. I am working in Google Colab since I am a complete beginner and am only comfortable working there.

The expanded size of the tensor (108485) must match the existing size (40000) at non-singleton dimension 1. Target sizes: [1, 108485]. Tensor sizes: [1, 40000]. This most likely means that you don’t have enough GPU RAM to embed a protein this long.

As I tried to trace the error, I realized it originates from the transformer tool that they use. I tried to see whether ChatGPT could help, but it just says the expand function cannot expand beyond 40000. I am willing to pay for Colab Pro or Pro+ if that would solve this issue. I also have hundreds of genomes of similar size to run. I do not know how to solve this.

Can someone help me with how to resolve this, please?

Any help would be appreciated.

Thank you!

Sandy1857 · September 12, 2023, 2:05pm

Hey there. A newbie here myself, so please forgive any mistakes in my answer.

First of all, I don’t know the model you are using to generate the embeddings or the implementation, but still I hope the following can give some clarity.

Typically, a transformer model can process only a limited sequence length, because the cost of self-attention (the score matrix computed from Q and K has one entry per pair of tokens) grows quadratically with sequence length, and training on long sequences requires a lot of resources.
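To make the quadratic growth concrete, here is a rough back-of-the-envelope sketch. It assumes standard (dense) self-attention and 4-byte float32 values, and it counts only one attention score matrix for one head in one layer. The function name attention_scores_bytes is made up for illustration and the numbers are not measurements of any particular model.

```python
def attention_scores_bytes(seq_len: int, bytes_per_value: int = 4) -> int:
    """Bytes needed for one seq_len x seq_len attention score matrix.

    Assumption: dense self-attention, one head, one layer, float32 values.
    """
    return seq_len * seq_len * bytes_per_value


# 40_000 and 108_485 are the two sizes that appear in the error message.
for n in (512, 1024, 40_000, 108_485):
    gib = attention_scores_bytes(n) / 2**30
    print(f"{n:>7} tokens -> {gib:10.3f} GiB per head per layer")
```

Doubling the sequence length quadruples the size of this matrix, which is why models are usually built with a fixed maximum length.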

That is why a maximum number of tokens, say 512 or 1024, is fixed when the model is pre-trained, so the model can take in only that much. If your sequences are longer, you have no option but to truncate the sequence and process only up to the maximum the model can take in.

I think your issue is not OOM (Out-of-Memory) but sequences that are longer than the model can process. You may pass the truncation=True parameter to your tokenizer (if you are using one) or preprocess the sequence so that it is no longer than the model's maximum length.
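As a rough illustration of the preprocessing idea, the sketch below shortens a sequence at the string level before it is embedded. The limit of 40000 is only an assumption taken from the error message above, and truncate_sequence and split_into_chunks are hypothetical helper names, not functions from bioembeddings or Transformers. Splitting into chunks is shown as one possible alternative to plain truncation; whether it makes sense biologically depends on your data.

```python
from typing import List

MAX_LEN = 40_000  # assumed limit, taken from the error message in this thread


def truncate_sequence(seq: str, max_len: int = MAX_LEN) -> str:
    """Keep only the first max_len characters and drop the rest."""
    return seq[:max_len]


def split_into_chunks(seq: str, max_len: int = MAX_LEN) -> List[str]:
    """Split seq into consecutive pieces of at most max_len characters."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [seq[i:i + max_len] for i in range(0, len(seq), max_len)]


# Placeholder sequence with the same length as in the error message.
sequence = "A" * 108_485
print(len(truncate_sequence(sequence)))
print([len(chunk) for chunk in split_into_chunks(sequence)])
```

Truncation throws away everything past the limit, while chunking keeps all of the sequence but gives you one embedding per piece instead of one for the whole input.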

J-JEMINA · October 12, 2023, 7:23pm

Thank you so much for your help!
I preprocessed the sequence without affecting the integrity and it worked like a charm!
Thank you for the suggestion!

Sandy1857 · October 13, 2023, 7:39am

Happy to help. Please mark as solved.
