I am using PhageRBP detection, which relies on the bioembeddings tool. I am trying to create embeddings for the genome of an E. coli phage that is 108,485 bp long, but I get the error below immediately. I am working in Google Colab because I am a complete beginner and that is the only environment I am comfortable with.

The expanded size of the tensor (108485) must match the existing size (40000) at non-singleton dimension 1. Target sizes: [1, 108485]. Tensor sizes: [1, 40000]. This most likely means that you don’t have enough GPU RAM to embed a protein this long.

When I traced the error, I realized it originates in the transformer library that the tool uses. I tried asking ChatGPT, but it only says that the expand function cannot expand beyond 40000. I am willing to pay for Colab Pro or Pro+ if that would solve the issue. I also have hundreds of genomes of similar size to run. I do not know how to solve this.

Can someone help me with how to resolve this, please?

Any help would be appreciated.

Thank you!

Hey there. A newbie here myself, so please forgive any mistakes in my answer.

First of all, I don’t know the model you are using to generate the embeddings or the implementation, but still I hope the following can give some clarity.

Typically, a transformer model can process only a limited sequence length, because the cost of self-attention (the scores computed from the Q and K matrices for every pair of tokens) grows quadratically with sequence length and requires a lot of memory and compute to train.
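To make the quadratic growth concrete, here is a rough back-of-the-envelope sketch. It assumes 4-byte (float32) values and counts only a single attention score matrix for one head in one layer. The function name is made up for illustration, and real models add many other buffers on top of this.

```python
def attention_scores_bytes(seq_len: int, bytes_per_value: int = 4) -> int:
    """Bytes needed for one seq_len x seq_len attention score matrix.

    Assumption: float32 values, one head, one layer. Real memory use is higher.
    """
    return seq_len * seq_len * bytes_per_value


# Compare typical limits with the lengths from the error message.
for n in (512, 1024, 40_000, 108_485):
    gib = attention_scores_bytes(n) / 2**30
    print(f"{n:>7} tokens -> {gib:.3f} GiB per head per layer")
```

Doubling the sequence length quadruples this number, which is why a bigger GPU alone rarely helps with very long inputs.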

That is why a maximum length, say 512 or 1024 tokens, is fixed when the model is pre-trained, and the model can only take in that much. If your sequences are longer, you have to truncate them and process only up to the maximum the model accepts.

I think your issue is not OOM (out of memory) but that your sequence is longer than the model can process. You can pass truncation=True to your tokenizer (if you are using one), or preprocess the sequence so it is no longer than the model's maximum length.
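If you are not calling a tokenizer yourself, the preprocessing can be done on the raw sequence string before it reaches the embedder. Below is a minimal sketch in plain Python. The helper names are hypothetical, and the 40000 limit is only taken from the error message above, so the real limit of your model may be different. Keep in mind that truncating throws away the tail, and splitting means each piece is embedded independently, so check whether either makes sense for your analysis.

```python
from typing import List

# Assumption: limit taken from the error message; check your model's real maximum.
MAX_LEN = 40_000


def truncate(seq: str, max_len: int = MAX_LEN) -> str:
    """Keep only the first max_len characters of the sequence."""
    return seq[:max_len]


def split_into_chunks(seq: str, max_len: int = MAX_LEN) -> List[str]:
    """Split the sequence into consecutive, non-overlapping pieces of at most max_len."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [seq[i:i + max_len] for i in range(0, len(seq), max_len)]


# Placeholder sequence with the same length as the genome in the question.
genome = "A" * 108_485

short = truncate(genome)
chunks = split_into_chunks(genome)
print(len(short), [len(c) for c in chunks])
```

Each chunk (or the truncated sequence) can then be passed to the embedding step one at a time.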

Thank you so much for your help!

I preprocessed the sequence without affecting the integrity and it worked like a charm!

Thank you for the suggestion!

Happy to help. Please mark as solved.