Parallelize model call for TFBertModel

Hi folks!
I am using a pretrained BERT model (TFBertModel) from transformers to encode several batches of sentences, with varying batch sizes. That is, I need to use BERT to encode a series of inputs, where each input has dimensionality [n_sentences, 512] (512 being the number of tokens per sentence). n_sentences can vary between 2 and 250 across inputs/examples.

This is proving very time-consuming: encoding each input/example takes several seconds, especially for larger values of n_sentences.

Is there an (easy) way to parallelize the model(input) call (where, again, input has dimensionality [n_sentences, 512]) on Google Colab’s TPU (or on GPUs), such that more than one sentence is encoded at once?
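
For concreteness, my current setup looks roughly like this (the checkpoint name and the toy sentences are just placeholders):

```python
from transformers import BertTokenizerFast, TFBertModel

# Placeholder checkpoint and sentences, just to illustrate the setup.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = TFBertModel.from_pretrained("bert-base-uncased")

sentences = ["First sentence of one example.", "Second sentence of one example."]
enc = tokenizer(sentences, padding="max_length", max_length=512,
                truncation=True, return_tensors="tf")

# One forward pass over the whole [n_sentences, 512] input.
outputs = model(enc["input_ids"], attention_mask=enc["attention_mask"])
print(outputs.last_hidden_state.shape)  # (n_sentences, 512, hidden_size)
```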

I’m not sure what you are asking. “Parallelism” is a term that usually refers to spreading models or inputs across different devices (multiple CPUs, GPUs, and/or TPUs). That doesn’t seem to be what you are after. Rather, you seem to be looking for batched processing, where you process multiple sentences at once.

Then again, you say that you use batched inputs, so that model(input) receives inputs of shape [n_sentences, 512]. That means you are already using batched data and effectively “encoding” multiple sentences at once. So again, I’m not sure what you are asking. Could you clarify?

Thank you for your reply, and sorry if I was unclear.
My question is indeed about parallelization.
I am asking whether it is possible to distribute the computation for the multiple sentences within a single batch across different cores or nodes, assuming this can speed things up and/or at least reduce memory requirements.

Basically, the idea is to split the batch into “sub-batches” that can be distributed, and then re-assemble the outputs from each sub-batch. All this without changing the way the input is passed, i.e. without having to manually split the batches in advance or change the output shape.
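
In TensorFlow terms, what I have in mind is something like the rough sketch below with tf.distribute.MirroredStrategy (I have not got this working; the function names and the per-replica batch size are just my own placeholders). The full batch gets sharded across replicas and the per-replica outputs are concatenated back together:

```python
import tensorflow as tf
from transformers import TFBertModel

strategy = tf.distribute.MirroredStrategy()  # one replica per visible GPU
# (on a Colab TPU one would build a tf.distribute.TPUStrategy instead)

with strategy.scope():
    model = TFBertModel.from_pretrained("bert-base-uncased")

@tf.function
def encode_step(input_ids, attention_mask):
    # Each replica runs this on its own slice of the batch.
    return model(input_ids, attention_mask=attention_mask).last_hidden_state

def encode_batch(input_ids, attention_mask, per_replica_batch=8):
    # Wrap the full [n_sentences, 512] batch in a dataset so the strategy
    # can shard it into per-replica sub-batches automatically.
    global_batch = per_replica_batch * strategy.num_replicas_in_sync
    ds = tf.data.Dataset.from_tensor_slices((input_ids, attention_mask)).batch(global_batch)
    dist_ds = strategy.experimental_distribute_dataset(ds)

    chunks = []
    for dist_inputs in dist_ds:
        per_replica = strategy.run(encode_step, args=dist_inputs)
        # Collect the per-device results and stitch them back together.
        chunks.append(tf.concat(strategy.experimental_local_results(per_replica), axis=0))
    return tf.concat(chunks, axis=0)  # same leading n_sentences dimension as the input
```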
Hope it makes more sense.

I see. So you have predefined batches of a set size, and you cannot change that? That’s a bit… odd. I’m sure there is a way to do what you want, but it seems too complex to be worth the time. It would be better not to use predefined batches and instead use a DistributedSampler and DDP, which automatically take care of parallelizing the input.
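
Here is an untested sketch of the kind of setup I mean, on the PyTorch side of transformers (BertModel rather than TFBertModel), since DistributedSampler and DDP are PyTorch constructs; every name, port, and batch size below is just illustrative:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset
from transformers import BertModel

def encode(rank, world_size, input_ids, attention_mask):
    # One process per GPU; each process only sees its own shard of the data.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = BertModel.from_pretrained("bert-base-uncased").cuda(rank)
    model = DDP(model, device_ids=[rank])  # DDP mainly pays off if you also train
    model.eval()

    dataset = TensorDataset(input_ids, attention_mask)
    # DistributedSampler splits the dataset across ranks (it may repeat a few
    # samples so that every rank receives a shard of equal size).
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=False)
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)

    with torch.no_grad():
        for ids, mask in loader:
            out = model(ids.cuda(rank), attention_mask=mask.cuda(rank)).last_hidden_state
            # ... store or gather `out` per rank as needed

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    # Dummy [n_sentences, 512] inputs, just to make the sketch self-contained.
    input_ids = torch.randint(0, 30000, (32, 512))
    attention_mask = torch.ones_like(input_ids)
    mp.spawn(encode, args=(world_size, input_ids, attention_mask), nprocs=world_size)
```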