Text Length FinBert - Serverless Inference Endpoint

Hi guys, I’m trying to send a long text (longer than 512 tokens) to a FinBERT model deployed on a serverless inference endpoint on AWS.
I’m receiving the following error: “The size of tensor a (639) must match the size of tensor b (512) at non-singleton dimension 1”.

I have a list of texts that I would like to classify without splitting them. How can I fix this?

Thank you in advance

The model has a max_sequence_length of 512. You can provide truncation=True as a parameter, e.g.

{
  "inputs": "Hi, I recently bought a device from your company but it is not working as advertised and I would like to get reimbursed!",
  "parameters": {
   "truncation": True
  }
}
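
For reference, this is roughly how that payload can be sent from a Lambda or any other boto3 client. It is a minimal sketch: the endpoint name is a placeholder, and json.dumps takes care of serializing the boolean correctly.

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

payload = {
    "inputs": "Hi, I recently bought a device from your company but it is not working as advertised and I would like to get reimbursed!",
    "parameters": {"truncation": True},
}

response = runtime.invoke_endpoint(
    EndpointName="finbert-serverless-endpoint",  # placeholder endpoint name
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(json.loads(response["Body"].read()))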


Hi @philschmid
I am testing several pre-trained models that I find on the Hub for text classification. Many have a max_length of 512. I deploy them to SageMaker Serverless Endpoints and invoke them from a Lambda.

Months ago you suggested that I use the truncation parameter… Now I was wondering:
if the text is longer, is it truncated? Do I then lose the information in the “excess” part, or is each chunk evaluated and a result produced for the whole document?

Is there a way to define a preprocessing operation to chunk the sentence in order to get a better evaluation?

Something like:

import torch
from transformers import BertTokenizer

# MODEL_ID and text are defined elsewhere
tokenizer = BertTokenizer.from_pretrained(MODEL_ID)
tokens = tokenizer.encode_plus(text, add_special_tokens=False, return_tensors='pt')

# split into chunks of 510 tokens so that [CLS] and [SEP] still fit within 512
input_id_chunks = list(tokens['input_ids'][0].split(510))
mask_chunks = list(tokens['attention_mask'][0].split(510))
chunksize = 512

for i in range(len(input_id_chunks)):
    # add the [CLS] (101) and [SEP] (102) special tokens around each chunk
    input_id_chunks[i] = torch.cat([
        torch.tensor([101]), input_id_chunks[i], torch.tensor([102])
    ])
    mask_chunks[i] = torch.cat([
        torch.tensor([1]), mask_chunks[i], torch.tensor([1])
    ])

    # pad the (last) chunk up to the full chunk size
    pad_len = chunksize - input_id_chunks[i].shape[0]
    if pad_len > 0:
        input_id_chunks[i] = torch.cat([
            input_id_chunks[i], torch.zeros(pad_len, dtype=torch.long)
        ])
        mask_chunks[i] = torch.cat([
            mask_chunks[i], torch.zeros(pad_len, dtype=torch.long)
        ])

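In case it helps, here is a hedged sketch of how the chunks built above could be scored and combined into a single document-level prediction. It assumes the model is loaded locally with AutoModelForSequenceClassification (rather than called through the endpoint) and simply averages the per-chunk probabilities.

# sketch (assumption): score each chunk locally and average the probabilities
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)

input_ids = torch.stack(input_id_chunks).long()       # (num_chunks, 512)
attention_mask = torch.stack(mask_chunks).long()      # (num_chunks, 512)

with torch.no_grad():
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits

probs = torch.nn.functional.softmax(logits, dim=-1)   # per-chunk probabilities
doc_probs = probs.mean(dim=0)                         # simple mean over chunks
predicted_class = model.config.id2label[int(doc_probs.argmax())]
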
Can I also ask where I can find the parameters that can be passed to my inference endpoint as input? Is there a resource that you can link?

Hi @thanksfinance. Yes, if you use the truncation parameter the text will be truncated and you will lose the “excess” part.

However, in text classification this is rarely a problem because the model is often able to determine the class using just the first 512 tokens. Do you see a significant deterioration in your metrics when using the truncation parameter?

If so, you might indeed want to do some preprocessing. I’m not entirely sure what your code does, but this thread discusses a similar issue and potential options for solving this challenge.
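
One option for doing that preprocessing on the endpoint itself (this is my own sketch, not something discussed above) is to deploy the model with a custom code/inference.py and override the handler functions of the SageMaker Hugging Face Inference Toolkit, chunking the text with the tokenizer’s overflow support and averaging the chunk predictions:

# code/inference.py -- hedged sketch of a custom handler; the chunking and
# aggregation strategy are illustrative assumptions, not the toolkit defaults
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

def model_fn(model_dir):
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    model.eval()
    return model, tokenizer

def predict_fn(data, model_and_tokenizer):
    model, tokenizer = model_and_tokenizer
    text = data.pop("inputs")
    # tokenize with overflow so long documents are split into overlapping chunks
    encoded = tokenizer(
        text,
        truncation=True,
        max_length=512,
        stride=50,
        return_overflowing_tokens=True,
        padding="max_length",
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(
            input_ids=encoded["input_ids"],
            attention_mask=encoded["attention_mask"],
        ).logits
    # average the per-chunk probabilities into one document-level prediction
    probs = torch.nn.functional.softmax(logits, dim=-1).mean(dim=0)
    label_id = int(probs.argmax())
    return {"label": model.config.id2label[label_id], "score": float(probs[label_id])}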