Tensor size error when generating embeddings for documents using pre-trained models

Hi there! I am trying to generate document embeddings with pre-trained models from the Transformers library. The input is a document, and the output should be an embedding for that document. However, I get the error below and don’t know how to fix it.

from transformers import pipeline, AutoTokenizer, AutoModel
from transformers import RobertaTokenizer, RobertaModel
import fitz
from openpyxl import load_workbook
import os
from tqdm import tqdm

PRETRAIN_MODEL = 'distilbert-base-cased'
DIR = "dataset"

# Load and process the text
all_files = os.listdir(DIR)
pdf_texts = {}
for filename in all_files:
    if filename.lower().endswith('.pdf'):
        pdf_path = os.path.join(DIR, filename)
        with fitz.open(pdf_path) as doc:
            text_content = ""
            for page in doc:
                text_content += page.get_text()
            text = text_content.split("PUBLIC CONSULTATION")[0]
            project_code = os.path.splitext(filename)[0]
            pdf_texts[project_code] = text 

# Generate embeddings for the documents
tokenizer = AutoTokenizer.from_pretrained(PRETRAIN_MODEL)
model = AutoModel.from_pretrained(PRETRAIN_MODEL)
pipe = pipeline('feature-extraction', model=model, tokenizer=tokenizer)

embeddings = {}
for project_code, text in tqdm(pdf_texts.items(), desc="Generating embeddings", unit="doc"):
    embedding = pipe(text, return_tensors="pt")
    embeddings[project_code] = embedding[0][0].numpy()

The error occurs on the line embedding = pipe(text, return_tensors="pt"). The output is as follows:

Generating embeddings:   0%|          | 0/58 [00:00<?, ?doc/s]Token indices sequence length is longer than the specified maximum sequence length for this model (3619 > 512). Running this sequence through the model will result in indexing errors
Generating embeddings:   0%|          | 0/58 [00:00<?, ?doc/s]
RuntimeError: The size of tensor a (3619) must match the size of tensor b (512) at non-singleton dimension 1

Library version:

- `transformers` version: 4.38.2
- Platform: Windows-10-10.0.19045-SP0
- Python version: 3.12.2
- Huggingface_hub version: 0.21.4
- Safetensors version: 0.4.2
- Accelerate version: not installed
- Accelerate config: not found
- PyTorch version (GPU?): 2.2.1 (True)
- Tensorflow version (GPU?): 2.16.1 (False)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: <Not>
- Using distributed or parallel set-up in script?: <Not>

The input documents: dataset.zip - Google Drive

Thank you!

The tokenized input is longer than the maximum sequence length that the model supports.

Imagine adding the vector (x, y, z) to (a, b, c, d, e): the sizes do not match. Here, 3619 is the number of tokens in your document, while the model only supports 512 positions, so the two tensors cannot be combined. My suspicion is that the “feature-extraction” pipeline you have selected does not truncate the input by default.
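To make the analogy concrete: the error most likely arises where the model adds token embeddings (one row per input token) to position embeddings (at most 512 rows for this model). The following is a minimal NumPy sketch of that shape mismatch, not the actual model code; the hidden size of 8 is an arbitrary stand-in chosen to keep the arrays small.

```python
import numpy as np

SEQ_LEN = 3619        # tokens in the document, taken from the error message
MAX_POSITIONS = 512   # positions the model supports, taken from the error message
HIDDEN = 8            # arbitrary small hidden size, for illustration only

# One vector per input token.
token_embeddings = np.zeros((1, SEQ_LEN, HIDDEN))
# The position table has only MAX_POSITIONS rows, so slicing it to SEQ_LEN
# still yields at most MAX_POSITIONS rows.
position_embeddings = np.zeros((1, MAX_POSITIONS, HIDDEN))[:, :SEQ_LEN]

try:
    token_embeddings + position_embeddings
    mismatch = None
except ValueError as err:
    # NumPy refuses to broadcast 3619 against 512 along axis 1,
    # which mirrors the RuntimeError raised by PyTorch.
    mismatch = str(err)

print(mismatch)

# Cutting the input to the first MAX_POSITIONS tokens makes the shapes agree.
truncated_sum = token_embeddings[:, :MAX_POSITIONS] + position_embeddings
print(truncated_sum.shape)
```

The second half shows why truncating the input to 512 tokens avoids the error, at the cost of ignoring the rest of the document.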

From the docs

from transformers import pipeline

extractor = pipeline(model="google-bert/bert-base-uncased", task="feature-extraction")
result = extractor("This is a simple test.", return_tensors=True)
result.shape  # This is a tensor of shape [1, sequence_length, hidden_dimension] representing the input string.
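Applied to the loop in the question, one option is to let the tokenizer truncate each document to the model's maximum length. The helper below is a hedged sketch: `embed_truncated` is a hypothetical name, and it assumes `pipe` is a `feature-extraction` pipeline that accepts `truncation=True` and, when `return_tensors` is not set, returns a nested list of shape [1, sequence_length, hidden_dimension]. Truncation discards everything after the model's limit, so only the start of each document is embedded.

```python
def embed_truncated(pipe, text):
    """Return the first-token vector for `text`, truncated to the model limit.

    `pipe` is assumed to be a transformers feature-extraction pipeline
    (or any callable with the same call signature and output layout).
    """
    # truncation=True asks the tokenizer to cut the input at the model's
    # maximum length instead of passing every token through.
    features = pipe(text, truncation=True)
    # features[0] is the single input sequence; features[0][0] is the vector
    # of its first token ([CLS] for BERT-style models).
    return features[0][0]
```

With the objects from the question, this would be called as `embeddings[project_code] = embed_truncated(pipe, text)` inside the loop.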

Thanks, @swtb. You are right: the model only supports inputs of up to 512 tokens. My solution is to use other models that support longer inputs.
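If switching models is not an option, another common workaround is to split the token sequence into overlapping windows of at most 512 tokens, embed each window separately, and average the results. The sketch below shows only the windowing and averaging in plain Python. `window_tokens` and `mean_vector` are hypothetical helper names, the stride of 384 is an arbitrary choice, and how each window is embedded is left to the caller. In practice the window size would also need to leave room for any special tokens the tokenizer adds.

```python
def window_tokens(token_ids, max_len=512, stride=384):
    """Split token_ids into overlapping windows of at most max_len items."""
    if not 0 < stride <= max_len:
        raise ValueError("stride must be in the range 1..max_len")
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_len])
        # Stop once the current window reaches the end of the sequence.
        if start + max_len >= len(token_ids):
            break
        start += stride
    return windows


def mean_vector(vectors):
    """Average equal-length vectors element-wise."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


# 3619 placeholder token ids, matching the length in the error message.
windows = window_tokens(list(range(3619)))
print(len(windows), max(len(w) for w in windows))
```

Each window fits within the 512-token limit, and `mean_vector` can then combine the per-window embeddings into a single document vector. Averaging is only one possible pooling choice.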
