Tensor size error when generating embeddings for documents using pre-trained models

Hi there! I am trying to get document embeddings using pre-trained models from the Transformers library. The input is a document and the output is an embedding for that document from a pre-trained model, but I get the error below and don’t know how to fix it.
Code:

from transformers import pipeline, AutoTokenizer, AutoModel
from transformers import RobertaTokenizer, RobertaModel
import fitz  # PyMuPDF
from openpyxl import load_workbook
import os
from tqdm import tqdm

PRETRAIN_MODEL = 'distilbert-base-cased'
DIR = "dataset"

# Load and process the text
all_files = os.listdir(DIR)
pdf_texts = {}
for filename in all_files:
    if filename.lower().endswith('.pdf'):
        pdf_path = os.path.join(DIR, filename)
        with fitz.open(pdf_path) as doc:
            text_content = ""
            for page in doc:
                text_content += page.get_text()
            text = text_content.split("PUBLIC CONSULTATION")[0]
            project_code = os.path.splitext(filename)[0]
            pdf_texts[project_code] = text 

# Generate embeddings for the documents
tokenizer = AutoTokenizer.from_pretrained(PRETRAIN_MODEL)
model = AutoModel.from_pretrained(PRETRAIN_MODEL)
pipe = pipeline('feature-extraction', model=model, tokenizer=tokenizer)

embeddings = {}
for project_code, text in tqdm(pdf_texts.items(), desc="Generating embeddings", unit="doc"):
    embedding = pipe(text, return_tensors="pt")
    embeddings[project_code] = embedding[0][0].numpy()

Error:
The error happens at the line `embedding = pipe(text, return_tensors="pt")`. The output is as follows:

Generating embeddings:   0%|          | 0/58 [00:00<?, ?doc/s]Token indices sequence length is longer than the specified maximum sequence length for this model (3619 > 512). Running this sequence through the model will result in indexing errors
Generating embeddings:   0%|          | 0/58 [00:00<?, ?doc/s]
RuntimeError: The size of tensor a (3619) must match the size of tensor b (512) at non-singleton dimension 1

Library versions:

- `transformers` version: 4.38.2
- Platform: Windows-10-10.0.19045-SP0
- Python version: 3.12.2
- Huggingface_hub version: 0.21.4
- Safetensors version: 0.4.2
- Accelerate version: not installed
- Accelerate config: not found
- PyTorch version (GPU?): 2.2.1 (True)
- Tensorflow version (GPU?): 2.16.1 (False)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No

The input documents: dataset.zip - Google Drive

Thank you!

Your tokenized input is longer than the maximum sequence length the model supports, so its position embeddings can no longer be added to the token embeddings.

Imagine trying to add the vector (x, y, z) to (a, b, c, d, e): the shapes do not match. 3619 looks like a sequence length rather than an embedding size to me, and my suspicion is that it comes from the “feature-extraction” task you have selected for your pipeline.
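
To confirm, you can compare the token count of one of your documents with the tokenizer's limit. A quick check, assuming the distilbert-base-cased checkpoint and the pdf_texts dict from your script:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
text = next(iter(pdf_texts.values()))  # any one of the extracted documents

# Token count of the document vs. the model's maximum sequence length
print(len(tokenizer(text)["input_ids"]), tokenizer.model_max_length)  # e.g. 3619 vs 512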

From the docs


extractor = pipeline(model="google-bert/bert-base-uncased", task="feature-extraction")
result = extractor("This is a simple test.", return_tensors=True)
result.shape  # This is a tensor of shape [1, sequence_length, hidden_dimension] representing the input string.
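
If truncating to the first 512 tokens is acceptable for your documents, a minimal sketch that skips the pipeline and calls the tokenizer and model directly (the mean pooling over tokens is my own choice here, not something your original code does):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
model = AutoModel.from_pretrained("distilbert-base-cased")

def embed(text):
    # Truncate anything beyond the model's 512-token limit
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # [1, seq_len, hidden_dim]
    # Mean-pool over tokens to get one fixed-size vector per document
    return hidden.mean(dim=1).squeeze(0).numpy()

embeddings = {code: embed(text) for code, text in pdf_texts.items()}

Truncation obviously discards everything after the first 512 tokens, so whether that is acceptable depends on your documents.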


Thanks @swtb, you are right: the model only supports a sequence length of 512 tokens. My solution is to use another model that supports longer inputs.
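
For anyone who finds this later, a minimal sketch of that approach; allenai/longformer-base-4096 (a 4096-token model) is picked purely as an example of a longer-context checkpoint, and mean pooling replaces the first-token embedding from my original code as a choice for this sketch:

from transformers import pipeline, AutoTokenizer, AutoModel

PRETRAIN_MODEL = "allenai/longformer-base-4096"  # example long-context model, up to 4096 tokens

tokenizer = AutoTokenizer.from_pretrained(PRETRAIN_MODEL)
model = AutoModel.from_pretrained(PRETRAIN_MODEL)
pipe = pipeline("feature-extraction", model=model, tokenizer=tokenizer)

embeddings = {}
for project_code, text in pdf_texts.items():
    output = pipe(text, return_tensors="pt")  # tensor of shape [1, seq_len, hidden_dim]
    embeddings[project_code] = output[0].mean(dim=0).numpy()  # mean-pool tokens into one vector

Documents longer than the new model's limit would still raise the same error, so it is worth checking the longest token count in the corpus first.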
