Hi there! I am trying to get document embeddings using pre-trained models from the Transformers library: the input is a document, and the output is an embedding for that document from a pre-trained model. But I am getting the error below and don't know how to fix it.
Code:
from transformers import pipeline, AutoTokenizer, AutoModel
import fitz  # PyMuPDF
import os
from tqdm import tqdm

PRETRAIN_MODEL = 'distilbert-base-cased'
DIR = "dataset"

# Load and process the text
all_files = os.listdir(DIR)
pdf_texts = {}
for filename in all_files:
    if filename.lower().endswith('.pdf'):
        pdf_path = os.path.join(DIR, filename)
        with fitz.open(pdf_path) as doc:
            text_content = ""
            for page in doc:
                text_content += page.get_text()
        # Keep only the part before the "PUBLIC CONSULTATION" section
        text = text_content.split("PUBLIC CONSULTATION")[0]
        project_code = os.path.splitext(filename)[0]
        pdf_texts[project_code] = text

# Generate embeddings for the documents
tokenizer = AutoTokenizer.from_pretrained(PRETRAIN_MODEL)
model = AutoModel.from_pretrained(PRETRAIN_MODEL)
pipe = pipeline('feature-extraction', model=model, tokenizer=tokenizer)
embeddings = {}
for project_code, text in tqdm(pdf_texts.items(), desc="Generating embeddings", unit="doc"):
    embedding = pipe(text, return_tensors="pt")
    embeddings[project_code] = embedding[0][0].numpy()
Error:
The error occurs on the line embedding = pipe(text, return_tensors="pt"). The output is as follows:
Generating embeddings: 0%| | 0/58 [00:00<?, ?doc/s]Token indices sequence length is longer than the specified maximum sequence length for this model (3619 > 512). Running this sequence through the model will result in indexing errors
Generating embeddings: 0%| | 0/58 [00:00<?, ?doc/s]
RuntimeError: The size of tensor a (3619) must match the size of tensor b (512) at non-singleton dimension 1
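From the traceback, the documents exceed DistilBERT's 512-token maximum (3619 > 512). The only workarounds I have thought of are truncating at tokenization time (the tokenizer's truncation=True / max_length arguments), or splitting each document into overlapping chunks and averaging the per-chunk embeddings so no text is lost. Here is a toy sketch of the chunking idea — chunk_ids, embed_document, and embed_chunk are my own hypothetical helpers, not Transformers APIs, and I am not sure this is the right approach:

```python
import numpy as np

MAX_LEN = 512  # DistilBERT's maximum sequence length

def chunk_ids(token_ids, max_len=MAX_LEN, stride=50):
    # Split a long token-id list into overlapping windows of at most max_len,
    # advancing by (max_len - stride) so consecutive windows share context.
    step = max_len - stride
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), step)]

def embed_document(token_ids, embed_chunk):
    # Average per-chunk vectors into one document vector. embed_chunk stands
    # in for a real model call on a single chunk, e.g. taking the hidden
    # state at the first token position.
    chunks = chunk_ids(token_ids)
    return np.mean([embed_chunk(c) for c in chunks], axis=0)

# Toy check with a fake "document" the same length as the failing one:
doc = list(range(3619))
windows = chunk_ids(doc)
print(len(windows), max(len(w) for w in windows))  # → 8 512
```

Every window stays within the 512-token limit, so each chunk could be run through the model separately before pooling.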
Library versions:
- `transformers` version: 4.38.2
- Platform: Windows-10-10.0.19045-SP0
- Python version: 3.12.2
- Huggingface_hub version: 0.21.4
- Safetensors version: 0.4.2
- Accelerate version: not installed
- Accelerate config: not found
- PyTorch version (GPU?): 2.2.1 (True)
- Tensorflow version (GPU?): 2.16.1 (False)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
The input documents: dataset.zip - Google Drive
Thank you!