```python
from langchain.document_loaders import PyMuPDFLoader
from langchain_experimental.text_splitter import SemanticChunker
from langchain.chains import load_summarize_chain
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
from langchain_huggingface import HuggingFacePipeline
import uuid
from langchain.schema.document import Document

loader = PyMuPDFLoader('30006389_BLE_Soft copy.pdf')
doc = loader.load()

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
splitter = SemanticChunker(buffer_size=1, breakpoint_threshold_type='percentile', embeddings=embeddings)
doc_splitted = splitter.split_documents(doc)

model_name = "facebook/bart-large-cnn"
tokenizer = AutoTokenizer.from_pretrained(model_name, clean_up_tokenization_spaces=True)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer)

summaries_list = []
for each_Doc in doc_splitted:
    unique_id = str(uuid.uuid4())
    summary = pipe(each_Doc.page_content, truncation=True, max_length=200, no_repeat_ngram_size=5)
    summary_Document = Document(page_content=summary[0]['generated_text'],
                                metadata={"summary_id": unique_id})
    each_Doc.metadata["summary_id"] = unique_id
    summaries_list.append(summary_Document)
```
I get the following error:

```
IndexError: index out of range in self
```

Why am I getting this error, and could you please alter the given code to resolve it?

I tried several fixes, such as:

1) `model.resize_token_embeddings(len(tokenizer))`, as mentioned on the Hugging Face discussion forum
2) `load_summarize_chain()`

Could anyone help me?
Are you missing this import?

```python
from langchain_huggingface import HuggingFacePipeline, HuggingFaceEmbeddings
```
When you get that error, it is often because the data itself was never retrieved and is empty, or because a `[` or `(` is written in the wrong place.
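For instance, a quick sanity check along those lines could look like this (a minimal sketch reusing the variable names from your snippet; the guard itself is hypothetical, not something your code needs verbatim):

```python
# Confirm the loader and the splitter actually produced data before
# summarizing; empty chunks are dropped rather than sent to the model.
assert len(doc) > 0, "PyMuPDFLoader returned no pages"
doc_splitted = [d for d in doc_splitted if d.page_content.strip()]
print(f"{len(doc_splitted)} non-empty chunks to summarize")
```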
P.S.
I found it.
The import is NOT the problem, and all the `(` and `[` are correct. The problem is that I can't use the `pipeline()` function as such.
I see. That would be tricky if that were the cause…
The header part of README.md is also a configuration file, so changing it may change the features available through the Serverless Inference API.
If you download the whole model and use it locally, it is basically unaffected.
For example, this model's header declares:

```yaml
pipeline_tag: summarization
```
Does that mean that if I use the model by importing it from the transformers library, changes could happen to the README.md file? If yes, why, and what should be done in that case?
Actually, what sits between the `---` delimiters of the README.md file is a YAML configuration block, so the behavior of the model may change depending on changes to it.
Everything outside those delimiters, on the other hand, is just an ordinary README.
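To make the structure concrete, a model card's front matter looks roughly like this (a minimal illustration; apart from `pipeline_tag`, the fields are placeholder examples rather than the actual contents of the bart-large-cnn card):

```yaml
---
# everything between the two --- lines is configuration read by the Hub
pipeline_tag: summarization
language: en
license: mit
---
```

Everything after the second `---` is rendered as the ordinary README body.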
But looking at your program again, there should be few cases where this style of calling causes problems…
README.md issues usually arise only when the model is called through the API.
However, since the error message mentions `self`, it is being raised inside some Python class, so I still suspect the model, the functions that drive it, and the options passed to them.
Anyway, in cases like this it helps to try another, similar model to isolate the cause of the problem.
Once the location of the bug is known, possible workarounds (if any exist) will become apparent.
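For example, a minimal isolation sketch could swap in a smaller checkpoint (the model name below is just one plausible choice, and the explicit `summarization` task with `truncation=True` is an assumption to test, not a confirmed fix):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

# Swap in a smaller, similar summarizer to see whether the IndexError
# follows the model or the calling code.
alt_name = "sshleifer/distilbart-cnn-12-6"  # illustrative alternative
alt_tokenizer = AutoTokenizer.from_pretrained(alt_name)
alt_model = AutoModelForSeq2SeqLM.from_pretrained(alt_name)
alt_pipe = pipeline("summarization", model=alt_model, tokenizer=alt_tokenizer)

for each_Doc in doc_splitted:
    # truncation=True clips each chunk to the tokenizer's model_max_length,
    # so an over-long chunk cannot index past the position-embedding table.
    out = alt_pipe(each_Doc.page_content, truncation=True, max_length=200)
    print(out[0]["summary_text"][:80])
```

If the error disappears with the alternate model, the original model or the options passed to it are the likely culprit.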
Sure. Will look into it. Thanks