```python
from langchain.document_loaders import PyMuPDFLoader
from langchain_experimental.text_splitter import SemanticChunker
from langchain.chains import load_summarize_chain
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
from langchain_huggingface import HuggingFacePipeline
import uuid
from langchain.schema.document import Document

loader = PyMuPDFLoader('30006389_BLE_Soft copy.pdf')
doc = loader.load()

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
splitter = SemanticChunker(buffer_size=1, breakpoint_threshold_type='percentile', embeddings=embeddings)
doc_splitted = splitter.split_documents(doc)

model_name = "facebook/bart-large-cnn"
tokenizer = AutoTokenizer.from_pretrained(model_name, clean_up_tokenization_spaces=True)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer)

summaries_list = []
for each_Doc in doc_splitted:
    unique_id = str(uuid.uuid4())
    summary = pipe(each_Doc.page_content, truncation=True, max_length=200, no_repeat_ngram_size=5)
    summary_Document = Document(page_content=summary[0]['generated_text'],
                                metadata={"summary_id": unique_id})
    each_Doc.metadata["summary_id"] = unique_id
    summaries_list.append(summary_Document)
```
I get the following error:

```
IndexError: index out of range in self
```

Why am I getting this error, and could you please alter the given code to resolve it?

I tried several fixes, such as:

1) `model.resize_token_embeddings(len(tokenizer))`, as mentioned on the Hugging Face discussion forum
2) `load_summarize_chain()`

Could anyone help me?
Are you missing this import?

```python
from langchain_huggingface import HuggingFacePipeline, HuggingFaceEmbeddings
```
When you get that error, it is often because the data itself was never retrieved and is empty, or because a `[` or `(` is written in the wrong place.
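For instance, a quick sanity check along those lines could look like this (a minimal sketch reusing the variable names from your snippet; the guard itself is hypothetical, not something your code needs verbatim):

```python
# Confirm the loader and the splitter actually produced data before
# summarizing; empty chunks are dropped rather than sent to the model.
assert len(doc) > 0, "PyMuPDFLoader returned no pages"
doc_splitted = [d for d in doc_splitted if d.page_content.strip()]
print(f"{len(doc_splitted)} non-empty chunks to summarize")
```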
P.S.
I found it.
The import is NOT the problem, and all the `(` and `[` are correct. The problem is that I can't use the `pipeline()` function as such.
I see. That would be tricky if that were the cause…
The header part of README.md is also a configuration file, so changing it may change the features available through the Serverless Inference API.
If you download the whole model and use it locally, it is basically unaffected.
For example, this model's header declares:

```yaml
pipeline_tag: summarization
```
Does that mean that if I use the model by importing it from the transformers library, changes could happen to the README.md file? If yes, why, and what should be done in that case?
Actually, what sits between the `---` delimiters of the README.md file is a YAML configuration block, so the behavior of the model may change depending on changes to it.
Everything outside those delimiters, on the other hand, is just an ordinary README.
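To make the structure concrete, a model card's front matter looks roughly like this (a minimal illustration; apart from `pipeline_tag`, the fields are placeholder examples rather than the actual contents of the bart-large-cnn card):

```yaml
---
# everything between the two --- lines is configuration read by the Hub
pipeline_tag: summarization
language: en
license: mit
---
```

Everything after the second `---` is rendered as the ordinary README body.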
But looking at your program again, there should be few cases where this style of calling causes problems…
README.md issues usually arise only when the model is called through the API.
However, since the error message mentions `self`, it is being raised inside some Python class, so I still suspect the model, the functions that drive it, and the options passed to them.
Anyway, in cases like this it helps to try another, similar model to isolate the cause of the problem.
Once the location of the bug is known, possible workarounds (if any exist) will become apparent.
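For example, a minimal isolation sketch could swap in a smaller checkpoint (the model name below is just one plausible choice, and the explicit `summarization` task with `truncation=True` is an assumption to test, not a confirmed fix):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

# Swap in a smaller, similar summarizer to see whether the IndexError
# follows the model or the calling code.
alt_name = "sshleifer/distilbart-cnn-12-6"  # illustrative alternative
alt_tokenizer = AutoTokenizer.from_pretrained(alt_name)
alt_model = AutoModelForSeq2SeqLM.from_pretrained(alt_name)
alt_pipe = pipeline("summarization", model=alt_model, tokenizer=alt_tokenizer)

for each_Doc in doc_splitted:
    # truncation=True clips each chunk to the tokenizer's model_max_length,
    # so an over-long chunk cannot index past the position-embedding table.
    out = alt_pipe(each_Doc.page_content, truncation=True, max_length=200)
    print(out[0]["summary_text"][:80])
```

If the error disappears with the alternate model, the original model or the options passed to it are the likely culprit.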
Sure. Will look into it. Thanks