Split text from Annual Report PDFs into paragraphs

Hi,

I’m trying to split the text of companies’ Annual Reports into paragraphs so I can classify them with various models. Splitting into paragraphs seems much harder than splitting into sentences, mostly because report layouts differ so much across companies. I’m currently using the code below, but I struggle to filter out non-content elements such as titles and legal information.

    import re

    # Split the extracted report text on blank lines.
    paragraphs = report["content"].split("\n\n")

    def is_informative(paragraph):
        paragraph = paragraph.strip()
        # Drop short fragments and anything that contains no letters.
        if len(paragraph) < 40 or not any(c.isalpha() for c in paragraph):
            return False
        # Page numbers, navigation hints, and punctuation-only lines.
        non_content_patterns = [
            r'^\d+$', r'continued on next page', r'^Table of Contents', r'^See next page', r'^[\W_]+$'
        ]
        if any(re.search(pattern, paragraph, re.IGNORECASE) for pattern in non_content_patterns):
            return False
        return True
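To make the behaviour concrete, here is a minimal, self-contained sketch of the same filter run on a made-up snippet. The `sample` text is invented for illustration and not taken from a real report. `split_paragraphs` is a hypothetical helper that also tolerates whitespace-only lines between paragraphs, which a plain `split("\n\n")` would miss. The `$` anchors in the patterns are my assumption about what the regexes should be.

```python
import re

NON_CONTENT_PATTERNS = [
    r'^\d+$', r'continued on next page', r'^Table of Contents',
    r'^See next page', r'^[\W_]+$',
]

def is_informative(paragraph):
    paragraph = paragraph.strip()
    # Drop short fragments and anything that contains no letters.
    if len(paragraph) < 40 or not any(c.isalpha() for c in paragraph):
        return False
    # Drop paragraphs matching any known non-content pattern.
    return not any(
        re.search(p, paragraph, re.IGNORECASE) for p in NON_CONTENT_PATTERNS
    )

def split_paragraphs(text):
    # Hypothetical helper: split on blank lines, including lines that
    # contain only whitespace, and discard empty pieces.
    return [p.strip() for p in re.split(r'\n\s*\n', text) if p.strip()]

# Invented sample text, not from a real Annual Report.
sample = (
    "Table of Contents\n\n"
    "12\n\n"
    "Revenue increased by 8% compared with the prior year, driven mainly by higher volumes.\n"
    " \n"
    "This discussion is continued on next page of the annual report for details.\n\n"
    "----------------------------------------------\n"
)

kept = [p for p in split_paragraphs(sample) if is_informative(p)]
for p in kept:
    print(p)
```

On this sample only the revenue sentence survives. The title and the page number fail the length check, the separator line has no letters, and the "continued on next page" sentence is caught by the pattern list.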

Any ideas on how to refine the approach? Cheers!

It would also be great to hear how others use filters to clean sentences extracted from PDFs, particularly for mandatory corporate documents such as Annual Reports. Thanks again!