Hi,
I’m trying to split text from companies’ annual reports into paragraphs for classification with various models. Parsing paragraphs seems a lot more challenging than parsing sentences, mostly because reports are organized so differently across companies. I’m currently using the code below, but I struggle to filter out non-content elements like titles and legal information.
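import re

# Split the report text into candidate paragraphs on blank lines.
paragraphs = report["content"].split('\n\n')

def is_informative(paragraph):
    """Return True if the paragraph looks like body text rather than boilerplate."""
    paragraph = paragraph.strip()
    # Drop very short fragments and anything without letters.
    if len(paragraph) < 40 or not any(c.isalpha() for c in paragraph):
        return False
    # Patterns that mark non-content elements.
    non_content_patterns = [
        r'^\d+', r'continued on next page', r'^Table of Contents',
        r'^See next page', r'^[\W_]+'
    ]
    if any(re.search(pattern, paragraph, re.IGNORECASE) for pattern in non_content_patterns):
        return False
    return True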
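One direction that might help, as a rough sketch rather than a tested solution: the `^\d+` pattern above also rejects real paragraphs that merely start with a number (for example "2023 was a strong year..."), and `^[\W_]+` rejects anything that opens with a quote or bracket. The variant below anchors the page-number pattern to the whole paragraph, splits on blank lines that may contain stray whitespace, and adds a heading heuristic (short text with no sentence-ending punctuation). The names `split_paragraphs`, `looks_like_heading`, and `is_informative_v2`, and the thresholds `max_words=12` and `min_chars=40`, are assumptions for illustration and would need tuning on real reports.

```python
import re

# Hypothetical refinement; patterns and thresholds are guesses, not tuned values.
BLANK_LINE = re.compile(r"\n\s*\n")  # blank lines that may hold spaces or tabs
BOILERPLATE = re.compile(
    r"^(?:page\s+)?\d+(?:\s+of\s+\d+)?$"  # whole paragraph is a page number
    r"|continued on next page"
    r"|^table of contents"
    r"|^see next page",
    re.IGNORECASE,
)
# Sentence-ending punctuation, optionally followed by a closing quote or bracket.
SENTENCE_END = re.compile(r'[.!?]["”’)]?$')


def split_paragraphs(text):
    """Split on blank lines and collapse line breaks inside each paragraph."""
    chunks = BLANK_LINE.split(text)
    return [" ".join(chunk.split()) for chunk in chunks if chunk.strip()]


def looks_like_heading(paragraph, max_words=12):
    """Guess that short text without sentence-ending punctuation is a title."""
    if len(paragraph.split()) > max_words:
        return False
    return paragraph.isupper() or not SENTENCE_END.search(paragraph)


def is_informative_v2(paragraph, min_chars=40):
    """Keep paragraphs that are long enough, not boilerplate, and not headings."""
    paragraph = paragraph.strip()
    if len(paragraph) < min_chars or not any(c.isalpha() for c in paragraph):
        return False
    if BOILERPLATE.search(paragraph):
        return False
    return not looks_like_heading(paragraph)
```

It would be used as `[p for p in split_paragraphs(report["content"]) if is_informative_v2(p)]`. The heading heuristic will misfire on short body sentences that lack a final period and on bulleted lists, so it is only a starting point.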
Any ideas on how to refine the approach? Cheers!