Wanted to ask what the best way to approach this task would be. Essentially, I want to convert SEC filings (e.g., AAPL's 10-K), which are large HTML files, into a hierarchical JSON format.
The use case is to make the filing easier to read by converting its HTML into a JSON structure like the following (shown here as plain text, but the actual values would be HTML):
Item 1. Business
- The Company designs, manufactures, …
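To be concrete, here's a rough sketch of the kind of nested structure I have in mind. The key names (`title`, `content`, `children`) are just illustrative, not a fixed schema:

```python
import json

# Hypothetical target structure -- key names are illustrative only.
# "content" would hold the HTML fragments that sit directly under a
# section, and "children" the nested subsections.
filing = {
    "title": "Item 1. Business",
    "content": ["<p>The Company designs, manufactures, ...</p>"],
    "children": [
        {
            "title": "Products",
            "content": [],
            "children": [],
        }
    ],
}

print(json.dumps(filing, indent=2))
```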
Would there be a way to essentially loop through every HTML element and compare it with the previous one using a comparison function that decides where to add the element in the structure? This function could be trained on data from doing the process manually.
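Something like this is what I mean, as a rough sketch. The `compare` function here is a dumb tag-rank stand-in for the trained model (a real version would score the raw HTML of both elements), and names like `FlatExtractor` and `build_tree` are made up for illustration:

```python
from html.parser import HTMLParser

class FlatExtractor(HTMLParser):
    """Flatten the document into an ordered list of {tag, text} elements."""
    def __init__(self):
        super().__init__()
        self.elements = []
        self._tag = None

    def handle_starttag(self, tag, attrs):
        if tag in ("h2", "h3", "p"):
            self._tag = tag

    def handle_data(self, data):
        if self._tag and data.strip():
            self.elements.append({"tag": self._tag, "text": data.strip()})
            self._tag = None

# Stand-in for the trained comparison function: given the previous element
# and the current one, decide whether the current element is a child,
# sibling, or belongs to some ancestor. Here it just ranks tag names.
RANK = {"h2": 0, "h3": 1, "p": 2}

def compare(prev, curr):
    if prev is None or RANK[curr["tag"]] > RANK[prev["tag"]]:
        return "child"
    if RANK[curr["tag"]] == RANK[prev["tag"]]:
        return "sibling"
    return "parent"

def build_tree(elements):
    root = {"el": None, "text": None, "children": []}
    stack = [root]  # path from the root down to the most recent node
    prev = None
    for el in elements:
        rel = compare(prev, el)
        if rel == "sibling":
            stack.pop()
        elif rel == "parent":
            # climb back up until we find a node that can contain this one
            while len(stack) > 1 and RANK[el["tag"]] <= RANK[stack[-1]["el"]["tag"]]:
                stack.pop()
        node = {"el": el, "text": el["text"], "children": []}
        stack[-1]["children"].append(node)
        stack.append(node)
        prev = el
    return root

html = """
<h2>Item 1. Business</h2>
<p>The Company designs, manufactures, ...</p>
<h3>Products</h3>
<p>iPhone is the Company's line of smartphones.</p>
"""
parser = FlatExtractor()
parser.feed(html)
tree = build_tree(parser.elements)
```

The idea would then be to replace `compare` with calls to the trained model, feeding it the raw HTML of both elements instead of just tag names.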
I was curious whether this is a feasible solution. Creating the training set is clearly doable, but is it cost-effective to run this over the few thousand HTML elements in a filing? Are there any limits on the size of each HTML element? Also, can a Hugging Face text-classification model classify raw HTML as well? Finally, is there a better way to go about this?