Wanted to ask what the best way to approach this task would be. Essentially, I want to convert SEC filings (e.g., AAPL's 10-K), which are large HTML files, into a hierarchical JSON format.
The use case is to make the filing easier to read by converting its HTML into a JSON structure like the following (shown here as plain text, but the actual values would be HTML):
Item 1. Business
- The Company designs, manufactures, …
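To be concrete, here's a rough sketch of the kind of nested structure I have in mind. The key names (`title`, `content`, `children`) are just illustrative, not a fixed schema:

```python
import json

# Hypothetical target structure -- key names are illustrative only.
# "content" would hold the HTML fragments that sit directly under a
# section, and "children" the nested subsections.
filing = {
    "title": "Item 1. Business",
    "content": ["<p>The Company designs, manufactures, ...</p>"],
    "children": [
        {
            "title": "Products",
            "content": [],
            "children": [],
        }
    ],
}

print(json.dumps(filing, indent=2))
```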
Would there be a way to essentially loop through every HTML element and compare it with the previous one using a comparison function that decides where to add the element in the structure? This function could be trained on data from doing the process manually.
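Something like this is what I mean, as a rough sketch. The `compare` function here is a dumb tag-rank stand-in for the trained model (a real version would score the raw HTML of both elements), and names like `FlatExtractor` and `build_tree` are made up for illustration:

```python
from html.parser import HTMLParser

class FlatExtractor(HTMLParser):
    """Flatten the document into an ordered list of {tag, text} elements."""
    def __init__(self):
        super().__init__()
        self.elements = []
        self._tag = None

    def handle_starttag(self, tag, attrs):
        if tag in ("h2", "h3", "p"):
            self._tag = tag

    def handle_data(self, data):
        if self._tag and data.strip():
            self.elements.append({"tag": self._tag, "text": data.strip()})
            self._tag = None

# Stand-in for the trained comparison function: given the previous element
# and the current one, decide whether the current element is a child,
# sibling, or belongs to some ancestor. Here it just ranks tag names.
RANK = {"h2": 0, "h3": 1, "p": 2}

def compare(prev, curr):
    if prev is None or RANK[curr["tag"]] > RANK[prev["tag"]]:
        return "child"
    if RANK[curr["tag"]] == RANK[prev["tag"]]:
        return "sibling"
    return "parent"

def build_tree(elements):
    root = {"el": None, "text": None, "children": []}
    stack = [root]  # path from the root down to the most recent node
    prev = None
    for el in elements:
        rel = compare(prev, el)
        if rel == "sibling":
            stack.pop()
        elif rel == "parent":
            # climb back up until we find a node that can contain this one
            while len(stack) > 1 and RANK[el["tag"]] <= RANK[stack[-1]["el"]["tag"]]:
                stack.pop()
        node = {"el": el, "text": el["text"], "children": []}
        stack[-1]["children"].append(node)
        stack.append(node)
        prev = el
    return root

html = """
<h2>Item 1. Business</h2>
<p>The Company designs, manufactures, ...</p>
<h3>Products</h3>
<p>iPhone is the Company's line of smartphones.</p>
"""
parser = FlatExtractor()
parser.feed(html)
tree = build_tree(parser.elements)
```

The idea would then be to replace `compare` with calls to the trained model, feeding it the raw HTML of both elements instead of just tag names.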
I was curious whether this is a feasible solution. Creating the training set is clearly doable, but is it cost-effective to run this over the few thousand HTML elements in a filing? Are there any limits on the size of each HTML element? Also, can a Hugging Face text-classification model classify raw HTML as well? Finally, is there a better way to go about this?