Extract data from html page and extract pre-structured JSON

valerebron · September 23, 2024, 3:15pm

Hi Hugging Face community!
I am trying to parse an HTML file and extract its data as JSON.

For example, I want to crawl a web page that lists events, such as this one: “La programmation | Bataclan - Bataclan”,
and search the HTML to find all the gigs and generate JSON structured like this:

{
  "name": "name of the event",
  "date": "a timestamp",
  "url": "url of event",
  "artists": [
    {
      "name": "name of artist",
      "style": "style of artist",
      "url": "url of artist"
    }
  ]
}
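
To make the target shape concrete, here is a minimal Python sketch of the same structure as dataclasses, serialised with the standard library. The class and function names (Artist, Event, event_to_json) are made up for illustration, and treating the date as a Unix timestamp in seconds is an assumption, since the post only says "a timestamp".

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Artist:
    name: str
    style: str
    url: str


@dataclass
class Event:
    name: str
    date: int  # assumed: Unix timestamp in seconds
    url: str
    artists: list[Artist] = field(default_factory=list)


def event_to_json(event: Event) -> str:
    """Serialise one event, including its nested artists, to a JSON string."""
    # asdict() converts nested dataclasses (the artists list) recursively.
    return json.dumps(asdict(event), ensure_ascii=False, indent=2)


# Placeholder values only, not data taken from any real page.
example = Event(
    name="Example gig",
    date=1727103600,
    url="https://example.com/gig",
    artists=[Artist(name="Example band", style="rock", url="https://example.com/band")],
)
print(event_to_json(example))
```

Having the structure as types like this also gives you something to validate against, whatever tool ends up producing the data.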

I’d appreciate some expert advice. Is the Text2TextGeneration pipeline the best choice for this type of task?

John6666 · September 23, 2024, 3:54pm

First off, let me say that LLMs are not my area of expertise. I'm not an expert in anything, really.
The output from this space is Markdown, but if you look at app.py, you’ll see that an LLM is used. The same goes for other spaces, and it looks like something that could be adapted for your purpose.
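
As a rough idea of what adapting that approach could look like, here is a sketch, not the code from any particular space. It reduces the HTML to visible text (keeping link targets, since the URLs live in href attributes), asks a model for JSON, and keeps only objects that have the expected keys. The `generate` argument is a hypothetical stand-in for whatever model call you use, for example a function wrapping a transformers pipeline. The prompt wording and helper names are assumptions too.

```python
import json
import re
from html.parser import HTMLParser
from typing import Callable


class _TextExtractor(HTMLParser):
    """Collects visible text and link targets, skipping <script>/<style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Keep the URL so the model can fill in the "url" fields.
                self.chunks.append(f"[link: {href}]")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Assumed prompt wording; adjust it to the model you use.
PROMPT = (
    "Extract every event from the text below. Reply with only a JSON array "
    "of objects with keys name, date, url, artists; each artist has keys "
    "name, style, url.\n\n{text}"
)
REQUIRED_KEYS = {"name", "date", "url", "artists"}


def extract_events(html: str, generate: Callable[[str], str]) -> list[dict]:
    """Ask the model for JSON and keep only well-formed event objects."""
    reply = generate(PROMPT.format(text=html_to_text(html)))
    # Models often wrap JSON in extra prose, so grab the outermost [...] span.
    match = re.search(r"\[.*\]", reply, re.DOTALL)
    if match is None:
        raise ValueError("no JSON array found in model reply")
    events = json.loads(match.group(0))
    return [e for e in events if isinstance(e, dict) and REQUIRED_KEYS.issubset(e)]
```

Checking the keys after parsing matters because nothing forces a model to follow the requested structure, so it is safer to validate than to trust the reply.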
