Best strategy for structured data extraction

Lodovico · August 26, 2023, 8:47am

Hi Everyone!

I am a Data Scientist working on a real estate search portal in Italy (similar to Zillow). We are now implementing free-text search. For the demo, the only goal was to take user input and populate a JSON file, which would then serve as the search query. For this, I have parallelised several calls to ChatGPT, each set up to extract a single feature.
It seems to me that for most of the binary features, regex would outperform any model. I tried adding a classifier head to an Italian BERT, dbmdz/bert-base-italian-cased, but it did not perform very well. All I've found on the internet are classification examples for sentiment-based text, not for feature extraction. Does anyone have an idea of the best way to convert a free-text search into a JSON-like object for a database query?

KrissHolt · August 7, 2024, 12:09pm

For converting free text search to a JSON object for a database query, regex is very effective for binary features. You might consider a hybrid approach if you’re dealing with more complex features. You could use regex for simple, predictable patterns and a machine-learning model for more nuanced text extraction. Training a custom NER model on your specific dataset might help improve accuracy.
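
To make the hybrid idea concrete, here is a minimal sketch in Python. The feature names, the Italian keywords, and the `extract_with_model` placeholder are all assumptions for illustration, not anything from the original poster's system. Regex handles the predictable features first, and a model (for example a custom NER model) would only fill in what regex did not cover.

```python
import re

# Hypothetical binary features with Italian keyword patterns (illustrative only).
BINARY_PATTERNS = {
    "garden": re.compile(r"\bgiardino\b", re.IGNORECASE),
    "elevator": re.compile(r"\bascensore\b", re.IGNORECASE),
    "balcony": re.compile(r"\bbalcon[ei]\b", re.IGNORECASE),
}

# Simple numeric pattern, e.g. "3 locali" or "2 camere" (assumed phrasing).
ROOMS_PATTERN = re.compile(r"\b(\d+)\s*(?:locali|camere|stanze)\b", re.IGNORECASE)


def extract_with_model(text: str) -> dict:
    """Placeholder for the ML step (e.g. a custom NER model).

    Hypothetical hook: it returns nothing here, a real version would
    return the nuanced fields such as city or neighbourhood.
    """
    return {}


def text_to_query(text: str) -> dict:
    """Build a JSON-like query dict from free text."""
    # Step 1: regex for the simple, predictable features.
    query = {name: True for name, pat in BINARY_PATTERNS.items() if pat.search(text)}
    match = ROOMS_PATTERN.search(text)
    if match:
        query["rooms"] = int(match.group(1))
    # Step 2: model output only fills fields that regex did not set.
    for key, value in extract_with_model(text).items():
        query.setdefault(key, value)
    return query
```

Note that plain keyword matching like this would also fire on negations such as "senza giardino", so negation is one of the cases you would probably want to push to the model or handle with an extra pattern.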

RhaegarShelby · August 7, 2024, 12:18pm

When extracting structured data, I think the best strategy is to start with understanding exactly what kind of data you need and where it’s coming from. This can save a lot of headaches down the line.

RhaegarShelby · August 12, 2024, 9:11am

When extracting structured data, I think the best strategy is to start with understanding exactly what kind of data you need and where it's coming from. This can save a lot of headaches down the line. For me, tools like an identity validation API have been convenient. They help ensure that the data is accurate and trustworthy, which is super important.
Once you’ve got your tools, set up a straightforward process. Break down the extraction into manageable steps, and test each to ensure it works correctly before moving on. This way, you can catch any issues early and fix them without hassle.
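
As one way to "test each step", here is a rough Python sketch of a check that could run on the extracted query before it reaches the database. The schema and field names are hypothetical, chosen only for this example. It assumes each field has one expected type and reports every mismatch instead of failing on the first.

```python
# Hypothetical schema: field name -> expected Python type (an assumption).
QUERY_SCHEMA = {"city": str, "rooms": int, "max_price": int, "garden": bool}


def validate_query(query: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []
    for key, value in query.items():
        expected = QUERY_SCHEMA.get(key)
        if expected is None:
            errors.append(f"unknown field: {key}")
        # Exact type check, because bool is a subclass of int in Python.
        elif type(value) is not expected:
            errors.append(
                f"{key}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return errors
```

Running a check like this after each extraction step makes it easier to see which step produced a bad value.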
