Hello Hugging Face Community,
I’m working on a project that involves extracting company names and their bid amounts from diverse public procurement announcements, with the aim of organizing this information into a structured format like this: {'announcement_id': xxx, 'bids': [{'company': 'A', 'bid_value': 500}, {'company': 'B', 'bid_value': 600}]}.
Each announcement can include zero, one, or several companies and their bids. Having this information in structured form would greatly help my research project. I’ve manually prepared a training dataset from a selection of these announcements. However, I’m quite new to NLP and unsure which methods to use.
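To make the target format concrete, here is a minimal sketch of the record I have in mind, written as Python dataclasses. The class names `Bid` and `Announcement` and the IDs "xxx" and "yyy" are placeholders I made up for illustration, not part of any existing library or dataset.

```python
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class Bid:
    company: str
    bid_value: float


@dataclass
class Announcement:
    announcement_id: str
    # An announcement may contain no bids at all, so default to an empty list.
    bids: List[Bid] = field(default_factory=list)


# One announcement with two bids, and one with none (placeholder values).
with_bids = Announcement("xxx", [Bid("A", 500), Bid("B", 600)])
no_bids = Announcement("yyy")

# asdict() converts nested dataclasses into plain dicts and lists.
record = asdict(with_bids)
empty_record = asdict(no_bids)
```

Whatever model I end up using would need to produce output that can be loaded into this shape, including the empty-list case.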
Could you please point me in the right direction and offer advice or resources on the following points?
- Model Choice: Should I use Named Entity Recognition (NER), a Generative Model, or another approach for this task?
- Data Preparation: What are best practices for preparing and structuring my data to handle multiple entries per document?
- Model Training: How can I effectively fine-tune a model to recognize and extract this specific type of data, using my manually extracted sample of companies and bids from the announcements?
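In case it helps with the data preparation question, this is a rough sketch of how I imagine turning one manually labelled announcement into BIO tags if NER turns out to be the right approach. The sample sentence, the label names `COMPANY` and `BID`, and the helper functions `find_span` and `bio_tags` are all invented for illustration; tokenization here is plain whitespace splitting, which is an assumption and not what a real tokenizer would do.

```python
import re


def find_span(text, substring, label):
    """Return (start, end, label) character offsets of the first match."""
    start = text.index(substring)
    return (start, start + len(substring), label)


def bio_tags(text, spans):
    """Assign a BIO tag to each whitespace-separated token."""
    tagged = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in spans:
            # A token is inside an entity if it lies fully within the span.
            if match.start() >= start and match.end() <= end:
                prefix = "B-" if match.start() == start else "I-"
                tag = prefix + label
                break
        tagged.append((match.group(), tag))
    return tagged


# Made-up example text and annotations.
text = "Company A bid 500 EUR and Company B bid 600 EUR"
spans = [
    find_span(text, "Company A", "COMPANY"),
    find_span(text, "500", "BID"),
    find_span(text, "Company B", "COMPANY"),
    find_span(text, "600", "BID"),
]
tagged = bio_tags(text, spans)
```

Note that tags like these only mark where the companies and amounts are. They do not say which amount belongs to which company, and that pairing step is the part I am least sure how to handle.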
I would greatly appreciate any insights or examples of similar projects, as I’m not sure which aspects to focus on more deeply.
Thank you!