Seeking Guidance on Extracting Bidding Data from Procurement Documents

Hello Hugging Face Community,

I’m working on a project that involves extracting company names and their bid amounts from diverse public procurement announcements, with the aim of organizing this information into a structured format like this: {'announcement_id': xxx, 'bids': [{'company': 'A', 'bid_value': 500}, {'company': 'B', 'bid_value': 600}]}.

Each announcement can list zero, one, or several companies along with their bids. Having this information in structured form would greatly help my research project. I’ve manually prepared a training dataset from a selection of these announcements. However, I’m quite new to NLP and unsure which methods are best suited to the task.
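To make the target format concrete, here is a minimal sketch of how I picture one labeled record. The class names (`Bid`, `Announcement`), the helper `to_record`, and the example IDs are placeholders I made up for illustration, not an existing schema or real data.

```python
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class Bid:
    company: str       # company name as it appears in the announcement
    bid_value: float   # bid amount (currency handling is left out here)


@dataclass
class Announcement:
    announcement_id: str
    # An announcement may list no bids at all, so default to an empty list.
    bids: List[Bid] = field(default_factory=list)


def to_record(announcement: Announcement) -> dict:
    """Convert an Announcement into the plain dict format described above."""
    return asdict(announcement)


# Hypothetical examples: one announcement with two bids, one with none.
with_bids = Announcement("demo-001", [Bid("A", 500), Bid("B", 600)])
without_bids = Announcement("demo-002")

records = [to_record(with_bids), to_record(without_bids)]
```

Keeping `bids` as a list, even when it is empty, means every announcement has the same shape, which I assume will make both training and evaluation simpler.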

Could you please point me in the right direction and offer advice or resources on the following points?

  1. Model Choice: Should I use Named Entity Recognition (NER), a generative model, or another approach for this task?
  2. Data Preparation: What are best practices for preparing and structuring my data to handle multiple entries per document?
  3. Model Training: How can I effectively fine-tune a model on my manually extracted sample of companies and bids so that it recognizes and extracts this specific type of data?

I would greatly appreciate any insights or examples of similar projects, as I’m not sure which of these aspects deserves the most attention.

Thank you!