Training on my org's dataset with a GPT-4-like model as the base

Our organization has about 270 documents that govern how we handle just about any matter. They are in PDF and DOCX format, averaging roughly 0.5 MB each. They are all well written, high quality, and to the point. Everything is in Swedish.

I would like to train a language model that can answer questions about how to work and act in different situations. I have noticed that GPT-4 speaks perfect Swedish.

  • The dream would be GPT-4-like performance on our unique dataset.
  • Paying for GPT-4 API or another pretrained language model is not a problem.
  • I would prefer our dataset to be kept “in house”.
  • I would also prefer that the algorithm trained on our material is kept “in house”.
  • Initially, I would like to train on local hardware if possible. Of course, we could move to a cloud service later.

How do I set up and train on the data in question? How much human interaction is needed to grade the output? Hardware can be acquired, but I need to show some results if I want this to fly and get budget. What would be a good estimate of the hardware needed for a crude trial? I'm sorry if this is a stupid or obvious question. Personally, I have rudimentary coding experience, but I will hire help if necessary.
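For a crude trial, one preprocessing step is needed no matter which route is taken (fine-tuning or retrieval-based QA): splitting the documents into model-sized pieces. Below is a minimal sketch, assuming the PDFs/DOCX files have already been converted to plain text; the chunk size and overlap values are illustrative assumptions, not tested recommendations.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split a document into chunks of roughly chunk_size characters,
    overlapping by `overlap` so that an answer spanning a chunk
    boundary still appears whole in at least one chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# Example: a short document fits in a single chunk;
# longer documents produce several overlapping chunks.
small = chunk_text("En kort policytext.")
print(len(small))  # 1
```

Chunks like these can then be embedded and searched locally, or assembled into question/answer training pairs, which keeps the raw documents in house during the trial.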

Thank you all for your time!