Training on my org's dataset with GPT-4-ish as base

JoakimMyrberg · April 15, 2023, 4:39pm

Our organization has about 270 documents that govern how we handle just about any matter. They are PDF and DOCX files, averaging roughly 0.5 MB each. They are all well written, high quality and to the point. All information is in Swedish.
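For anyone who wants to sanity-check the size of a corpus like this, here is a minimal sketch using only the Python standard library. The function name `corpus_summary` and the folder layout are my own assumptions for illustration, and the figure below is the size of the raw files, not of the extracted text (which is usually much smaller for PDF and DOCX).

```python
from pathlib import Path


def corpus_summary(folder, extensions=(".pdf", ".docx")):
    """Return {extension: (file_count, total_bytes)} for documents under `folder`.

    Hypothetical helper: `folder` is assumed to contain the document collection,
    possibly in subfolders.
    """
    summary = {}
    for path in Path(folder).rglob("*"):
        ext = path.suffix.lower()
        if path.is_file() and ext in extensions:
            count, size = summary.get(ext, (0, 0))
            summary[ext] = (count + 1, size + path.stat().st_size)
    return summary


# Back-of-envelope figure from the numbers in the post:
# about 270 documents at roughly 0.5 MB each.
estimated_total_mb = 270 * 0.5  # raw file size in MB, not extracted text
```

Calling `corpus_summary("path/to/documents")` (the path is a placeholder) gives the real counts and byte totals to compare against the rough estimate.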

I would like to train a language model that can answer questions about how to work and act in different situations. I have noticed that GPT-4 speaks perfect Swedish.

The dream would be GPT-4-like performance on our unique dataset.
Paying for GPT-4 API or another pretrained language model is not a problem.
I would prefer our dataset to be kept “in house”.
I would also prefer that the model trained on our material is kept “in house”.
Initially I would like to use local hardware to train if possible. Of course, we would like to use a cloud service later.

How do I set up and train on the data in question? How much human feedback is needed to grade the output? Hardware can be acquired, but I need to show some results if I want this to fly and get budget. What would be a good estimate of the hardware needed for a crude trial? I'm sorry if this is a stupid or obvious question. Personally, I have rudimentary coding experience, but I will hire help if necessary.

Thank you all for your time!
