Repost: Wikipedia (or something else) text to input output

Hi @Bigfoot302
this blogpost might interest you :

Additionally, I’m curious about how companies like OpenAI and Meta train their models using text from the internet.

these companies train their AI models on 2 to 3 steps :

  1. pretraining : feeding the AI vast amounts of text data (ie wikipedia, math, poems, facts, …) just to teach the AI how to talk and some common general knowledge
  2. instruction tuning : in here we examples of conversations to the AI to teach it how to reply to instructions or conversations, this also helps to teach the AI when to stop talking instead of generating a long paragraph (example when I say Hi i expect the AI to say Hello instead of creating an wikipedia like reply, a bad example of a reply without finetuning is the following Hi is a word people use to greet each other ...
  3. preferance optimization and further enhancements

One more question (I know I’m asking a lot): how can I speed up training a text classifier model on a dataset with 441k rows?

I would say get a big GPU and increase the batch size, else get multiple GPUs and setup multiple GPU parallelism, the second method is a bit complex, so try to research more about it in advance or use the trainer API (already used in the blogpost above) which comes with multiple GPU support by default.

Hope this helps

2 Likes