Repost: Wikipedia (or something else) text to input output

Repost because my original post was hidden by bots:
Hey, I’m new to topics related to LLMs. I’m not sure if this is the right category, but I’ve come across many datasets containing texts from sources like Wikipedia, the web, books, and documentation. However, I’m unsure how to train a chatbot model using text that doesn’t include question-and-answer pairs.

Additionally, I’m curious about how companies like OpenAI and Meta train their models using text from the internet.

One more question (I know I’m asking a lot): how can I speed up training a text classifier model on a dataset with 441k rows?

Apologies for any mistakes; English isn’t my native language (I speak French).


That bot thing was a disaster.
If you're not sure about the category, it's fine to use 'Beginners'. I've never really paid attention to the category…
I don't know the details of LLM training, so I'll bring in someone else. @not-lain, you're probably the right person for this question.


Hi @Bigfoot302,
this blog post might interest you:

Additionally, I’m curious about how companies like OpenAI and Meta train their models using text from the internet.

These companies train their AI models in two to three steps:

  1. Pretraining: feeding the AI vast amounts of raw text (e.g. Wikipedia, math, poems, facts, …) just to teach it how to produce fluent language and give it some general knowledge (see the sketch after this list).
  2. Instruction tuning: here we show the AI examples of conversations to teach it how to reply to instructions. This also teaches it when to stop talking instead of generating a long paragraph (for example, when I say "Hi" I expect the AI to answer "Hello" instead of producing a Wikipedia-like reply; a bad reply without finetuning would be Hi is a word people use to greet each other ...).
  3. Preference optimization and further enhancements.
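
To make step 1 a bit more concrete, here is a rough sketch of how raw text (a Wikipedia dump here, but any plain-text dataset works the same way) can be turned into next-token-prediction training data. The dataset name, checkpoint, block size, and batch size are only examples, not a recommendation:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Example names only; swap in whatever dataset/checkpoint you actually use.
raw = load_dataset("wikimedia/wikipedia", "20231101.en", split="train[:1000]")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

block_size = 512

def tokenize(batch):
    return tokenizer(batch["text"])

def group_texts(batch):
    # Concatenate all token ids, then cut them into fixed-length blocks.
    ids = [tok for seq in batch["input_ids"] for tok in seq]
    total = (len(ids) // block_size) * block_size
    return {"input_ids": [ids[i:i + block_size] for i in range(0, total, block_size)]}

tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)
lm_dataset = tokenized.map(group_texts, batched=True, remove_columns=tokenized.column_names)

# mlm=False -> causal language modeling: the collator copies input_ids into
# labels, and the model shifts them internally for next-token prediction,
# so plain text is all the data you need at this stage (no Q&A pairs).
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=AutoModelForCausalLM.from_pretrained("gpt2"),
    args=TrainingArguments(output_dir="clm-demo", per_device_train_batch_size=8, num_train_epochs=1),
    train_dataset=lm_dataset,
    data_collator=collator,
)
trainer.train()
```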

One more question (I know I’m asking a lot): how can I speed up training a text classifier model on a dataset with 441k rows?

I would say get a big GPU and increase the batch size, or get multiple GPUs and set up multi-GPU parallelism. The second method is a bit complex, so research it in advance, or use the Trainer API (already used in the blog post above), which comes with multi-GPU support by default.
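
As a rough illustration of the single-GPU suggestion, here is a hedged TrainingArguments sketch; all values are placeholders to tune for your hardware:

```python
from transformers import TrainingArguments

# Placeholder values; adjust to whatever fits in your GPU memory.
args = TrainingArguments(
    output_dir="classifier-demo",
    per_device_train_batch_size=64,   # larger batches = fewer optimizer steps over the 441k rows
    gradient_accumulation_steps=2,    # simulates an even larger effective batch if memory is tight
    fp16=True,                        # mixed precision: faster matmuls, roughly half the activation memory
    dataloader_num_workers=4,         # keeps the GPU fed while the next batches are prepared
    num_train_epochs=1,
    logging_steps=100,
)
# Pass `args` to Trainer along with your classification model and tokenized dataset.
# With several GPUs, launching the same script through `torchrun` or `accelerate launch`
# gives you data parallelism without changing the Trainer code.
```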

Hope this helps


Thank you for the reply.

How can I “feed” a model with text? As far as I know, when training a model, I need input_ids and attention_mask for the inputs and labels for the outputs. However, I don’t see how it’s possible to train a model using just text.

Edit: I see that if I leave the input blank and put the text I want in the output, it works.
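Concretely, if I understand correctly, for causal language modeling the labels are just a copy of input_ids, and the model shifts them by one token internally to compute the next-token loss. A minimal sketch of what I mean (the checkpoint name is only an example):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # example checkpoint only
model = AutoModelForCausalLM.from_pretrained("gpt2")

batch = tokenizer("Wikipedia is a free online encyclopedia.", return_tensors="pt")

# The "output" is just the same tokens: labels = input_ids, and the model
# shifts them by one position internally to compute the next-token loss.
out = model(input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
            labels=batch["input_ids"])
print(out.loss)  # this is the loss you would backpropagate during training
```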