Repost: Wikipedia (or something else) text to input output

Repost because my original post was hidden by bots:
Hey, I’m new to topics related to LLMs. I’m not sure if this is the right category, but I’ve come across many datasets containing texts from sources like Wikipedia, the web, books, and documentation. However, I’m unsure how to train a chatbot model using text that doesn’t include question-and-answer pairs.

Additionally, I’m curious about how companies like OpenAI and Meta train their models using text from the internet.

One more question (I know I’m asking a lot): how can I speed up training a text classifier model on a dataset with 441k rows?

Apologies for any mistakes; English isn’t my native language (I speak French).


That bot thing was a disaster.
If you're not sure about the category, it's fine to use 'Beginners'. I've never really paid attention to the category…
I don't know the details of LLM training, so I'll bring in someone else. @not-lain, you're probably the right person for this question.


Hi @Bigfoot302,
this blog post might interest you:

Additionally, I’m curious about how companies like OpenAI and Meta train their models using text from the internet.

These companies train their AI models in two to three steps:

  1. Pretraining: feeding the AI vast amounts of raw text (e.g. Wikipedia, math, poems, facts, …) just to teach it how to produce fluent language and give it some general knowledge (see the sketch after this list).
  2. Instruction tuning: here we show the AI examples of conversations to teach it how to reply to instructions. This also teaches it when to stop talking instead of generating a long paragraph (for example, when I say "Hi" I expect the AI to answer "Hello" instead of producing a Wikipedia-like reply; a bad reply without finetuning would be Hi is a word people use to greet each other ...).
  3. Preference optimization and further enhancements.
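
To make step 1 a bit more concrete, here is a rough sketch of how raw text (a Wikipedia dump here, but any plain-text dataset works the same way) can be turned into next-token-prediction training data. The dataset name, checkpoint, block size, and batch size are only examples, not a recommendation:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Example names only; swap in whatever dataset/checkpoint you actually use.
raw = load_dataset("wikimedia/wikipedia", "20231101.en", split="train[:1000]")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

block_size = 512

def tokenize(batch):
    return tokenizer(batch["text"])

def group_texts(batch):
    # Concatenate all token ids, then cut them into fixed-length blocks.
    ids = [tok for seq in batch["input_ids"] for tok in seq]
    total = (len(ids) // block_size) * block_size
    return {"input_ids": [ids[i:i + block_size] for i in range(0, total, block_size)]}

tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)
lm_dataset = tokenized.map(group_texts, batched=True, remove_columns=tokenized.column_names)

# mlm=False -> causal language modeling: the collator copies input_ids into
# labels, and the model shifts them internally for next-token prediction,
# so plain text is all the data you need at this stage (no Q&A pairs).
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=AutoModelForCausalLM.from_pretrained("gpt2"),
    args=TrainingArguments(output_dir="clm-demo", per_device_train_batch_size=8, num_train_epochs=1),
    train_dataset=lm_dataset,
    data_collator=collator,
)
trainer.train()
```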

One more question (I know I’m asking a lot): how can I speed up training a text classifier model on a dataset with 441k rows?

I would say get a big GPU and increase the batch size, or get multiple GPUs and set up multi-GPU parallelism. The second method is a bit complex, so research it in advance, or use the Trainer API (already used in the blog post above), which comes with multi-GPU support by default.
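
As a rough illustration of the single-GPU suggestion, here is a hedged TrainingArguments sketch; all values are placeholders to tune for your hardware:

```python
from transformers import TrainingArguments

# Placeholder values; adjust to whatever fits in your GPU memory.
args = TrainingArguments(
    output_dir="classifier-demo",
    per_device_train_batch_size=64,   # larger batches = fewer optimizer steps over the 441k rows
    gradient_accumulation_steps=2,    # simulates an even larger effective batch if memory is tight
    fp16=True,                        # mixed precision: faster matmuls, roughly half the activation memory
    dataloader_num_workers=4,         # keeps the GPU fed while the next batches are prepared
    num_train_epochs=1,
    logging_steps=100,
)
# Pass `args` to Trainer along with your classification model and tokenized dataset.
# With several GPUs, launching the same script through `torchrun` or `accelerate launch`
# gives you data parallelism without changing the Trainer code.
```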

Hope this helps


Thank you for the reply.

How can I “feed” a model with text? As far as I know, when training a model, I need input_ids and attention_mask for the inputs and labels for the outputs. However, I don’t see how it’s possible to train a model using just text.

Edit: I see that if I leave the input blank and put the text I want in the output, it works.
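Concretely, if I understand correctly, for causal language modeling the labels are just a copy of input_ids, and the model shifts them by one token internally to compute the next-token loss. A minimal sketch of what I mean (the checkpoint name is only an example):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # example checkpoint only
model = AutoModelForCausalLM.from_pretrained("gpt2")

batch = tokenizer("Wikipedia is a free online encyclopedia.", return_tensors="pt")

# The "output" is just the same tokens: labels = input_ids, and the model
# shifts them by one position internally to compute the next-token loss.
out = model(input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
            labels=batch["input_ids"])
print(out.loss)  # this is the loss you would backpropagate during training
```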