Create samples for Causal Language Modelling

Dmts93 · August 16, 2023, 4:39pm

Hello,

I want to create training input for causal language modelling with the Llama 2 model on Hugging Face. I have a set of documents scraped from a specific website and want to fine-tune on them. Each document basically covers a different corner of this domain. Some documents are very short, while others (such as the terms and conditions) are quite long. The input size for the Llama 2 model is 4096 tokens, and only 2% of the documents are above this threshold.

I’ve seen that the most common strategy for language modelling with this tool is to concatenate the texts, separated by an EOS token, and split the result into samples, such as:

***Sample 1***: sentence 1 <eos> sentence 2 <eos> .... sentence n <eos> <pad as necessary>

***Sample 2***: sentence n+1 <eos> sentence n+2 <eos> .... sentence n+j <eos> <pad as necessary>

Where

Document 1: sentence 1. sentence 2...sentence i
Document 2: sentence i+1. sentence i + 2...sentence i+k
Document 3: sentence i+k+1. sentence i +k+ 2...sentence n

and so on

A sample can contain sentences from different documents/topics, and each sample is up to 4096 tokens long.
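For concreteness, here is a minimal sketch of the concatenate-and-chunk strategy described above. It is an illustration under stated assumptions, not the library's own implementation: the function name `pack_documents`, the toy token ids, and the choice to place one EOS after each document (rather than after each sentence, as in the example above) are all assumptions. It works on lists of token ids so that it runs without downloading a model.

```python
from typing import Iterable, List, Optional


def pack_documents(
    tokenized_docs: Iterable[List[int]],
    eos_id: int,
    block_size: int,
    pad_id: Optional[int] = None,
) -> List[List[int]]:
    """Concatenate documents separated by EOS, then cut into fixed-size blocks.

    Hypothetical helper for illustration. If pad_id is None, an incomplete
    final block is dropped; otherwise it is right-padded with pad_id.
    """
    # Build one long token stream: doc 1, EOS, doc 2, EOS, ...
    stream: List[int] = []
    for ids in tokenized_docs:
        stream.extend(ids)
        stream.append(eos_id)  # marks the document boundary

    # Slice the stream into consecutive blocks of at most block_size tokens.
    blocks = [stream[i:i + block_size] for i in range(0, len(stream), block_size)]

    # Handle the (possibly shorter) last block.
    if blocks and len(blocks[-1]) < block_size:
        if pad_id is None:
            blocks.pop()
        else:
            missing = block_size - len(blocks[-1])
            blocks[-1] = blocks[-1] + [pad_id] * missing
    return blocks


# Toy ids standing in for three tokenized documents (assumed values).
docs = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]
samples = pack_documents(docs, eos_id=2, block_size=5, pad_id=0)
for sample in samples:
    print(sample)
```

With a real tokenizer, the ids would presumably come from something like `tokenizer(text, add_special_tokens=False)["input_ids"]` and the separator from `tokenizer.eos_token_id`, with `block_size` set to 4096 or lower. If padding is used, the padded positions would normally be excluded from the loss by setting their labels to -100.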

So, I’ve got the following questions:

Should I split documents into separate samples (and maybe reduce the maximum input size below 4096 tokens), or is it all right to include multiple sentences concatenated with multiple EOS tokens in a single training sequence?
Should I use the EOS token to separate the content of different documents, or should I use it as in the example above, marking the end of each distinct sentence?
Can someone provide a heuristic for selecting the chunking parameter used to split the input texts above?

shqqq · August 28, 2023, 1:18pm

Following this
