I want to create an input source for a causal language model using the Llama 2 model on Hugging Face. I have a set of documents scraped from a specific website and want to fine-tune on them. Each document basically covers a different corner of this domain. Some documents are very short, while others (such as terms and conditions) are quite large. The input size for the Llama 2 model is 4096 tokens, and only 2% of the documents exceed this threshold.
I've seen that the most common strategy for language modelling with this library is to concatenate the texts and separate them with an EOS token, such as:
sample 1: sentence 1 <eos> sentence 2 <eos> ... sentence n <eos> <pad as necessary>
sample 2: sentence n+1 <eos> sentence n+2 <eos> ... sentence n+j <eos> <pad as necessary>

where the sentences come from documents like:

Document 1: sentence 1. sentence 2. ... sentence i
Document 2: sentence i+1. sentence i+2. ... sentence i+k
Document 3: sentence i+k+1. sentence i+k+2. ... sentence n
and so on
A sample can therefore contain sentences from different documents/topics, and each sample is up to the full 4096 tokens.
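For concreteness, here is a minimal sketch of the packing strategy I mean, in plain Python. The `eos_id`, `pad_id`, `block_size`, and token ids below are placeholder values; in practice they would come from the Llama 2 tokenizer (e.g. `tokenizer.eos_token_id`) with `block_size` up to 4096:

```python
def pack_documents(doc_token_ids, eos_id, pad_id, block_size):
    """Concatenate tokenized documents, separated by EOS, then cut
    the stream into fixed-size blocks, padding the last one."""
    stream = []
    for doc in doc_token_ids:
        stream.extend(doc)
        stream.append(eos_id)  # EOS marks each boundary

    blocks = []
    for i in range(0, len(stream), block_size):
        block = stream[i:i + block_size]
        block.extend([pad_id] * (block_size - len(block)))  # pad as necessary
        blocks.append(block)
    return blocks

# Toy example: two "documents" packed into blocks of 4 tokens.
docs = [[11, 12, 13], [21, 22]]
blocks = pack_documents(docs, eos_id=2, pad_id=0, block_size=4)
# blocks == [[11, 12, 13, 2], [21, 22, 2, 0]]
```

Note that with this scheme a document can be split across two consecutive blocks, which is part of what my questions below are about.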
So, I’ve got the following questions:
- Should I split documents into different samples (and maybe reduce the maximum input size below 4096 tokens), or is it alright to include sentences from multiple documents, concatenated with multiple EOS tokens, in a single training sequence?
- Should the EOS token be used to separate the contexts of different documents, or should I use it as in the example above, marking the end of each distinct sentence?
- Can someone provide a heuristic for selecting the chunking parameter used to split the input texts above?