Hi folks. First post here!
I’m trying to understand this paper on a Chinese adaptation of LLaMA and am curious how exactly they adapt it to the new language: [2304.08177] Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca
Here are a few paragraphs from the paper:
We initialize the Chinese LLaMA model with the original LLaMA weights and conduct pre-training using fp16 on the 7B and 13B models. Additionally, for the 33B model, we employ the bitsandbytes library to train it in an 8-bit format, enhancing its efficiency and memory usage. We directly apply LoRA to attentions and MLPs for training while setting the embeddings and LM head as trainable.
For the basic version of Chinese LLaMA-7B, we utilize a two-stage pre-training approach. In stage 1, we fix the parameters of the transformer encoders within the model and only train the embeddings, adapting the newly added Chinese word vectors while minimizing the disturbance to the original model. In stage 2, we add LoRA weights (adapters) to the attention mechanisms and train the embeddings, LM heads, and newly added LoRA parameters. Note that two-stage training is not applied to other model training as it is less efficient in our preliminary study.
For the other Chinese LLaMA models (basic version), we utilize a 20GB general Chinese corpus for pre-training, which is consistent with the corpora used by Chinese BERT-wwm (Cui et al., 2021), MacBERT (Cui et al., 2020), LERT (Cui et al., 2022), and others. We also provide a “Plus” version, which further expands the pre-training data to 120GB, incorporating additional data from CommonCrawl (CC) and encyclopedia sources, enhancing the model’s understanding of fundamental concepts. We concatenate all the datasets and generate chunks of block size 512 for pre-training purposes.
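To check my reading of the two-stage setup, here is a minimal sketch of how I imagine it with Hugging Face PEFT. The module names (`q_proj`, `embed_tokens`, etc.) are my assumption based on the LLaMA implementation in `transformers`; the checkpoint path and the rank/alpha values are placeholders, not from the paper:

```python
# A minimal sketch of my reading of the two-stage setup, using PEFT.
# Module names are assumptions based on LLaMA in `transformers`;
# r/alpha and the checkpoint path are placeholders, not from the paper.
from transformers import LlamaForCausalLM
from peft import LoraConfig, get_peft_model

model = LlamaForCausalLM.from_pretrained("path/to/llama-7b-with-chinese-vocab")  # placeholder

# Stage 1: freeze the transformer blocks, train only the embeddings,
# so just the newly added Chinese token vectors get adapted.
for name, param in model.named_parameters():
    param.requires_grad = "embed_tokens" in name
# ... run causal-LM training on the Chinese corpus ...

# Stage 2: add LoRA adapters to the attention (and, per the first
# paragraph, MLP) projections; keep embeddings and LM head fully
# trainable via modules_to_save rather than adapting them with LoRA.
lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=8,             # placeholder rank
    lora_alpha=32,   # placeholder alpha
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    modules_to_save=["embed_tokens", "lm_head"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

Is that roughly what “setting the embeddings and LM head as trainable” means here?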
First of all, if you initialize with the weights of a pre-trained model, is what you do still called “pre-training”, or is it actually fine-tuning? Secondly, from what I know, LoRA is used for supervised fine-tuning, but the authors apparently use it with unlabeled data. (They discuss instruction fine-tuning later.) Can someone elucidate what exactly the authors are doing here?
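For reference, here is my rough guess at the “concatenate all the datasets and generate chunks of block size 512” step, i.e. plain next-token prediction on raw text, similar to the packing in the standard `run_clm`-style examples (the function and field names here are mine, not from the paper):

```python
# My guess at the block-512 packing step: concatenate tokenized documents
# and split them into fixed-size blocks for causal-LM training.
from itertools import chain

block_size = 512

def group_texts(examples):
    # Concatenate all tokenized documents into one long stream.
    concatenated = list(chain(*examples["input_ids"]))
    # Drop the tail so the stream splits evenly into blocks.
    total_length = (len(concatenated) // block_size) * block_size
    input_ids = [concatenated[i : i + block_size]
                 for i in range(0, total_length, block_size)]
    # For causal LM, the "labels" are just the input ids themselves
    # (next-token prediction), so no annotated data is needed.
    return {"input_ids": input_ids, "labels": [ids[:] for ids in input_ids]}
```

If that’s right, the “labels” are just the text itself, which would explain how LoRA can be trained here without supervised pairs, but I’d appreciate confirmation that this is what’s going on.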