CodeParrot retraining on custom dataset

Hi, I am trying to train CodeParrot on my own custom dataset, which is based on the Verilog language. I am using the codeparrot tokenizer and the codeparrot-small model, and I am training by following the same approach as in this blog. My training dataset is quite small: only 4,000 examples (1.4 MB), all single-line code snippets; I have attached a sample of the dataset. My trained model is overfitting: when I give it something outside of the training data, it does not perform well. I have two hypotheses:

  • The dataset is too simple and synthetic.
  • The model is quite big for such a small dataset.

I am currently producing the dataset synthetically. Please suggest some appropriate ways to avoid overfitting. Please find the dataset here.
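For reference, this is roughly how I would hold out a validation split to quantify the overfitting; the list name all_examples and the use of scikit-learn's train_test_split are just for illustration, not part of my original script:

from sklearn.model_selection import train_test_split

# Keep 10% of the synthetic examples aside; the model never trains on them,
# so comparing training loss with loss on val_examples shows how badly it overfits.
train_dataset, val_examples = train_test_split(all_examples, test_size=0.1, random_state=42)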

Code for training

import torch
from torch.optim import AdamW
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM, pipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("codeparrot/codeparrot")
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
config_kwargs = {"vocab_size": len(tokenizer),
                 "scale_attn_by_inverse_layer_idx": True,
                 "reorder_and_upcast_attn": True}


config = AutoConfig.from_pretrained("codeparrot/codeparrot-small", **config_kwargs)
config.n_layer = 2
config.n_head = 5
config.n_embd = 400
model = AutoModelForCausalLM.from_config(config)
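
# Quick check (not in my original script): how many parameters the shrunken
# model still has, compared with the 1.4 MB training set.
print(f"Model size: {model.num_parameters():,} parameters")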


# Tokenize every example once, padding/truncating to a fixed length.
# (Appending back into the list being iterated over was a bug, so the
# tokenized ids go into a separate list.)
tokenized_dataset = []
for code_example in train_dataset:
    tokenized_dataset.append(
        tokenizer(code_example, truncation=True, max_length=280, padding="max_length")["input_ids"]
    )
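
# Sanity check (an extra step, not from the blog): the codeparrot tokenizer was
# trained on Python, so it is worth confirming a Verilog line survives tokenization.
print(tokenizer.decode(tokenized_dataset[0], skip_special_tokens=True))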

class MyDataset(Dataset):
    def __init__(self, data):
        self.data = data
        
    def __getitem__(self, index):
        x = self.data[index]
        
        return torch.tensor(x)
    
    def __len__(self):
        return len(self.data)

X_train = MyDataset(tokenized_dataset)
dataloader = DataLoader(X_train, batch_size=32, shuffle=True)
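
# Validation DataLoader (a sketch, assuming the val_examples split from above),
# tokenized the same way as the training data so the losses are comparable.
val_tokens = [tokenizer(ex, truncation=True, max_length=280, padding="max_length")["input_ids"]
              for ex in val_examples]
val_dataloader = DataLoader(MyDataset(val_tokens), batch_size=32)

# get_grouped_params is (roughly) the helper from the blog: biases and LayerNorm
# weights get no weight decay, everything else does.
weight_decay = 0.1

def get_grouped_params(model, no_decay=["bias", "LayerNorm.weight"]):
    params_with_wd, params_without_wd = [], []
    for n, p in model.named_parameters():
        if any(nd in n for nd in no_decay):
            params_without_wd.append(p)
        else:
            params_with_wd.append(p)
    return [
        {"params": params_with_wd, "weight_decay": weight_decay},
        {"params": params_without_wd, "weight_decay": 0.0},
    ]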

optimizer = AdamW(get_grouped_params(model=model), lr=0.0001)
from tqdm import tqdm  # for our progress bar

epochs = 3
model.to(device)  # the batches are moved to `device` below, so the model must be there too
model.train()

for epoch in range(epochs):
    # setup loop with TQDM and dataloader
    loop = tqdm(dataloader, leave=True)
    for batch in loop:

        optimizer.zero_grad()
        inputs = batch.to(device)

        outputs = model(inputs, labels=inputs, use_cache=False)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item())
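
    # Validation pass at the end of each epoch (a sketch, using the held-out split
    # from above) so the gap between training and validation loss is visible.
    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for val_batch in val_dataloader:
            val_batch = val_batch.to(device)
            val_loss += model(val_batch, labels=val_batch, use_cache=False).loss.item()
    print(f"Epoch {epoch}: validation loss {val_loss / len(val_dataloader):.4f}")
    model.train()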

pipe = pipeline('text-generation', model=model, tokenizer=tokenizer, device=0)
out = pipe("module OR_8(Q,B,X);")[0]['generated_text']
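
# To make the failure mode concrete: comparing a prompt copied verbatim from the
# training data with one that is not in it shows the gap described above.
print(out)
print(pipe(train_dataset[0])[0]['generated_text'])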