CodeParrot retraining on custom dataset

Hi, I am trying to train CodeParrot on my own custom dataset of Verilog code. I am using the codeparrot tokenizer and the codeparrot-small model, and I am following the same approach as described in this blog. My training dataset is quite small: only 4000 examples (1.4 MB) of single-line code. I have attached a sample of the dataset. My trained model is overfitting: when I give it a prompt from outside the training data, it doesn't perform well. I have two assumptions about why:

  • The dataset is too simple and synthetic
  • The model is too big for such a small dataset
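
To put the second assumption in rough numbers, here is a minimal sketch that estimates the parameter count of a GPT-2-style model from its config values. The helper name `gpt2_param_count` is hypothetical, and the figures plugged in (a vocabulary of 32768 tokens plus one pad token, 1024 positions, 12 layers) are assumptions about the codeparrot-small config rather than values read from it; only `n_embd = 400` and the 1.4 MB dataset size come from this post.

```python
def gpt2_param_count(vocab_size, n_positions, n_embd, n_layer):
    """Approximate trainable parameters of a GPT-2-style decoder
    whose output layer shares weights with the token embedding."""
    # Token embedding table plus learned position embeddings
    embeddings = vocab_size * n_embd + n_positions * n_embd
    # Per block: attention (4*d^2 weights), MLP (8*d^2 weights),
    # plus biases and two LayerNorms (13*d in total)
    per_layer = 12 * n_embd ** 2 + 13 * n_embd
    # Final LayerNorm (scale and bias)
    final_layer_norm = 2 * n_embd
    return embeddings + n_layer * per_layer + final_layer_norm


# Assumed config values (not read from the hub): 32768 + 1 pad token,
# 1024 positions, 12 layers; n_embd = 400 as set in the training code.
n_params = gpt2_param_count(vocab_size=32769, n_positions=1024, n_embd=400, n_layer=12)
dataset_bytes = 1.4e6  # 1.4 MB of training text
print(f"{n_params:,} parameters, about {n_params / dataset_bytes:.0f} per byte of training data")
```

Under these assumptions the reduced model still has tens of millions of parameters for roughly a million characters of text, which is at least consistent with the model being oversized for this dataset.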

I am currently generating the dataset synthetically. Please suggest some appropriate ways to avoid overfitting. Please find the dataset here.
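
One way to quantify the overfitting is to hold out part of the data and compare training and validation loss (or perplexity) after each epoch. The sketch below only shows the split and the loss-to-perplexity conversion; `split_examples`, `perplexity`, the 10% validation fraction, and the placeholder examples are all hypothetical and not part of my current setup.

```python
import math
import random


def split_examples(examples, val_fraction=0.1, seed=0):
    """Shuffle deterministically and split off a validation set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    # First n_val shuffled items become validation, the rest training
    return shuffled[n_val:], shuffled[:n_val]


def perplexity(mean_loss):
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    return math.exp(mean_loss)


# Placeholder single-line examples standing in for the real dataset
examples = [f"module M_{i}(Q,B,X);" for i in range(4000)]
train_examples, val_examples = split_examples(examples)
print(len(train_examples), len(val_examples))
```

A validation perplexity that rises while training perplexity keeps falling would confirm overfitting. With synthetic data, near-duplicate lines can land on both sides of the split, so the gap may look smaller than it really is.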

Code for training

import torch
from torch.optim import AdamW
from torch.utils.data import Dataset, DataLoader
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("codeparrot/codeparrot")
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
config_kwargs = {"vocab_size": len(tokenizer),
                 "scale_attn_by_layer_idx": True,
                 "reorder_and_upcast_attn": True}

config = AutoConfig.from_pretrained("codeparrot/codeparrot-small", **config_kwargs)
config.n_head = 5
config.n_embd = 400
model = AutoModelForCausalLM.from_config(config)

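# Tokenize every example, padding/truncating to 280 tokens, and keep only the token ids
train_dataset = [tokenizer(code_example, truncation=True, max_length=280, padding="max_length")['input_ids']
                 for code_example in train_dataset]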

class MyDataset(Dataset):
    def __init__(self, data):
        self.data = data
    def __getitem__(self, index):
        x = self.data[index]
        return torch.tensor(x)
    def __len__(self):
        return len(self.data)

X_train = MyDataset(train_dataset)
dataloader = DataLoader(X_train, batch_size=32,shuffle=True)

optimizer = AdamW(get_grouped_params(model=model),lr=0.0001)
from tqdm import tqdm  # for our progress bar

epochs = 3

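for epoch in range(epochs):
    # setup loop with TQDM and dataloader
    loop = tqdm(dataloader, leave=True)
    for batch in loop:
        optimizer.zero_grad()
        # move the batch of token ids to the same device as the model
        inputs = batch.to(model.device)
        # causal LM objective: the model shifts the labels internally
        outputs = model(inputs, labels=inputs, use_cache=False)
        loss = outputs.loss
        # backpropagate and update the weights
        loss.backward()
        optimizer.step()
        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item())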

pipe = pipeline('text-generation', model=model,tokenizer=tokenizer,device=0)
out = (pipe("module OR_8(Q,B,X);")[0]['generated_text'])
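
A side note on the training code: every example is padded to 280 tokens and the same ids are passed as `labels`, so the loss is also computed on the `[PAD]` positions, which dominate for single-line code. Hugging Face causal LM heads skip label positions set to -100, so one option is to mask padding in the labels. This is a rough sketch, not what the code above does; `mask_padding` is a hypothetical helper and the pad id 32768 in the example is an assumption.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss ignores


def mask_padding(input_ids, attention_mask):
    """Return labels equal to input_ids, with padded positions replaced by IGNORE_INDEX."""
    return [
        token_id if keep == 1 else IGNORE_INDEX
        for token_id, keep in zip(input_ids, attention_mask)
    ]


# Toy example: three real tokens followed by two pad tokens (pad id assumed to be 32768)
input_ids = [10, 11, 12, 32768, 32768]
attention_mask = [1, 1, 1, 0, 0]
labels = mask_padding(input_ids, attention_mask)
print(labels)  # [10, 11, 12, -100, -100]
```

To use this, the dataset would need to keep the tokenizer's `attention_mask` alongside `input_ids`, and the model call would take the masked list as `labels` (and ideally the mask as `attention_mask`) instead of `labels=inputs`.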