Hi, I am trying to train CodeParrot on my own custom dataset of Verilog code. I am using the codeparrot tokenizer and the codeparrot-small model, and I am training by following the same approach as described in this blog. My training dataset is quite small: only 4,000 examples (1.4 MB) of single-line code. I have attached a sample of the dataset. My trained model is overfitting: when I give it a prompt that is not in the training data, it does not perform well. I have two hypotheses:
- The dataset is too simple and synthetic
- The model is too large for such a small dataset
I am currently generating the dataset synthetically. Please suggest some appropriate ways to avoid overfitting. The dataset can be found here.
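To put a number on the overfitting described above, part of the data can be held out and never trained on. The following is only a minimal sketch, assuming the examples are available as a list of strings; `split_examples`, `val_fraction`, and the toy module names are hypothetical and not part of my actual code. It removes exact duplicates first, since synthetic generators often repeat lines and a duplicate landing in both splits would hide the problem.

```python
import random

def split_examples(examples, val_fraction=0.1, seed=0):
    """Deduplicate, shuffle, and split examples into train and validation lists."""
    # dict.fromkeys drops exact duplicates while keeping first-seen order
    unique = list(dict.fromkeys(examples))
    rng = random.Random(seed)
    rng.shuffle(unique)
    # Keep at least one validation example when there is more than one example
    n_val = max(1, int(len(unique) * val_fraction)) if len(unique) > 1 else 0
    return unique[n_val:], unique[:n_val]

# Toy stand-in data with one exact duplicate (hypothetical, not the real dataset)
examples = [f"module M_{i}(A,B,Y);" for i in range(20)] + ["module M_0(A,B,Y);"]
train_examples, val_examples = split_examples(examples, val_fraction=0.2)
print(len(train_examples), len(val_examples))
```

If the loss on the training split keeps falling while the loss on the validation split rises, that would confirm overfitting rather than some other problem in the pipeline.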
Code for training
import torch
from torch.optim import AdamW
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM, pipeline
tokenizer = AutoTokenizer.from_pretrained("codeparrot/codeparrot")
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
config_kwargs = {
    "vocab_size": len(tokenizer),
    "scale_attn_by_layer_idx": True,
    "reorder_and_upcast_attn": True,
}
config = AutoConfig.from_pretrained("codeparrot/codeparrot-small", **config_kwargs)
config.n_layer = 2
config.n_head = 5
config.n_embd = 400
model = AutoModelForCausalLM.from_config(config)
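# train_dataset is a list of code strings; collect the token ids in a separate list,
# because appending to the list that is being iterated over never terminates
tokenized_dataset = []
for code_example in train_dataset:
    temp = tokenizer(code_example, truncation=True, max_length=280, padding="max_length")['input_ids']
    tokenized_dataset.append(temp)

class MyDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __getitem__(self, index):
        x = self.data[index]
        return torch.tensor(x)

    def __len__(self):
        return len(self.data)

X_train = MyDataset(tokenized_dataset)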
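dataloader = DataLoader(X_train, batch_size=32, shuffle=True)
# device was never defined in this snippet; use a GPU when one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
# get_grouped_params is not defined here; assumed to be the helper from the CodeParrot training script
optimizer = AdamW(get_grouped_params(model=model), lr=0.0001)
from tqdm import tqdm  # for our progress bar
epochs = 3
model.train()
for epoch in range(epochs):
    # setup loop with TQDM and dataloader
    loop = tqdm(dataloader, leave=True)
    for batch in loop:
        optimizer.zero_grad()
        inputs = batch.to(device)
        # exclude [PAD] positions from attention and from the loss (label -100 is ignored)
        attention_mask = (inputs != tokenizer.pad_token_id).long()
        labels = inputs.masked_fill(attention_mask == 0, -100)
        outputs = model(inputs, attention_mask=attention_mask, labels=labels, use_cache=False)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item())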
pipe = pipeline('text-generation', model=model, tokenizer=tokenizer, device=0)
out = pipe("module OR_8(Q,B,X);")[0]['generated_text']
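One option I am considering against the overfitting is to stop training once the loss on held-out data stops improving, instead of always running a fixed number of epochs. Below is a minimal sketch of that idea, not something from the blog: the `EarlyStopping` class, its `patience` and `min_delta` parameters, and the loss values are all hypothetical and only illustrate the bookkeeping.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            # Improvement: remember it and reset the counter
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Made-up validation losses, for illustration only
val_losses = [2.10, 1.60, 1.45, 1.50, 1.58, 1.70]
stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, val_loss in enumerate(val_losses):
    if stopper.step(val_loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best)
```

In the real loop, `val_loss` would come from running the model on the held-out examples after each epoch, with gradients disabled.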