Hi!
I'm trying to use CodeGen 350M Mono for transfer learning. However, I don't understand how CodeGen's tokenizer works.
Here’s my code:
import datasets
import pandas as pd
from transformers import CodeGenTokenizer, CodeGenModel

train_data = datasets.load_dataset("codeparrot/apps", "all", split="train")
test_data = datasets.load_dataset("codeparrot/apps", "all", split="test")

checkpoint = "Salesforce/codegen-350M-mono"
tokenizer = CodeGenTokenizer.from_pretrained(checkpoint)
model = CodeGenModel.from_pretrained(checkpoint)
model.save_pretrained('./model')

data_train = train_data.to_pandas()
data_test = test_data.to_pandas()

train_sentences = []
train_labels = []
test_sentences = []
test_labels = []

# collect the prompt fields and the solutions row by row
for _, row in data_train[['question', 'solutions', 'input_output', 'difficulty']].iterrows():
    train_sentences.append(row['question'])
    train_sentences.append(row['input_output'])
    train_sentences.append(row['difficulty'])
    train_labels.append(row['solutions'])

for _, row in data_test[['question', 'solutions', 'input_output', 'difficulty']].iterrows():
    test_sentences.append(row['question'])
    test_sentences.append(row['input_output'])
    test_sentences.append(row['difficulty'])
    test_labels.append(row['solutions'])
I would like to tokenize those lists so that I can train CodeGen with new layers added on top (something similar to Keras's 'tokenizer.texts_to_sequences' and 'pad_sequences').
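From reading the docs, my guess is that the Hugging Face equivalent is to call the tokenizer directly with padding and truncation enabled, something like the snippet below. I'm not sure this is correct, though: the max_length of 512 is just a value I picked, and since CodeGen's tokenizer apparently has no pad token by default I set it to the EOS token.

# my attempt at the equivalent of texts_to_sequences + pad_sequences
tokenizer.pad_token = tokenizer.eos_token  # CodeGen's tokenizer has no pad token by default

train_encodings = tokenizer(
    train_sentences,
    padding=True,       # pad each sequence to the longest one in the batch
    truncation=True,
    max_length=512,     # arbitrary cap, not sure what value makes sense here
    return_tensors="pt",
)
# train_encodings["input_ids"] and train_encodings["attention_mask"] would then
# be padded integer tensors, similar to the output of pad_sequences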
Could someone help me?