How to use CodeGen

Hi!

I’m trying to use CodeGen 350M Mono for transfer learning. However, I don’t understand how CodeGen’s tokenizer works.


Here’s my code:

import datasets
from transformers import CodeGenModel, CodeGenTokenizer

train_data = datasets.load_dataset("codeparrot/apps", "all", split="train")
test_data = datasets.load_dataset("codeparrot/apps", "all", split="test")

checkpoint = "Salesforce/codegen-350M-mono"
tokenizer = CodeGenTokenizer.from_pretrained(checkpoint)

model = CodeGenModel.from_pretrained(checkpoint)
model.save_pretrained("./model")

# Convert each split to a pandas DataFrame
data_train = train_data.to_pandas()
data_test = test_data.to_pandas()

train_sentences = []
train_labels = []

test_sentences = []
test_labels = []

# Inputs: question, input_output, difficulty; labels: solutions
for _, row in data_train[["question", "solutions", "input_output", "difficulty"]].iterrows():
    train_sentences.append(row["question"])
    train_sentences.append(row["input_output"])
    train_sentences.append(row["difficulty"])
    train_labels.append(row["solutions"])

for _, row in data_test[["question", "solutions", "input_output", "difficulty"]].iterrows():
    test_sentences.append(row["question"])
    test_sentences.append(row["input_output"])
    test_sentences.append(row["difficulty"])
    test_labels.append(row["solutions"])

I would like to tokenize those lists so I can train CodeGen with new layers on top (something similar to Keras’ ‘tokenizer.texts_to_sequences’ and ‘pad_sequences’).
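
Just to make it concrete, here is roughly what I imagine the equivalent calls would look like (I’m only guessing at the arguments, and I assume I’d need to set a pad token first since the tokenizer doesn’t seem to have one by default):

# Rough sketch of what I'm after: tokenizer() in place of texts_to_sequences,
# and padding/truncation in place of pad_sequences (arguments are my guess).
tokenizer.pad_token = tokenizer.eos_token  # guessing a pad token is needed

encoded_train = tokenizer(
    train_sentences,
    padding="max_length",   # pad everything to the same length
    truncation=True,
    max_length=128,
    return_tensors="pt",
)
encoded_train_labels = tokenizer(
    train_labels,
    padding="max_length",
    truncation=True,
    max_length=128,
    return_tensors="pt",
)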

Could someone help me?

Hi!

There are different ways to solve your problem, but I recommend using the datasets library: it lets you tokenize the whole corpus in an easy way.
Note that in the code below I read both the train and test data at the same time; you can then access each split like a dict (dataset["train"], dataset["test"]).

Here is a summary of the steps:

  1. Read the data with the datasets library.
  2. Define a preprocess function to tokenize the data the way you want. Since you want the data in two different lists (sentences, which contains question, input_output and difficulty, and labels, which contains solutions), I defined two functions. Each function is quite simple: it just joins the desired fields and tokenizes them.
  3. Tokenize all the data using the map function from datasets.

import json

from datasets import load_dataset
from transformers import AutoTokenizer

# Load both splits at once: dataset["train"] and dataset["test"]
dataset = load_dataset("codeparrot/apps", "all")

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")

def preprocess_function_sentences(data):
    # Join question, input_output and difficulty into one string per example
    return tokenizer(
        [q + io + d + " " for q, io, d in zip(data["question"], data["input_output"], data["difficulty"])],
        truncation=True,
        max_length=128,
    )

def preprocess_function_label(data):
    # "solutions" is a JSON-encoded list of solution strings (it can be empty)
    return tokenizer(
        [" ".join(json.loads(x)) if x else "" for x in data["solutions"]],
        truncation=True,
        max_length=128,
    )

tokenized_dataset_sentence = dataset.map(preprocess_function_sentences,
                                         batched=True,
                                         num_proc=4,
                                         remove_columns=dataset["train"].column_names)

tokenized_dataset_label = dataset.map(preprocess_function_label,
                                      batched=True,
                                      num_proc=4,
                                      remove_columns=dataset["train"].column_names)
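
About the pad_sequences part of your question: rather than padding everything up front, you can pad dynamically at batch time with a data collator. Here is a minimal sketch, assuming you train with a PyTorch DataLoader and reuse the EOS token for padding (CodeGen’s GPT-2-style tokenizer has no pad token by default); the batch size is just an example:

from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

# CodeGen's tokenizer has no pad token, so reuse EOS for padding
tokenizer.pad_token = tokenizer.eos_token

# Pads each batch to the length of its longest sequence
# (like pad_sequences, but per batch instead of globally)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

train_loader = DataLoader(
    tokenized_dataset_sentence["train"],
    batch_size=8,
    shuffle=True,
    collate_fn=data_collator,
)

batch = next(iter(train_loader))
print(batch["input_ids"].shape)  # (8, longest_sequence_in_this_batch)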

Hope this can help you!

PS: In the :hugs: docs you can find more detailed information (Process text data).

Cheers,
Ramón
