How to use CodeGen

Hi!

I’m trying to use CodeGen 350M Mono for transfer learning. However, I don’t understand how CodeGen’s tokenizer works.


Here’s my code:

import datasets
from transformers import CodeGenModel, CodeGenTokenizer

train_data = datasets.load_dataset("codeparrot/apps", "all", split="train")
test_data = datasets.load_dataset("codeparrot/apps", "all", split="test")

checkpoint = "Salesforce/codegen-350M-mono"
tokenizer = CodeGenTokenizer.from_pretrained(checkpoint)

model = CodeGenModel.from_pretrained(checkpoint)
model.save_pretrained("./model")

# Convert each split to a pandas DataFrame
data_train = train_data.to_pandas()
data_test = test_data.to_pandas()

train_sentences = []
train_labels = []

test_sentences = []
test_labels = []

# Inputs: question, input_output, difficulty; labels: solutions
for _, row in data_train[["question", "solutions", "input_output", "difficulty"]].iterrows():
    train_sentences.append(row["question"])
    train_sentences.append(row["input_output"])
    train_sentences.append(row["difficulty"])
    train_labels.append(row["solutions"])

for _, row in data_test[["question", "solutions", "input_output", "difficulty"]].iterrows():
    test_sentences.append(row["question"])
    test_sentences.append(row["input_output"])
    test_sentences.append(row["difficulty"])
    test_labels.append(row["solutions"])

I would like to tokenize those lists so I can train CodeGen with new layers on top (something similar to Keras’ ‘tokenizer.texts_to_sequences’ and ‘pad_sequences’).
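
Just to make it concrete, here is roughly what I imagine the equivalent calls would look like (I’m only guessing at the arguments, and I assume I’d need to set a pad token first since the tokenizer doesn’t seem to have one by default):

# Rough sketch of what I'm after: tokenizer() in place of texts_to_sequences,
# and padding/truncation in place of pad_sequences (arguments are my guess).
tokenizer.pad_token = tokenizer.eos_token  # guessing a pad token is needed

encoded_train = tokenizer(
    train_sentences,
    padding="max_length",   # pad everything to the same length
    truncation=True,
    max_length=128,
    return_tensors="pt",
)
encoded_train_labels = tokenizer(
    train_labels,
    padding="max_length",
    truncation=True,
    max_length=128,
    return_tensors="pt",
)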

Could someone help me?

Hi!

There are different ways to solve your problem, but I recommend using the datasets library: it lets you tokenize the whole corpus in an easy way.
Note that in the code below I read both the train and test data at the same time; you can then access each split like a dict (dataset["train"], dataset["test"]).

Here is a summary of the steps:

  1. Read the data with the datasets library.
  2. Define a preprocess function to tokenize the data the way you want. Since you want the data in two different lists (sentences, which contains question, input_output and difficulty, and labels, which contains solutions), I defined two functions. Each function is quite simple: it just joins the desired fields and tokenizes them.
  3. Tokenize all the data using the map function from datasets.

import json

from datasets import load_dataset
from transformers import AutoTokenizer

# Load both splits at once: dataset["train"] and dataset["test"]
dataset = load_dataset("codeparrot/apps", "all")

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")

def preprocess_function_sentences(data):
    # Join question, input_output and difficulty into one string per example
    return tokenizer(
        [q + io + d + " " for q, io, d in zip(data["question"], data["input_output"], data["difficulty"])],
        truncation=True,
        max_length=128,
    )

def preprocess_function_label(data):
    # "solutions" is a JSON-encoded list of solution strings (it can be empty)
    return tokenizer(
        [" ".join(json.loads(x)) if x else "" for x in data["solutions"]],
        truncation=True,
        max_length=128,
    )

tokenized_dataset_sentence = dataset.map(preprocess_function_sentences,
                                         batched=True,
                                         num_proc=4,
                                         remove_columns=dataset["train"].column_names)

tokenized_dataset_label = dataset.map(preprocess_function_label,
                                      batched=True,
                                      num_proc=4,
                                      remove_columns=dataset["train"].column_names)
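
About the pad_sequences part of your question: rather than padding everything up front, you can pad dynamically at batch time with a data collator. Here is a minimal sketch, assuming you train with a PyTorch DataLoader and reuse the EOS token for padding (CodeGen’s GPT-2-style tokenizer has no pad token by default); the batch size is just an example:

from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

# CodeGen's tokenizer has no pad token, so reuse EOS for padding
tokenizer.pad_token = tokenizer.eos_token

# Pads each batch to the length of its longest sequence
# (like pad_sequences, but per batch instead of globally)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

train_loader = DataLoader(
    tokenized_dataset_sentence["train"],
    batch_size=8,
    shuffle=True,
    collate_fn=data_collator,
)

batch = next(iter(train_loader))
print(batch["input_ids"].shape)  # (8, longest_sequence_in_this_batch)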

Hope this can help you!

PS: In the :hugs: docs you can find more detailed information (Process text data).

Cheers,
Ramón
