Can't pickle error using accelerate multi-GPU

Hi all, I’m relatively new to Hugging Face, Transformers and especially Accelerate.
I’m trying to fine-tune the CodeGen model using four GPUs, distributing the training across them to speed up compute and avoid running out of CUDA memory. Note that I don’t want to replicate the model on each GPU, just distribute the computation. I’m currently working with the CodeGen 350M model; once this works as expected I plan to retrain the 6B/16B CodeGen models. The dataset I’m loading is one I curated myself: it consists of around 1800 prompts with corresponding code. I’m aware this may be a small dataset for fine-tuning; I intend to add more data incrementally once the full workflow has been developed.

Here are some details of my setup:

Hardware: Four GeForce RTX 3090 Ti (24 GB each)
Python: 3.9.16
pip: 23.0
torch: 1.13.1
accelerate: 0.16.0
transformers: 4.26.0
multiprocess: 0.70.12.2
CUDA Version: 11.4

It seems the training initialises on all GPUs fine, but at the start of the training loop I’m getting this:

Launching training on 4 GPUs.
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 2
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 1
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 3
INFO:torch.distributed.distributed_c10d:Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
INFO:torch.distributed.distributed_c10d:Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
INFO:torch.distributed.distributed_c10d:Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 4 nodes.
Using cuda_amp half precision backend
/home/jona/Projects/CodeGenFineTune/lib/python3.9/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
/home/jona/Projects/CodeGenFineTune/lib/python3.9/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
/home/jona/Projects/CodeGenFineTune/lib/python3.9/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
/home/jona/Projects/CodeGenFineTune/lib/python3.9/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
***** Running training *****
  Num examples = 44
  Num Epochs = 4
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 2
  Total optimization steps = 4
  Number of trainable parameters = 356712448
---------------------------------------------------------------------------
ProcessRaisedException                    Traceback (most recent call last)
Cell In[3], line 2
      1 from accelerate import notebook_launcher
----> 2 notebook_launcher(train, num_processes=4)

File ~/Projects/CodeGenFineTune/lib/python3.9/site-packages/accelerate/launchers.py:136, in notebook_launcher(function, args, num_processes, mixed_precision, use_port)
    133         launcher = PrepareForLaunch(function, distributed_type="MULTI_GPU")
    135         print(f"Launching training on {num_processes} GPUs.")
--> 136         start_processes(launcher, args=args, nprocs=num_processes, start_method="fork")
    138 else:
    139     # No need for a distributed launch otherwise as it's either CPU, GPU or MPS.
    140     use_mps_device = "false"

File ~/Projects/CodeGenFineTune/lib/python3.9/site-packages/torch/multiprocessing/spawn.py:198, in start_processes(fn, args, nprocs, join, daemon, start_method)
    195     return context
    197 # Loop on join until it returns True or raises an exception.
--> 198 while not context.join():
    199     pass

File ~/Projects/CodeGenFineTune/lib/python3.9/site-packages/torch/multiprocessing/spawn.py:160, in ProcessContext.join(self, timeout)
    158 msg = "\n\n-- Process %d terminated with the following error:\n" % error_index
    159 msg += original_trace
--> 160 raise ProcessRaisedException(msg, error_index, failed_process.pid)
...
    reduction.dump(process_obj, fp)
  File "/usr/lib/python3.9/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <function Embedding.forward at 0x7fba389cddc0>: it's not the same object as torch.nn.modules.sparse.Embedding.forward

I’m also unsure why the number of examples printed by the trainer is just 44; it should be around 1800.

The code I’m using is a bit of a mess, but it mostly works on my end; I’m not sure whether I’m preparing my data correctly or initialising the training with Accelerate correctly. I tried to follow the documentation as best I could, even using ChatGPT in places.
My code is in a Jupyter Notebook. I tried using the Accelerate notebook_launcher and also running without it (just calling .prepare and then .train directly), but both methods raise the same cannot-pickle-function error. See the code below:

import torch.multiprocessing as mp
mp.set_sharing_strategy('file_system')
mp.set_start_method(method='spawn')
import logging
logging.basicConfig(level=logging.INFO)
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModel, DataCollatorForSeq2Seq, AdamW, Seq2SeqTrainingArguments, Seq2SeqTrainer, CodeGenForCausalLM, CodeGenTokenizer # investigate CodeGen variants to see if they are faster
from torch.utils.data import TensorDataset, DataLoader
from sklearn.model_selection import train_test_split
from datasets import Dataset
from accelerate import Accelerator
import os
#os.environ['TOKENIZERS_PARALLELISM'] = 'false' # won't support tokenizers on multiple GPUs, so we don't parallelise, this shouldn't affect training too much
#from transformers import AutoModelForCausalLM, AutoTokenizer, EncoderDecoderModel, EncoderDecoderConfig, AutoConfig
training_args = Seq2SeqTrainingArguments(
    output_dir="./CodeGenFineTune",
    per_device_train_batch_size=4,  # Batch size per GPU
    per_device_eval_batch_size=4,   # Batch size per GPU
    num_train_epochs=4,
    gradient_accumulation_steps=2,  # Effective batch size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
    evaluation_strategy="steps",
    save_strategy="steps",
    save_steps=500,
    eval_steps=500,
    logging_steps=100,
    learning_rate=2e-5,
    warmup_steps=100,
    fp16=True,  # Mixed precision training
    dataloader_num_workers=1,  # Number of workers for data loading
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False
)
tokenizer = CodeGenTokenizer.from_pretrained("Salesforce/codegen-350M-mono", device_map='auto')
model = CodeGenForCausalLM.from_pretrained("Salesforce/codegen-350M-mono", device_map='auto')
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
tokenizer.add_special_tokens({'pad_token': '<pad>'})
data = pd.read_csv('./Prompt_Dataset_Complete.csv')

# shuffle rows of dataset
data = data.sample(frac=1).reset_index(drop=True)

inputs = list(data['Prompt'])
labels = list(data['Code'])

train_data, eval_data, train_labels, eval_labels = train_test_split(inputs, labels, test_size=0.2)
print('Train size:', len(train_data), ' Test size:', len(eval_data))
train_data = [{"input": train_data, "label": train_labels} for train_data, train_labels in zip(train_data, train_labels)]
eval_data = [{"input": eval_data, "label": eval_labels} for eval_data, eval_labels in zip(eval_data, eval_labels)]
train_inputs = [example["input"] for example in train_data]
train_labels = [example["label"] for example in train_data]

eval_inputs = [example["input"] for example in eval_data]
eval_labels = [example["label"] for example in eval_data]

# Tokenize data and add the input_ids and attention_mask fields
train_encodings = tokenizer(train_inputs, padding=True, truncation=True, return_tensors="pt")
train_encodings["labels"] = tokenizer(train_labels, padding=True, truncation=True, return_tensors="pt")["input_ids"]
train_dataset = Dataset.from_dict(train_encodings)
eval_encodings = tokenizer(eval_inputs, padding=True, truncation=True, return_tensors="pt")
eval_encodings["labels"] = tokenizer(eval_labels, padding=True, truncation=True, return_tensors="pt")["input_ids"]
eval_dataset = Dataset.from_dict(eval_encodings)

# set data to torch, use GPU0 to avoid tensors not on same device error
train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'], device='cuda:0')
eval_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'], device='cuda:0')

# create data loaders
train_loader = DataLoader(train_dataset, batch_size=32, num_workers=4, pin_memory=True)
eval_loader = DataLoader(eval_dataset, batch_size=32, num_workers=4, pin_memory=True)

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors='pt')
accelerator = Accelerator(device_placement='cpu', split_batches=True)

# prepare accelerator
trainer, data_collator, optimizer = accelerator.prepare(Seq2SeqTrainer(
    model = model,
    args=training_args,
    train_dataset= train_loader,
    eval_dataset = eval_loader,
    data_collator=data_collator
), data_collator, optimizer)

trainer.train()

I tried passing the DataLoader objects I defined into accelerator.prepare as the train and eval datasets, as well as plain Dataset objects, but both produce the same pickling error.
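Concretely, the two variants I tried look roughly like this (same names as in the code above):

# Variant 1: pass the DataLoader objects (as in the code above)
trainer, data_collator, optimizer = accelerator.prepare(
    Seq2SeqTrainer(model=model, args=training_args, train_dataset=train_loader,
                   eval_dataset=eval_loader, data_collator=data_collator),
    data_collator, optimizer)

# Variant 2: pass the Dataset objects instead
trainer, data_collator, optimizer = accelerator.prepare(
    Seq2SeqTrainer(model=model, args=training_args, train_dataset=train_dataset,
                   eval_dataset=eval_dataset, data_collator=data_collator),
    data_collator, optimizer)

Both raise the same _pickle.PicklingError.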

Has anyone encountered this before? I’d really appreciate some help here, any questions just let me know.
Thanks in advance,
John.

I’m sure you’ve read this: Python multiprocessing PicklingError: Can't pickle <type 'function'> - Stack Overflow. It might have something to do with where you’re initializing the accelerator. If that doesn’t work, I’d start taking the add-ons off until it starts working. Start with the Trainer: just write a regular loop like in the Accelerate tutorial, as sketched below.
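Something like this minimal sketch (untested, assuming your existing model, optimizer and train_loader; it just shows the shape of a plain loop):

from accelerate import Accelerator

def train():
    accelerator = Accelerator()
    # prepare() wraps the model, optimizer and dataloader for distributed training
    dist_model, dist_optimizer, dist_loader = accelerator.prepare(model, optimizer, train_loader)
    dist_model.train()
    for epoch in range(4):
        for batch in dist_loader:
            dist_optimizer.zero_grad()
            outputs = dist_model(**batch)        # CodeGenForCausalLM returns a loss when labels are supplied
            accelerator.backward(outputs.loss)   # use accelerator.backward instead of loss.backward()
            dist_optimizer.step()

You can then launch it the same way you already do, with notebook_launcher(train, num_processes=4).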

I’d seen that before, but it got me thinking.
I figured I should try to pickle everything I’m working with, and that led me to discover that the DataCollator I was using was what couldn’t be pickled. I’m now using the transformers DataCollatorWithPadding and that’s working just fine!
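For reference, the swap looked roughly like this (a sketch; DataCollatorWithPadding only pads the tokenizer fields and, unlike DataCollatorForSeq2Seq, takes no model argument. Since my labels are already tokenized and padded up front in the code above, the simpler collator is enough here):

from transformers import DataCollatorWithPadding

# no model reference needed; pads input_ids / attention_mask from the tokenizer
data_collator = DataCollatorWithPadding(tokenizer, return_tensors='pt')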

Thanks for your help :slight_smile:


Did you solve it? I encountered the same problem.

Yes, the problem was with the DataCollator. I recommend trying to pickle.dump all the objects you pass into the Trainer / Accelerate to see which one cannot be pickled :slight_smile:

Like how? My accelerator prepare part is below; how do I ‘pickle.dump all your objects’?
def prepare(self):
    self.model, self.optimizer, self.train_loader, self.val_loader, self.scheduler, self.data_collator = self.accelerator.prepare(
        self.model, self.optimizer, self.train_loader, self.val_loader, self.scheduler, self.data_collator)

Import pickle and call pickle.dumps on each object, e.g.:

import pickle
pickle.dumps(data_collator)  # raises PicklingError if the object can't be pickled

You may find that all but one of your objects can be pickled; that one will be the culprit.
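In your case, a minimal sketch of that check might look like this, run inside your class (substitute the objects from your own prepare method; the dict is just there to label the output):

import pickle

# try to pickle each object handed to accelerator.prepare; whichever one raises is the culprit
objects = {
    "model": self.model,
    "optimizer": self.optimizer,
    "train_loader": self.train_loader,
    "val_loader": self.val_loader,
    "scheduler": self.scheduler,
    "data_collator": self.data_collator,
}
for name, obj in objects.items():
    try:
        pickle.dumps(obj)
        print(f"{name}: OK")
    except Exception as exc:  # e.g. _pickle.PicklingError
        print(f"{name}: FAILED -> {exc}")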