Hi all,
I am seeing a similar error:
ErrorMessage "RuntimeError: unique_by_key: failed to synchronize: cudaErrorAssert: device-side
assert triggered
I believe this is the assert that failed:
../aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.
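If I read it correctly, that assert fires when a target label falls outside [0, n_classes). A minimal standalone sketch (my own, not from the SageMaker job) of what I think triggers the same failure:

import torch
import torch.nn.functional as F

# minimal sketch: a target id that is >= n_classes trips the same
# assert inside the CUDA nll_loss kernel
logits = torch.randn(4, 2, device='cuda')            # pretend the model has 2 classes
targets = torch.tensor([0, 1, 1, 3], device='cuda')  # 3 is outside [0, 2)
loss = F.cross_entropy(logits, targets)              # surfaces as 'device-side assert triggered'
print(loss.item())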
Goal:
I am trying to fine-tune distilbert-base-uncased on my own dataset (4 classes) using the SageMaker Hugging Face Deep Learning Container. Versions:
!pip install -q transformers==4.26.0 datasets[s3]==2.9.0
And here’s the code:
# hyperparameters which are passed to the training job
hyperparameters = {
    'epochs': 1,
    'train_batch_size': 8,
    'model_name': 'distilbert-base-uncased'
}
from sagemaker.huggingface import HuggingFace

git_config = {'repo': 'https://github.com/huggingface/notebooks.git', 'branch': 'main'}

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point='train.py',
    source_dir='./sagemaker/01_getting_started_pytorch/scripts',
    git_config=git_config,
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,
    transformers_version='4.26.0',
    pytorch_version='1.13.1',
    py_version='py39',
    hyperparameters=hyperparameters
)

# start the training job
huggingface_estimator.fit({'train': train['hf_input_path'], 'test': test['hf_input_path']})
My train and test inputs are S3 URIs.
I created the files by loading the CSV files locally with load_dataset, tokenizing them as below, and saving them with dataset.save_to_disk (a rough sketch of the full preprocessing follows the tokenizer snippet).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

# tokenizer helper function
def tokenize(batch):
    return tokenizer(batch['Text'], padding='max_length', truncation=True)
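The rest of the preprocessing looks roughly like this (file names are placeholders, and I'm assuming here that the label column ends up named 'labels'):

from datasets import load_dataset

# load the local CSVs (file names are placeholders)
dataset = load_dataset('csv', data_files={'train': 'train.csv', 'test': 'test.csv'})

# tokenize the 'Text' column
dataset = dataset.map(tokenize, batched=True)

# assumption on my side: the Trainer expects the target column to be
# called 'labels', so the CSV's 'Label' column gets renamed here
dataset = dataset.rename_column('Label', 'labels')

# save each split to its S3 location (datasets[s3] pulls in s3fs, so
# save_to_disk should accept an s3:// URI directly)
dataset['train'].save_to_disk(train['hf_input_path'])
dataset['test'].save_to_disk(test['hf_input_path'])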
Not sure why the n_classes assertion would fail; my understanding was that HF would automatically pick up the new number of classes from my dataset.
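In case it helps, this is how I'm planning to sanity-check that the label values are actually in [0, 4) in the saved datasets (assuming the label column is named 'labels'):

from datasets import load_from_disk

# reload the saved training split (same URI passed to the estimator;
# this should work with s3fs installed)
train_ds = load_from_disk(train['hf_input_path'])

labels = train_ds['labels']        # assumption: the label column is named 'labels'
print(sorted(set(labels)))         # for 4 classes I expect [0, 1, 2, 3]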
Any help appreciated.
Thanks
Trish