How to do text classification on long sequences?

Hi all,

I am seeing a similar error:

ErrorMessage "RuntimeError: unique_by_key: failed to synchronize: cudaErrorAssert: device-side
 assert triggered

I believe this is the assertion that failed:

../aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.
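
From what I can tell, this assertion fires when a label value reaching the loss is outside the range [0, num_labels). Here is the minimal check I'd run on my tokenized dataset to confirm that - a sketch only, and the path and column name ('labels' vs 'label') are my assumptions, not necessarily what the notebook uses:

from datasets import load_from_disk

# load the tokenized split I saved locally (example path)
train_dataset = load_from_disk('./tokenized/train')

# nll_loss asserts 0 <= t < n_classes, so with 4 classes every label
# must be one of {0, 1, 2, 3}; 1-based labels (1..4) would trip it
labels = set(int(l) for l in train_dataset['labels'])  # may be 'label' depending on the CSV
print('distinct labels:', sorted(labels))

num_labels = 4
print('out-of-range labels:', [l for l in labels if not 0 <= l < num_labels])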

Goal:

I am trying to fine-tune distilbert-base-uncased on my own dataset (4 classes) using the SageMaker Hugging Face deep learning container. Versions:

!pip install -q transformers==4.26.0 datasets[s3]==2.9.0

And here’s the code:

# hyperparameters which are passed to the training job
hyperparameters={
    'epochs': 1,
    'train_batch_size': 8,
    'model_name': 'distilbert-base-uncased'
}
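
(For context, my understanding is that SageMaker passes these hyperparameters to the entry point as command-line arguments, so inside train.py they would be picked up roughly like this - the argument names are my guess at what the getting-started script expects:)

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--epochs', type=int, default=3)
parser.add_argument('--train_batch_size', type=int, default=32)
parser.add_argument('--model_name', type=str)
args, _ = parser.parse_known_args()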

git_config = {'repo': 'https://github.com/huggingface/notebooks.git','branch': 'main'}

# create the Estimator
huggingface_estimator = HuggingFace(
        entry_point='train.py',
        source_dir='./sagemaker/01_getting_started_pytorch/scripts',
        git_config=git_config,
        instance_type='ml.p3.2xlarge',
        instance_count=1,
        role=role,
        transformers_version='4.26.0',
        pytorch_version='1.13.1', 
        py_version='py39',
        hyperparameters=hyperparameters
)

# starting the train job
huggingface_estimator.fit({'train': train['hf_input_path'], 'test': test['hf_input_path']})

My train and test inputs are S3 URIs.
I created the files by loading the CSV files locally with load_dataset, tokenizing with the helper below, and saving with dataset.save_to_disk (the full prep is sketched after the snippet).

from transformers import AutoTokenizer

tokenizer_name = 'distilbert-base-uncased'  # same checkpoint as the model
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

# tokenizer helper function
def tokenize(batch):
    return tokenizer(batch['Text'], padding='max_length', truncation=True)
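
The rest of the prep looks roughly like this (a sketch from memory - the CSV paths and the 'label' column name are from my files, not from the notebook):

from datasets import load_dataset

# load the local CSVs; each has a 'Text' column and an integer 'label' column
dataset = load_dataset('csv', data_files={'train': 'train.csv', 'test': 'test.csv'})

# tokenize both splits in batches with the helper above
dataset = dataset.map(tokenize, batched=True)

# rename the label column to 'labels' and expose torch tensors for the Trainer
dataset = dataset.rename_column('label', 'labels')
dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])

# save each split straight to its S3 URI (datasets[s3] installs s3fs,
# which is what lets save_to_disk write to an s3:// path)
dataset['train'].save_to_disk(train['hf_input_path'])
dataset['test'].save_to_disk(test['hf_input_path'])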

I'm not sure why the n_classes assertion would fail - my understanding is that Hugging Face will automatically detect the number of classes (4 in my case) from the dataset.
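
Or is this something I need to set explicitly? I.e. would the fix be to load the model with num_labels, along these lines (a hedged sketch of what I mean, not what the notebook's train.py necessarily does):

from transformers import AutoModelForSequenceClassification

# without num_labels the classification head defaults to 2 outputs,
# so label values 2 and 3 would hit the t < n_classes assertion
model = AutoModelForSequenceClassification.from_pretrained(
    'distilbert-base-uncased', num_labels=4
)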

Any help appreciated.
Thanks
Trish