How to do text classification on long sequences?

I just started using HF, so please bear with me. And no, I’ve never worked with tokenizers before, nor have I done text classification. This is all new to me.

I’ve been trying to run text classification on legal documents for a while now, but everything fails at one point or another.

For starters, I tried base models like “nlpaueb/legal-bert-small-uncased”. It worked somewhat, except that many of the inputs get truncated, since the input sequences can be long (up to about 3k words). That hurts prediction accuracy (biased data?), so I decided to take a deeper dive.
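To put a number on the truncation, a quick check like the one below (just a sketch, not part of my actual pipeline; it assumes the usual 512-token BERT limit) shows how much of a long document survives:

from transformers import AutoTokenizer

# Sketch: how long does a ~3k-word document get after tokenization,
# versus the 512-token limit of a BERT-style model?
bert_tokenizer = AutoTokenizer.from_pretrained("nlpaueb/legal-bert-small-uncased")

sample_doc = " ".join(["word"] * 3000)   # stand-in for a long legal document
ids = bert_tokenizer(sample_doc, truncation=False)["input_ids"]
print(len(ids))                          # well over 512
print(bert_tokenizer.model_max_length)   # 512, so everything past that gets dropped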

I then tried to use LED or Longformer, as suggested by HF articles, but I keep getting errors when trying to use them.

# tokenize function
def preprocess_function(examples):
    padding = "max_length"  # pad every example to max_length
    return tokenizer(examples["text"],
                     return_tensors='pt',
                     padding=padding,
                     max_length=max_length,  # defined earlier; default is 6144
                     truncation=True)

# tokenize
from transformers import AutoTokenizer

tokenizer_legal_bert = "nlpaueb/legal-bert-small-uncased"
tokenizer_legal_led = "nsi319/legal-led-base-16384" # https://huggingface.co/nsi319/legal-led-base-16384

tokenizer = AutoTokenizer.from_pretrained(tokenizer_legal_led)
tokenized_dataset_text = dataset_split.map(preprocess_function, batched=True)

from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
# evaluation
import evaluate
accuracy = evaluate.load("accuracy")

import numpy as np
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

# labels
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}
# using model
legal_bert_og = "nlpaueb/legal-bert-small-uncased"
legal_led = "nsi319/legal-led-base-16384"
led_base = "allenai/led-base-16384"
checkpoint = legal_led

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
    from_tf=True
)

Option 1:
If I use legal_bert_og (“nlpaueb/legal-bert-small-uncased”), the code runs past this point, but later, during training, I get an error that I have no idea how to fix.
The training code:

from transformers import TrainingArguments, Trainer

model_name = "legal_bert_led_test"

training_args = TrainingArguments(
    output_dir=model_name,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True,  # set to True to upload to the Hub
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset_text['train'],
    eval_dataset=tokenized_dataset_text['test'],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

Then I’ll get this error:

/content/legal_bert_led_test is already a clone of https://huggingface.co/wiorz/legal_bert_led_test. Make sure you pull the latest changes with `repo.git_pull()`.
WARNING:huggingface_hub.repository:/content/legal_bert_led_test is already a clone of https://huggingface.co/wiorz/legal_bert_led_test. Make sure you pull the latest changes with `repo.git_pull()`.
The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: Unnamed: 0, text. If Unnamed: 0, text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
/usr/local/lib/python3.9/dist-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
***** Running training *****
  Num examples = 800
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 150
  Number of trainable parameters = 35069442
You're using a LEDTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.

---------------------------------------------------------------------------

RuntimeError                              Traceback (most recent call last)

<ipython-input-50-dbc7cc02f7c3> in <module>
     24 )
     25 
---> 26 trainer.train()

4 frames

/usr/local/lib/python3.9/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    195     # some Python versions print out the first line of a multi-line function
    196     # calls in the traceback and some print out the last line
--> 197     Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    198         tensors, grad_tensors_, retain_graph, create_graph, inputs,
    199         allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass

RuntimeError: unique_by_key: failed to synchronize: cudaErrorAssert: device-side assert triggered.
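One thing I haven’t ruled out is a mismatch between the LED tokenizer and the BERT-sized model; a quick sanity check along these lines (a sketch only, reusing the objects defined above) would compare what the tokenizer produces with what the model can actually accept:

# Sketch of a sanity check: token ids outside the model's vocab, or sequences
# longer than its position embeddings, are a classic cause of device-side asserts.
print(len(tokenizer))                          # vocab size of the LED tokenizer
print(model.config.vocab_size)                 # vocab size the loaded model expects
print(model.config.max_position_embeddings)    # longest sequence the model supports
print(max(len(ids) for ids in tokenized_dataset_text["train"]["input_ids"]))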

Option 2:
If I use legal_led (nsi319/legal-led-base-16384), it fails right at the from_pretrained call, apparently because I’m using PyTorch.

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--nsi319--legal-led-base-16384/snapshots/d1c0c7730126e04e5c1efd991ef78f4eb0513def/config.json
Model config LEDConfig {
  "_name_or_path": "nsi319/legal-led-base-16384",
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "architectures": [
    "LEDForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "attention_window": [
    1024,
    1024,
    1024,
    1024,
    1024,
    1024
  ],
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "classifier_dropout": 0.0,
  "d_model": 768,
  "decoder_attention_heads": 12,
  "decoder_ffn_dim": 3072,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 6,
  "decoder_start_token_id": 2,
  "dropout": 0.1,
  "encoder_attention_heads": 12,
  "encoder_ffn_dim": 3072,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 6,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "id2label": {
    "0": "NEGATIVE",
    "1": "POSITIVE"
  },
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "label2id": {
    "NEGATIVE": 0,
    "POSITIVE": 1
  },
  "max_decoder_position_embeddings": 1024,
  "max_encoder_position_embeddings": 16384,
  "model_type": "led",
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "transformers_version": "4.26.1",
  "use_cache": true,
  "vocab_size": 50265
}

---------------------------------------------------------------------------

OSError                                   Traceback (most recent call last)

<ipython-input-52-a90dd8e88172> in <module>
      7 
      8 
----> 9 model = AutoModelForSequenceClassification.from_pretrained(
     10     checkpoint,
     11     num_labels=2,

1 frames

/usr/local/lib/python3.9/dist-packages/transformers/modeling_utils.py in from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
   2251                             )
   2252                         else:
-> 2253                             raise EnvironmentError(
   2254                                 f"{pretrained_model_name_or_path} does not appear to have a file named {WEIGHTS_NAME},"
   2255                                 f" {TF2_WEIGHTS_NAME}, {TF_WEIGHTS_NAME} or {FLAX_WEIGHTS_NAME}."

OSError: nsi319/legal-led-base-16384 does not appear to have a file named pytorch_model.bin, tf_model.h5, model.ckpt or flax_model.msgpack.

I guess maybe it would work if I used TensorFlow?
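Or would it already help to drop the from_tf=True flag here? A minimal sketch of that variant (assuming the checkpoint actually ships PyTorch weights, which I haven’t verified):

from transformers import AutoModelForSequenceClassification

# Sketch only: same call as above, minus from_tf=True, on the unverified
# assumption that the repo provides PyTorch weights.
model = AutoModelForSequenceClassification.from_pretrained(
    "nsi319/legal-led-base-16384",
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
)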


Option 3:
If I use led_base (“allenai/led-base-16384”), it also fails at the from_pretrained call:

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--allenai--led-base-16384/snapshots/38335783885b338d93791936c54bb4be46bebed9/config.json
Model config LEDConfig {
  "_name_or_path": "allenai/led-base-16384",
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "architectures": [
    "LEDForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "attention_window": [
    1024,
    1024,
    1024,
    1024,
    1024,
    1024
  ],
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "classifier_dropout": 0.0,
  "d_model": 768,
  "decoder_attention_heads": 12,
  "decoder_ffn_dim": 3072,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 6,
  "decoder_start_token_id": 2,
  "dropout": 0.1,
  "encoder_attention_heads": 12,
  "encoder_ffn_dim": 3072,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 6,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "id2label": {
    "0": "NEGATIVE",
    "1": "POSITIVE"
  },
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "label2id": {
    "NEGATIVE": 0,
    "POSITIVE": 1
  },
  "max_decoder_position_embeddings": 1024,
  "max_encoder_position_embeddings": 16384,
  "model_type": "led",
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "transformers_version": "4.26.1",
  "use_cache": true,
  "vocab_size": 50265
}

loading weights file tf_model.h5 from cache at /root/.cache/huggingface/hub/models--allenai--led-base-16384/snapshots/38335783885b338d93791936c54bb4be46bebed9/tf_model.h5
/usr/local/lib/python3.9/dist-packages/transformers/models/led/modeling_led.py:2533: FutureWarning: The `transformers.LEDForSequenceClassification` class is deprecated and will be removed in version 5 of Transformers. No actual method were provided in the original paper on how to perfom sequence classification.
  warnings.warn(
Loading TensorFlow weights from /root/.cache/huggingface/hub/models--allenai--led-base-16384/snapshots/38335783885b338d93791936c54bb4be46bebed9/tf_model.h5

---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

<ipython-input-51-4c038c80ab49> in <module>
      7 
      8 
----> 9 model = AutoModelForSequenceClassification.from_pretrained(
     10     checkpoint,
     11     num_labels=2,

3 frames

/usr/local/lib/python3.9/dist-packages/transformers/utils/import_utils.py in __getattr__(self, name)
   1101             value = getattr(module, name)
   1102         else:
-> 1103             raise AttributeError(f"module {self.__name__} has no attribute {name}")
   1104 
   1105         setattr(self, name, value)

AttributeError: module transformers has no attribute TFLEDForSequenceClassification

Anyway, I’m very lost at this point and not sure what to do.
I need a tokenizer that can handle long input sequences, but I’m not sure whether the model itself also needs to be trained on longer sequences. I don’t know what happens if I feed longer token sequences into a model that wasn’t trained Longformer-style, or whether that can even work.
As for the models themselves, the ones that are supposed to handle long inputs just give me errors, and the one that does work (“nlpaueb/legal-bert-small-uncased”) doesn’t cope with the longer token sequences, I think.
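To make the goal concrete, here is roughly the kind of setup I think should handle long inputs: an encoder-only long-context model with a ready-made classification head, e.g. Longformer. This is only a sketch of what I’m after, not something I have working:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Sketch: allenai/longformer-base-4096 accepts up to 4096 tokens (vs. 512 for BERT)
# and has a sequence-classification head available out of the box.
checkpoint = "allenai/longformer-base-4096"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
)

def preprocess_function(examples):
    # pad/truncate to the model's own limit rather than a hand-picked max_length
    return tokenizer(examples["text"], padding="max_length",
                     truncation=True, max_length=4096)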


TL;DR: How do I train text classification on long sequences? What am I doing wrong?

I am also just learning HF :wink: and I am getting the same error:

RuntimeError: unique_by_key: failed to synchronize: cudaErrorAssert: device-side assert triggered.

My setup is as follows:

batch_size = 16
logging_steps = len(self.defects_encoded["train"]) // batch_size
model_name = f"{self.model_ckpt}-finetuned-defects"
training_args = TrainingArguments(output_dir=model_name,
                                  num_train_epochs=2,
                                  learning_rate=2e-5,
                                  per_device_train_batch_size=batch_size,
                                  per_device_eval_batch_size=batch_size,
                                  weight_decay=0.01,
                                  evaluation_strategy="epoch",
                                  disable_tqdm=False,
                                  logging_steps=logging_steps,
                                  push_to_hub=True,
                                  log_level="error")
 
trainer = Trainer(model=self.model, args=training_args,
                  compute_metrics=self.compute_metrics,
                  train_dataset=self.defects_encoded["train"],
                  eval_dataset=self.defects_encoded["validation"],
                  tokenizer=self.tokenizer)
trainer.train();

The error with more context reads:
/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
    195     # some Python versions print out the first line of a multi-line function
    196     # calls in the traceback and some print out the last line
--> 197     Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    198         tensors, grad_tensors_, retain_graph, create_graph, inputs,
    199         allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass

RuntimeError: unique_by_key: failed to synchronize: cudaErrorAssert: device-side assert triggered

Hi all,

I am seeing a similar error:

ErrorMessage "RuntimeError: unique_by_key: failed to synchronize: cudaErrorAssert: device-side assert triggered

I believe this is the assert that failed:

../aten/src/ATen/native/cuda/Loss.cu:242: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.

Goal:

I am trying to fine-tune distilbert-base-uncased using my own dataset (4 classes), using the SageMaker HuggingFace deep learning container. Versions:

!pip install -q transformers==4.26.0 datasets[s3]==2.9.0

And here’s the code:

# hyperparameters which are passed to the training job
hyperparameters={
    'epochs': 1,
    'train_batch_size': 8,
    'model_name': 'distilbert-base-uncased'
}

git_config = {'repo': 'https://github.com/huggingface/notebooks.git','branch': 'main'}

# create the Estimator
huggingface_estimator = HuggingFace(
        entry_point='train.py',
        source_dir='./sagemaker/01_getting_started_pytorch/scripts',
        git_config=git_config,
        instance_type='ml.p3.2xlarge',
        instance_count=1,
        role=role,
        transformers_version='4.26.0',
        pytorch_version='1.13.1', 
        py_version='py39',
        hyperparameters = hyperparameters
)

# starting the train job
huggingface_estimator.fit({'train': train['hf_input_path'], 'test': test['hf_input_path']})

My train and test inputs are S3 URIs.
I created the files by loading the CSV files locally with load_dataset, tokenizing as below, and saving with dataset.save_to_disk.

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

# tokenizer helper function
def tokenize(batch):
    return tokenizer(batch['Text'], padding='max_length', truncation=True)

Not sure why the n_classes assertion would fail; my understanding was that HF would automatically detect the new number of classes.
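Would a sanity check along these lines make sense? (train_dataset, model, and the "labels" column name here are placeholders for whatever train.py actually uses, not my exact code):

# Hypothetical check: the nll_loss assert fires when a label is < 0 or >= n_classes,
# so confirm that the label values and the model's configured class count line up.
print(sorted(set(train_dataset["labels"])))   # for 4 classes this should be [0, 1, 2, 3]
print(model.config.num_labels)                # should also be 4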

Any help appreciated.
Thanks
Trish

I also ran into this error while running PyTorch code, and later found it was purely a dataset issue. I was using a replication package where the original authors’ dataset had labels starting from 0, while the dataset I was testing on had labels starting from 1. Once I changed my labels to start from 0, the issue was solved.
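For anyone hitting the same assert, the fix boils down to remapping the labels to the 0-based range the loss expects; a minimal sketch with datasets (the column name "label" is an assumption, adjust it to your dataset):

# Minimal sketch: shift 1-based labels down to the 0-based range expected by the loss.
def shift_labels(example):
    example["label"] = example["label"] - 1
    return example

dataset = dataset.map(shift_labels)
print(sorted(set(dataset["train"]["label"])))   # should be [0, 1, ..., num_labels - 1]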