I just started using HF, so please bear with me. I’ve never worked on tokenizers before, nor have I done text classification; this is all new to me.
I’ve been trying to run text classification on legal documents for a while now, but everything fails at some point or another.
For starters, I tried base models like “nlpaueb/legal-bert-small-uncased”. It worked somewhat, except that many of the inputs get truncated, since the documents can be long (up to about 3k words). That hurts prediction accuracy (biased data?), so I decided to take a deeper dive.
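(To show what I mean by the truncation, here is a minimal sketch; the example text is made up, and 512 tokens is what I believe this checkpoint maxes out at:)

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("nlpaueb/legal-bert-small-uncased")

# made-up long "document" standing in for one of my legal texts
long_doc = "The parties hereby agree " * 1500  # roughly 3k+ words

ids = tok(long_doc, truncation=False)["input_ids"]
print(len(ids))              # far beyond the model's limit
print(tok.model_max_length)  # 512 for this checkpoint, as far as I can tell
# with truncation=True, everything past model_max_length is dropped,
# which is where I suspect the accuracy hit comes from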
I then tried to use LED or Longformer, as suggested by HF articles, but I keep getting errors when trying to use them.
# tokenize function
max_length = 6144  # default used for this run

def preprocess_function(examples):
    return tokenizer(examples["text"],
                     return_tensors='pt',
                     padding="max_length",  # pad_to_max_length is deprecated; padding="max_length" covers it
                     max_length=max_length,
                     truncation=True)
# tokenize
from transformers import AutoTokenizer
tokenizer_legal_bert = "nlpaueb/legal-bert-small-uncased"
tokenizer_legal_led = "nsi319/legal-led-base-16384" # https://huggingface.co/nsi319/legal-led-base-16384
tokenizer = AutoTokenizer.from_pretrained(tokenizer_legal_led)
tokenized_dataset_text = dataset_split.map(preprocess_function, batched=True)
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
# evaluation
import evaluate
import numpy as np

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
# labels
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}
# using model
from transformers import AutoModelForSequenceClassification

legal_bert_og = "nlpaueb/legal-bert-small-uncased"
legal_led = "nsi319/legal-led-base-16384"
led_base = "allenai/led-base-16384"

checkpoint = legal_led

model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
    from_tf=True,
)
Option 1:
If I use legal_bert_og (“nlpaueb/legal-bert-small-uncased”), the code runs up to this point, but later on, during training, I get an error that I have no idea how to fix.
The code for training the model:
from transformers import TrainingArguments, Trainer

model_name = "legal_bert_led_test"

training_args = TrainingArguments(
    output_dir=model_name,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True,  # set to True to upload to the Hub
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset_text['train'],
    eval_dataset=tokenized_dataset_text['test'],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
Then I’ll get this error:
/content/legal_bert_led_test is already a clone of https://huggingface.co/wiorz/legal_bert_led_test. Make sure you pull the latest changes with `repo.git_pull()`.
WARNING:huggingface_hub.repository:/content/legal_bert_led_test is already a clone of https://huggingface.co/wiorz/legal_bert_led_test. Make sure you pull the latest changes with `repo.git_pull()`.
The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: Unnamed: 0, text. If Unnamed: 0, text are not expected by `BertForSequenceClassification.forward`, you can safely ignore this message.
/usr/local/lib/python3.9/dist-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
***** Running training *****
Num examples = 800
Num Epochs = 3
Instantaneous batch size per device = 16
Total train batch size (w. parallel, distributed & accumulation) = 16
Gradient Accumulation steps = 1
Total optimization steps = 150
Number of trainable parameters = 35069442
You're using a LEDTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-50-dbc7cc02f7c3> in <module>
24 )
25
---> 26 trainer.train()
4 frames
/usr/local/lib/python3.9/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
195 # some Python versions print out the first line of a multi-line function
196 # calls in the traceback and some print out the last line
--> 197 Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
198 tensors, grad_tensors_, retain_graph, create_graph, inputs,
199 allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass
RuntimeError: unique_by_key: failed to synchronize: cudaErrorAssert: device-side assert triggered.
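My current guess (not confirmed) is that this assert comes from the inputs being longer than the model’s position embeddings, or from a label index being out of range. This is just a sketch of the sanity checks I have in mind, assuming my label column is actually called "label":

# hypothetical sanity checks, assuming a length or label problem
print(model.config.max_position_embeddings)  # 512 for the small BERT, I think

lengths = [len(x) for x in tokenized_dataset_text["train"]["input_ids"]]
print(max(lengths))  # with max_length=6144 this would be way above 512

print(set(tokenized_dataset_text["train"]["label"]))  # should only be {0, 1} for num_labels=2

# rerunning on CPU (or with CUDA_LAUNCH_BLOCKING=1) is supposed to surface
# the underlying IndexError instead of the opaque cudaErrorAssert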
Option 2:
If I use legal_led (nsi319/legal-led-base-16384), it fails right at the from_pretrained call, apparently because I’m loading it with PyTorch.
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--nsi319--legal-led-base-16384/snapshots/d1c0c7730126e04e5c1efd991ef78f4eb0513def/config.json
Model config LEDConfig {
"_name_or_path": "nsi319/legal-led-base-16384",
"activation_dropout": 0.0,
"activation_function": "gelu",
"architectures": [
"LEDForConditionalGeneration"
],
"attention_dropout": 0.0,
"attention_window": [
1024,
1024,
1024,
1024,
1024,
1024
],
"bos_token_id": 0,
"classif_dropout": 0.0,
"classifier_dropout": 0.0,
"d_model": 768,
"decoder_attention_heads": 12,
"decoder_ffn_dim": 3072,
"decoder_layerdrop": 0.0,
"decoder_layers": 6,
"decoder_start_token_id": 2,
"dropout": 0.1,
"encoder_attention_heads": 12,
"encoder_ffn_dim": 3072,
"encoder_layerdrop": 0.0,
"encoder_layers": 6,
"eos_token_id": 2,
"gradient_checkpointing": false,
"id2label": {
"0": "NEGATIVE",
"1": "POSITIVE"
},
"init_std": 0.02,
"is_encoder_decoder": true,
"label2id": {
"NEGATIVE": 0,
"POSITIVE": 1
},
"max_decoder_position_embeddings": 1024,
"max_encoder_position_embeddings": 16384,
"model_type": "led",
"num_hidden_layers": 6,
"pad_token_id": 1,
"transformers_version": "4.26.1",
"use_cache": true,
"vocab_size": 50265
}
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<ipython-input-52-a90dd8e88172> in <module>
7
8
----> 9 model = AutoModelForSequenceClassification.from_pretrained(
10 checkpoint,
11 num_labels=2,
1 frames
/usr/local/lib/python3.9/dist-packages/transformers/modeling_utils.py in from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
2251 )
2252 else:
-> 2253 raise EnvironmentError(
2254 f"{pretrained_model_name_or_path} does not appear to have a file named {WEIGHTS_NAME},"
2255 f" {TF2_WEIGHTS_NAME}, {TF_WEIGHTS_NAME} or {FLAX_WEIGHTS_NAME}."
OSError: nsi319/legal-led-base-16384 does not appear to have a file named pytorch_model.bin, tf_model.h5, model.ckpt or flax_model.msgpack.
I guess if I use tensorflow maybe it’d work?
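Or maybe the from_tf=True flag I left in from_pretrained is what’s sending it after the wrong weight files? A sketch of what I’d try next, assuming the repo actually ships PyTorch weights (I haven’t verified that):

from transformers import AutoModelForSequenceClassification

# hypothetical retry without from_tf=True, so it looks for pytorch_model.bin
model = AutoModelForSequenceClassification.from_pretrained(
    "nsi319/legal-led-base-16384",
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
)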
Option 3:
If I use led_base (“allenai/led-base-16384”), it also fails at the from_pretrained call:
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--allenai--led-base-16384/snapshots/38335783885b338d93791936c54bb4be46bebed9/config.json
Model config LEDConfig {
"_name_or_path": "allenai/led-base-16384",
"activation_dropout": 0.0,
"activation_function": "gelu",
"architectures": [
"LEDForConditionalGeneration"
],
"attention_dropout": 0.0,
"attention_window": [
1024,
1024,
1024,
1024,
1024,
1024
],
"bos_token_id": 0,
"classif_dropout": 0.0,
"classifier_dropout": 0.0,
"d_model": 768,
"decoder_attention_heads": 12,
"decoder_ffn_dim": 3072,
"decoder_layerdrop": 0.0,
"decoder_layers": 6,
"decoder_start_token_id": 2,
"dropout": 0.1,
"encoder_attention_heads": 12,
"encoder_ffn_dim": 3072,
"encoder_layerdrop": 0.0,
"encoder_layers": 6,
"eos_token_id": 2,
"gradient_checkpointing": false,
"id2label": {
"0": "NEGATIVE",
"1": "POSITIVE"
},
"init_std": 0.02,
"is_encoder_decoder": true,
"label2id": {
"NEGATIVE": 0,
"POSITIVE": 1
},
"max_decoder_position_embeddings": 1024,
"max_encoder_position_embeddings": 16384,
"model_type": "led",
"num_hidden_layers": 6,
"pad_token_id": 1,
"transformers_version": "4.26.1",
"use_cache": true,
"vocab_size": 50265
}
loading weights file tf_model.h5 from cache at /root/.cache/huggingface/hub/models--allenai--led-base-16384/snapshots/38335783885b338d93791936c54bb4be46bebed9/tf_model.h5
/usr/local/lib/python3.9/dist-packages/transformers/models/led/modeling_led.py:2533: FutureWarning: The `transformers.LEDForSequenceClassification` class is deprecated and will be removed in version 5 of Transformers. No actual method were provided in the original paper on how to perfom sequence classification.
warnings.warn(
Loading TensorFlow weights from /root/.cache/huggingface/hub/models--allenai--led-base-16384/snapshots/38335783885b338d93791936c54bb4be46bebed9/tf_model.h5
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-51-4c038c80ab49> in <module>
7
8
----> 9 model = AutoModelForSequenceClassification.from_pretrained(
10 checkpoint,
11 num_labels=2,
3 frames
/usr/local/lib/python3.9/dist-packages/transformers/utils/import_utils.py in __getattr__(self, name)
1101 value = getattr(module, name)
1102 else:
-> 1103 raise AttributeError(f"module {self.__name__} has no attribute {name}")
1104
1105 setattr(self, name, value)
AttributeError: module transformers has no attribute TFLEDForSequenceClassification
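Given the deprecation warning above about LEDForSequenceClassification (and the missing TF class), I’m wondering whether an encoder-only long-context model would be a better fit than LED here. A sketch of the alternative I have in mind, assuming allenai/longformer-base-4096 loads cleanly with the auto class (untested on my end):

from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "allenai/longformer-base-4096"  # encoder-only, handles up to 4096 tokens
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
)
# as far as I understand, the Longformer classification head puts global
# attention on the <s> token by default, so no extra attention mask setup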
Anyway, I’m very lost at this point and not sure what to do.
I need a tokenizer that can handle long input sequences, but I’m not sure whether the model itself also needs to be trained on longer sequences. I don’t know what happens if I feed longer token sequences into a model that wasn’t trained as a Longformer, or whether that can even work.
As for the models themselves, the ones that are supposed to work just give me errors, and the one that does work for shorter inputs (“nlpaueb/legal-bert-small-uncased”) doesn’t seem to handle the longer token sequences.
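The only fallback I can think of, if the long-context models keep fighting me, is to keep legal-bert and chunk each document into overlapping 512-token windows, then pool the per-chunk predictions. A rough sketch of what I mean (classify_long_text is my own made-up helper, and I haven’t validated this approach):

import torch

def classify_long_text(text, model, tokenizer, stride=256, max_len=512):
    # split the document into overlapping max_len-token windows
    enc = tokenizer(text, return_overflowing_tokens=True, stride=stride,
                    max_length=max_len, truncation=True, padding="max_length",
                    return_tensors="pt")
    with torch.no_grad():
        out = model(input_ids=enc["input_ids"],
                    attention_mask=enc["attention_mask"])
    # average the per-chunk logits and take the argmax as the document label
    return int(out.logits.mean(dim=0).argmax())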
TL;DR: How do I train text classification on long sequences? What am I doing wrong?