Hi all,
I am implementing a custom neural network for hierarchical classification problems using the transformers library. I am having trouble preparing the dataset and customizing my Trainer.
The datasets I have at hand are loaded from CSV; the columns label1, label2, label3, and label4 refer to the four levels of the classification hierarchy.
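For context, this is roughly how I load and tokenize them; the file names and the text column name are placeholders for my actual data:

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Placeholder file names; each CSV has a text column plus label1..label4
raw = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

tokenized = raw.map(tokenize, batched=True)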
The Trainer instance is built by this function:
from transformers import TrainingArguments, EarlyStoppingCallback

def build_trainer(model, tokenizer, train_dataset, eval_dataset, output_dir,
                  evaluation_strategy, learning_rate=3e-5, batch_size=16,
                  num_train_epochs=2, weight_decay=0.01, early_stopping_patience=2,
                  save_steps=1000):
    args = TrainingArguments(
        output_dir,
        evaluation_strategy=evaluation_strategy,
        learning_rate=learning_rate,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        num_train_epochs=num_train_epochs,
        weight_decay=weight_decay,
        save_strategy=evaluation_strategy,
        logging_strategy=evaluation_strategy,
        save_steps=save_steps,
        logging_steps=save_steps,
        logging_dir=f"{output_dir}_log",
        load_best_model_at_end=True,
        label_names=["label1", "label2", "label3", "label4"],
    )
    return MultiHeadsTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        # data_collator=default_data_collator,
        tokenizer=tokenizer,
        callbacks=[EarlyStoppingCallback(
            early_stopping_patience=early_stopping_patience)],
    )
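And this is roughly how I call it; the model object, tokenized datasets, and output path are placeholders for my actual ones:

trainer = build_trainer(
    model=model,                       # my DistilBERT encoder with four classification heads
    tokenizer=tokenizer,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    output_dir="multi_heads_model",
    evaluation_strategy="epoch",
)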
The custom Trainer:
from transformers import Trainer

class MultiHeadsTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        # TODO: compute the combined loss over the four label levels
        print(inputs)
Any ideas about how to write the custom compute_loss() function for hierarchical classification?
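For reference, here is the rough direction I am considering, assuming my model's forward() returns a dict with one logits tensor per level (the logits1 … logits4 keys are hypothetical names from my own model, not anything standard) and simply summing the per-level cross-entropy losses:

import torch.nn as nn
from transformers import Trainer

class MultiHeadsTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        # Pop the four label columns so only real model inputs get forwarded
        labels = {name: inputs.pop(name) for name in ("label1", "label2", "label3", "label4")}
        outputs = model(**inputs)
        loss_fct = nn.CrossEntropyLoss()
        # Sum the cross-entropy losses of the four heads (equal weighting for now)
        loss = sum(
            loss_fct(outputs[f"logits{i}"], labels[f"label{i}"])
            for i in range(1, 5)
        )
        return (loss, outputs) if return_outputs else loss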
The error occurs when I build the Trainer and try to run Trainer.train():
Using custom data configuration default-2741be2a4726c3d5
Reusing dataset csv (/home/guo/.cache/huggingface/datasets/csv/default-2741be2a4726c3d5/0.0.0)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.76ba/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 5.74ba/s]
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
0%| | 0/82 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "multiHeadsTrainer.py", line 73, in <module>
    trainer.train()
  File "/home/guo/anaconda3/envs/qanswer/lib/python3.8/site-packages/transformers/trainer.py", line 1246, in train
    for step, inputs in enumerate(epoch_iterator):
  File "/home/guo/anaconda3/envs/qanswer/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 517, in __next__
    data = self._next_data()
  File "/home/guo/anaconda3/envs/qanswer/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 557, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/guo/anaconda3/envs/qanswer/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/guo/anaconda3/envs/qanswer/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/guo/anaconda3/envs/qanswer/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1517, in __getitem__
    return self._getitem(
  File "/home/guo/anaconda3/envs/qanswer/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1509, in _getitem
    pa_subtable = query_table(self._data, key, indices=self._indices if self._indices is not None else None)
  File "/home/guo/anaconda3/envs/qanswer/lib/python3.8/site-packages/datasets/formatting/formatting.py", line 368, in query_table
    _check_valid_index_key(key, size)
  File "/home/guo/anaconda3/envs/qanswer/lib/python3.8/site-packages/datasets/formatting/formatting.py", line 311, in _check_valid_index_key
    raise IndexError(f"Invalid key: {key} is out of bounds for size {size}")
IndexError: Invalid key: 423 is out of bounds for size 0
I am new to the Trainer class; previously I just wrote the training-epoch loop myself. Any ideas or guidance would be appreciated.