Hierarchical classification network: having trouble preparing the dataset

Hi all,

I am implementing a custom neural network for hierarchical classification problems using the transformers library. I am having trouble preparing the dataset and customizing my Trainer.

The datasets I have at hand are:


label1, label2, label3, and label4 refer to the 4 levels of the classification hierarchy.
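
For context, the preparation I am attempting looks roughly like this (the file paths and the "text" column name are just placeholders; label1–label4 are the columns described above):

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Placeholder file names; my real CSVs contain a text column plus label1..label4
raw_datasets = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})

def tokenize(batch):
    # Tokenize the text column; the label columns are carried along unchanged
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

tokenized = raw_datasets.map(tokenize, batched=True)
train_dataset = tokenized["train"]
eval_dataset = tokenized["validation"]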

The Trainer instance is built by this function:

from transformers import EarlyStoppingCallback, TrainingArguments


def build_trainer(model, tokenizer, train_dataset, eval_dataset, output_dir,
                  evaluation_strategy, learning_rate=3e-5, batch_size=16,
                  num_train_epochs=2, weight_decay=0.01, early_stopping_patience=2,
                  save_steps=1000):
    args = TrainingArguments(
        output_dir,
        evaluation_strategy=evaluation_strategy,
        learning_rate=learning_rate,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        num_train_epochs=num_train_epochs,
        weight_decay=weight_decay,
        save_strategy=evaluation_strategy,
        logging_strategy=evaluation_strategy,
        save_steps=save_steps,
        logging_steps=save_steps,
        logging_dir=f"{output_dir}_log",
        load_best_model_at_end=True,
        label_names=["label1", "label2", "label3", "label4"]
    )

    return MultiHeadsTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        # data_collator=default_data_collator,
        tokenizer=tokenizer,
        callbacks=[EarlyStoppingCallback(
            early_stopping_patience=early_stopping_patience)]
    )
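
And this is roughly how I wire everything together. MyMultiHeadModel below is only a stand-in for my actual model (which I have left out here), written minimally so the snippet is self-contained:

import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class MyMultiHeadModel(nn.Module):
    # Stand-in model: a shared DistilBERT encoder with one linear head per level.
    # The numbers of classes per level are made up for illustration.
    def __init__(self, checkpoint, num_labels_per_level):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(checkpoint)
        hidden = self.encoder.config.dim  # 768 for distilbert-base-uncased
        self.heads = nn.ModuleList(nn.Linear(hidden, n) for n in num_labels_per_level)

    def forward(self, input_ids=None, attention_mask=None, **kwargs):
        hidden_states = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        pooled = hidden_states[:, 0]  # use the first token as a pooled representation
        return {"logits": [head(pooled) for head in self.heads]}

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = MyMultiHeadModel("distilbert-base-uncased", num_labels_per_level=[5, 12, 30, 80])

trainer = build_trainer(
    model, tokenizer, train_dataset, eval_dataset,
    output_dir="hierarchy-clf", evaluation_strategy="epoch",
)
trainer.train()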

The custom Trainer:

from transformers import Trainer


class MultiHeadsTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        # TODO
        print(inputs)

Any ideas about how to write the custom compute_loss() function for hierarchical classification?
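
Here is a rough sketch of what I had in mind, assuming the model returns one logits tensor per level (as a list under "logits") and the label1–label4 columns arrive in the batch. Does simply summing the per-level cross-entropy losses make sense?

import torch.nn.functional as F
from transformers import Trainer

class MultiHeadsTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        # Pop the four label columns so that only real model inputs are forwarded
        labels = [inputs.pop(name) for name in ("label1", "label2", "label3", "label4")]
        outputs = model(**inputs)
        # Assumption: the model returns {"logits": [t1, t2, t3, t4]}, one tensor
        # of shape (batch_size, num_classes_i) per hierarchy level
        logits_per_level = outputs["logits"]
        loss = sum(
            F.cross_entropy(level_logits, level_labels)
            for level_logits, level_labels in zip(logits_per_level, labels)
        )
        return (loss, outputs) if return_outputs else loss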

The error occurs when I build the Trainer and try to run Trainer.train():
Using custom data configuration default-2741be2a4726c3d5
Reusing dataset csv (/home/guo/.cache/huggingface/datasets/csv/default-2741be2a4726c3d5/0.0.0)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.76ba/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 5.74ba/s]
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight']

  • This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks…
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks…
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
0%| | 0/82 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "multiHeadsTrainer.py", line 73, in <module>
    trainer.train()
  File "/home/guo/anaconda3/envs/qanswer/lib/python3.8/site-packages/transformers/trainer.py", line 1246, in train
    for step, inputs in enumerate(epoch_iterator):
  File "/home/guo/anaconda3/envs/qanswer/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 517, in __next__
    data = self._next_data()
  File "/home/guo/anaconda3/envs/qanswer/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 557, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/guo/anaconda3/envs/qanswer/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/guo/anaconda3/envs/qanswer/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/guo/anaconda3/envs/qanswer/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1517, in __getitem__
    return self._getitem(
  File "/home/guo/anaconda3/envs/qanswer/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1509, in _getitem
    pa_subtable = query_table(self._data, key, indices=self._indices if self._indices is not None else None)
  File "/home/guo/anaconda3/envs/qanswer/lib/python3.8/site-packages/datasets/formatting/formatting.py", line 368, in query_table
    _check_valid_index_key(key, size)
  File "/home/guo/anaconda3/envs/qanswer/lib/python3.8/site-packages/datasets/formatting/formatting.py", line 311, in _check_valid_index_key
    raise IndexError(f"Invalid key: {key} is out of bounds for size {size}")
IndexError: Invalid key: 423 is out of bounds for size 0

I am new to the Trainer class; before this I just wrote the training epoch function myself. Any ideas or guidance would be appreciated.
