Hierarchical classification network: having trouble preparing the dataset

Hi all,

I am implementing a custom neural network for hierarchical classification problems using the transformers library. I am having trouble preparing the dataset and customizing my Trainer.

The datasets I have at hand are:


label1, label2, label3, and label4 refer to the 4 levels of the classification hierarchy.
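
For context, the preparation I am attempting looks roughly like this (the file paths and the "text" column name are just placeholders; label1–label4 are the columns described above):

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Placeholder file names; my real CSVs contain a text column plus label1..label4
raw_datasets = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})

def tokenize(batch):
    # Tokenize the text column; the label columns are carried along unchanged
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

tokenized = raw_datasets.map(tokenize, batched=True)
train_dataset = tokenized["train"]
eval_dataset = tokenized["validation"]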

The Trainer instance is built by this function:

from transformers import EarlyStoppingCallback, TrainingArguments


def build_trainer(model, tokenizer, train_dataset, eval_dataset, output_dir,
                  evaluation_strategy, learning_rate=3e-5, batch_size=16,
                  num_train_epochs=2, weight_decay=0.01, early_stopping_patience=2,
                  save_steps=1000):
    args = TrainingArguments(
        output_dir,
        evaluation_strategy=evaluation_strategy,
        learning_rate=learning_rate,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        num_train_epochs=num_train_epochs,
        weight_decay=weight_decay,
        save_strategy=evaluation_strategy,
        logging_strategy=evaluation_strategy,
        save_steps=save_steps,
        logging_steps=save_steps,
        logging_dir=f"{output_dir}_log",
        load_best_model_at_end=True,
        label_names=["label1", "label2", "label3", "label4"]
    )

    return MultiHeadsTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        # data_collator=default_data_collator,
        tokenizer=tokenizer,
        callbacks=[EarlyStoppingCallback(
            early_stopping_patience=early_stopping_patience)]
    )
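
And this is roughly how I wire everything together. MyMultiHeadModel below is only a stand-in for my actual model (which I have left out here), written minimally so the snippet is self-contained:

import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class MyMultiHeadModel(nn.Module):
    # Stand-in model: a shared DistilBERT encoder with one linear head per level.
    # The numbers of classes per level are made up for illustration.
    def __init__(self, checkpoint, num_labels_per_level):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(checkpoint)
        hidden = self.encoder.config.dim  # 768 for distilbert-base-uncased
        self.heads = nn.ModuleList(nn.Linear(hidden, n) for n in num_labels_per_level)

    def forward(self, input_ids=None, attention_mask=None, **kwargs):
        hidden_states = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        pooled = hidden_states[:, 0]  # use the first token as a pooled representation
        return {"logits": [head(pooled) for head in self.heads]}

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = MyMultiHeadModel("distilbert-base-uncased", num_labels_per_level=[5, 12, 30, 80])

trainer = build_trainer(
    model, tokenizer, train_dataset, eval_dataset,
    output_dir="hierarchy-clf", evaluation_strategy="epoch",
)
trainer.train()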

The custom Trainer:

from transformers import Trainer


class MultiHeadsTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        # TODO
        print(inputs)

Any ideas about how to write the custom compute_loss() function for hierarchical classification?
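
Here is a rough sketch of what I had in mind, assuming the model returns one logits tensor per level (as a list under "logits") and the label1–label4 columns arrive in the batch. Does simply summing the per-level cross-entropy losses make sense?

import torch.nn.functional as F
from transformers import Trainer

class MultiHeadsTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        # Pop the four label columns so that only real model inputs are forwarded
        labels = [inputs.pop(name) for name in ("label1", "label2", "label3", "label4")]
        outputs = model(**inputs)
        # Assumption: the model returns {"logits": [t1, t2, t3, t4]}, one tensor
        # of shape (batch_size, num_classes_i) per hierarchy level
        logits_per_level = outputs["logits"]
        loss = sum(
            F.cross_entropy(level_logits, level_labels)
            for level_logits, level_labels in zip(logits_per_level, labels)
        )
        return (loss, outputs) if return_outputs else loss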

The error occurs when I build the Trainer and try to run Trainer.train():
Using custom data configuration default-2741be2a4726c3d5
Reusing dataset csv (/home/guo/.cache/huggingface/datasets/csv/default-2741be2a4726c3d5/0.0.0)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.76ba/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 5.74ba/s]
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight']

  • This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks…
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks…
To disable this warning, you can either:
- Avoid using tokenizers before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
0%| | 0/82 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "multiHeadsTrainer.py", line 73, in <module>
    trainer.train()
  File "/home/guo/anaconda3/envs/qanswer/lib/python3.8/site-packages/transformers/trainer.py", line 1246, in train
    for step, inputs in enumerate(epoch_iterator):
  File "/home/guo/anaconda3/envs/qanswer/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 517, in __next__
    data = self._next_data()
  File "/home/guo/anaconda3/envs/qanswer/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 557, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/guo/anaconda3/envs/qanswer/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/guo/anaconda3/envs/qanswer/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/guo/anaconda3/envs/qanswer/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1517, in __getitem__
    return self._getitem(
  File "/home/guo/anaconda3/envs/qanswer/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1509, in _getitem
    pa_subtable = query_table(self._data, key, indices=self._indices if self._indices is not None else None)
  File "/home/guo/anaconda3/envs/qanswer/lib/python3.8/site-packages/datasets/formatting/formatting.py", line 368, in query_table
    _check_valid_index_key(key, size)
  File "/home/guo/anaconda3/envs/qanswer/lib/python3.8/site-packages/datasets/formatting/formatting.py", line 311, in _check_valid_index_key
    raise IndexError(f"Invalid key: {key} is out of bounds for size {size}")
IndexError: Invalid key: 423 is out of bounds for size 0

I am new to the Trainer class; before this I just wrote the training epoch function myself. Any ideas or guidance would be appreciated.
