Using datacollator for multi-task training

nid989 · January 22, 2022, 6:12pm

Hey everyone,
Actually, I am trying to accommodate multiple subtasks including STS-b and NER into a multi-task model, however am unable to issue the tokens from the conll dataset into the DataCollator. Can anyone help me with this? The code snippet is shown below.

class NLPDataCollator(DefaultDataCollator):
    """
    Extending the existing DataCollator to work with NLP dataset batches
    """
    def collate_batch(self, features: List[Union[InputDataClass, Dict]]) -> Dict[str, torch.Tensor]:
        first = features[0]
        if isinstance(first, dict):
          # NLP data sets current works presents features as lists of dictionary
          # (one per example), so we  will adapt the collate_batch logic for that
          if "labels" in first and first["labels"] is not None:
              if first["labels"].dtype == torch.int64:
                  labels = torch.tensor([f["labels"] for f in features], dtype=torch.long)
              else:
                  labels = torch.tensor([f["labels"] for f in features], dtype=torch.float)
              batch = {"labels": labels}
          for k, v in first.items():
              if k != "labels" and v is not None and not isinstance(v, str):
                  batch[k] = torch.stack([f[k] for f in features])
          return batch
        else:
          # otherwise, revert to using the default collate_batch
          return DefaultDataCollator().collate_batch(features)

nid989 · January 22, 2022, 6:14pm

This error is displayed when I am trying to accommodate the NER dataset specifically entities extracted in the fashion show here

sgugger · January 24, 2022, 2:10pm

It looks like you’re trying to use a very old API for data collators, which was deprecated in v3 and removed in v4. DefaultDataCollator() has not collate_batch method, you call it directly on the features you want to batch, and you should implement the __call__ method in your custom data collator.

Topic		Replies	Views
Multilabel token classification (dataloader issues) 🤗Datasets	0	178	April 20, 2024
Using Datasets, DataCollators and DataLoaders to create an NLP data pipeline 🤗Datasets	1	4999	June 21, 2023
How to use Data Collator? Beginners	1	2365	April 26, 2021
Issues with Data Collator and Tokenizing with NER Datasets 🤗Tokenizers	1	2502	May 9, 2022
How to use a data collator when dealing with text and images 🤗Transformers	0	1115	March 6, 2023

Using datacollator for multi-task training

Related topics