Multi-input tags and multi-label output for token classification using a pretrained BERT model

Hi, suppose I have a Hugging Face dataset where tokens are tagged with POS and possibly other information, so that each entry looks like (token, pos, lemma, fine_pos, wordnet.word), and I want to produce for each token t multiple labels like t1, t2. Is there any tutorial or example of how to do this?
I apologize if this question has been answered before, but I’ve been searching for an answer using transformers and pretrained BERT and have not managed to find one.
Thank you.


Hey!

It looks like you’re trying to assign multiple labels (e.g., t1, t2) to each token in a Hugging Face dataset, alongside POS tags. You can achieve this using Hugging Face’s transformers library with a pretrained model like BERT. Here’s an approach to guide you:

  1. Dataset Preparation:

    • Make sure your dataset is in a format where each token is associated with its labels (e.g., POS tag, lemma, and any other tags).
    • You can use Hugging Face’s datasets library to load and manipulate your dataset. If you’re working with token-level labels, your dataset might look like this:
      {
          'tokens': ['I', 'am', 'happy'],
          'labels': ['PRON', 'VERB', 'ADJ']
      }
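    • If each token also carries a lemma, a fine-grained POS, and a WordNet tag, as in your example, you can keep one column per label type. A minimal sketch with the datasets library (the column names and the fine_pos/lemma values are just illustrative):
      from datasets import Dataset
      
      # One column per label type; each row is one sentence (illustrative values)
      raw_dataset = Dataset.from_dict({
          'tokens':   [['I', 'am', 'happy']],
          'pos':      [['PRON', 'VERB', 'ADJ']],
          'fine_pos': [['PRP', 'VBP', 'JJ']],
          'lemma':    [['I', 'be', 'happy']],
      })
      print(raw_dataset[0])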
      
  2. Tokenization:

    • Use a fast tokenizer such as BertTokenizerFast (or AutoTokenizer) from the transformers library to split the text into subword tokens and match each token with its corresponding label.
    • Keep in mind that tokenization can split words into subwords, so you need to handle this by ensuring that each subword receives the correct label.
    • Here’s an example of tokenizing and aligning labels:
      from transformers import BertTokenizerFast
      
      # word_ids() is only available on fast tokenizers, so use BertTokenizerFast
      tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
      
      # Map string labels to integer IDs so they can be used for training
      label2id = {'PRON': 0, 'VERB': 1, 'ADJ': 2}
      
      def align_labels_with_tokens(tokens, labels):
          encoding = tokenizer(tokens, truncation=True, is_split_into_words=True)
          word_ids = encoding.word_ids()  # Maps each subword token back to its word index
      
          # Special tokens ([CLS], [SEP]) get -100 so the loss ignores them;
          # subwords inherit the label of the word they belong to
          aligned_labels = [-100 if word_id is None else label2id[labels[word_id]] for word_id in word_ids]
          return encoding, aligned_labels
      
      tokens = ['I', 'am', 'happy']
      labels = ['PRON', 'VERB', 'ADJ']
      
      encoding, aligned_labels = align_labels_with_tokens(tokens, labels)
      print(aligned_labels)
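    • If each token carries several label columns (e.g., pos, fine_pos, a WordNet tag), you can reuse the same word_ids to align every column at once. A minimal sketch (the function name and the dict-of-columns input are illustrative assumptions):
      def align_multiple_label_columns(tokens, label_columns):
          # label_columns: dict mapping column name -> per-word labels,
          # e.g. {'pos': ['PRON', 'VERB', 'ADJ'], 'fine_pos': ['PRP', 'VBP', 'JJ']}
          encoding = tokenizer(tokens, truncation=True, is_split_into_words=True)
          word_ids = encoding.word_ids()
      
          # Each column still needs its own label-to-ID mapping before training
          aligned = {name: [-100 if wid is None else labels[wid] for wid in word_ids]
                     for name, labels in label_columns.items()}
          return encoding, aligned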
      
  3. Using Pretrained BERT:

    • You can use a pre-trained BERT model for token classification (e.g., for named entity recognition or POS tagging) and fine-tune it with your labeled dataset.
    • Example for token classification:
      from datasets import Dataset
      from transformers import BertForTokenClassification, Trainer, TrainingArguments
      
      # num_labels must match the number of distinct labels in label2id
      model = BertForTokenClassification.from_pretrained('bert-base-uncased', num_labels=3)
      
      # Prepare your dataset for the Trainer (wrap the single example in lists;
      # a real dataset would have one row per sentence)
      train_dataset = Dataset.from_dict({
          'input_ids': [encoding['input_ids']],
          'attention_mask': [encoding['attention_mask']],
          'labels': [aligned_labels]
      })
      
      training_args = TrainingArguments(
          output_dir='./results',
          num_train_epochs=3,
          per_device_train_batch_size=8,
          logging_dir='./logs',
      )
      
      trainer = Trainer(
          model=model,
          args=training_args,
          train_dataset=train_dataset,
      )
      
      trainer.train()
      
  4. Multi-Labeling:

    • If you want to produce multiple labels for each token (e.g., t1, t2), extend the label alignment step so that each token gets a vector of labels instead of a single label ID, and extend the model head to predict several labels per token.
    • A common way to do this is multi-label classification: apply a sigmoid to each logit and use a binary cross-entropy loss (e.g., BCEWithLogitsLoss) so that several labels can be active for the same token. This only requires a small change to the model head and loss function; a sketch is shown below.
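    • Here is a minimal sketch of such a model (the class name, the label_mask argument, and the multi-hot label format are assumptions for illustration, not a built-in transformers API):
      import torch
      from torch import nn
      from transformers import BertModel
      
      class MultiLabelTokenClassifier(nn.Module):
          # Illustrative sketch: BERT encoder + per-token multi-label head
          def __init__(self, model_name='bert-base-uncased', num_labels=5):
              super().__init__()
              self.bert = BertModel.from_pretrained(model_name)
              self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)
              self.loss_fn = nn.BCEWithLogitsLoss()
      
          def forward(self, input_ids, attention_mask, labels=None, label_mask=None):
              # labels: float tensor (batch, seq_len, num_labels) with 0/1 entries
              # label_mask: bool tensor (batch, seq_len) marking real (non-special) tokens
              outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
              logits = self.classifier(outputs.last_hidden_state)
              loss = None
              if labels is not None:
                  if label_mask is not None:
                      # Compute the loss only on real tokens, skipping [CLS]/[SEP]/padding
                      loss = self.loss_fn(logits[label_mask], labels[label_mask])
                  else:
                      loss = self.loss_fn(logits, labels)
              return {'loss': loss, 'logits': logits}
    • At inference time, apply torch.sigmoid to the logits and threshold them (e.g., at 0.5) to get the set of active labels for each token.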
  5. Resources:

    • Hugging Face provides a Token Classification tutorial which is a good starting point.
    • Look into the datasets and evaluate libraries for managing and evaluating your labeled dataset.

With this approach, you can fine-tune a pretrained BERT model to output multiple labels per token on your specific task. If you’re still having trouble, let me know, and I can help clarify further!
