"KeyError: 'text' text_column = self.column_mapping["text"]"

The 500 error likely stems from a problem in the preprocessing, the model setup, or the interaction with your environment. Here’s a checklist of suggestions to debug and resolve the issue:


1. Dataset Format

  • Check tokenization:
    • Ensure that the tokens and tags in the CSV files are properly formatted as lists.
    • Double-check that the quoting (") around the token and tag lists does not interfere with parsing them back into Python lists.
  • Recommended Fix: Instead of storing lists as strings in CSV, store them as lists in JSONL (JSON Lines) format. Example:

```json
{"tokens": ["ist", "lebt", "Herr", "Berlin", "030", "Siemens", ".", "E-Mail-Adresse", "Telefonnummer"], "tags": ["O", "O", "O", "LOCATION", "PHONE_NUMBER", "ORGANIZATION", "O", "O", "O", "PHONE_NUMBER"]}
```

Use JSONL for better compatibility with libraries like Hugging Face’s datasets.
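If the data currently lives in the CSVs, a small conversion script can produce those JSONL files. This is a minimal sketch, assuming the tokens and tags columns hold Python-style list strings as in your example; the csv_to_jsonl helper and the .jsonl file names are just illustrative:

```python
import ast
import json

import pandas as pd


def csv_to_jsonl(csv_path, jsonl_path):
    """Rewrite a CSV with stringified token/tag lists as JSON Lines."""
    df = pd.read_csv(csv_path)
    with open(jsonl_path, "w", encoding="utf-8") as f:
        for _, row in df.iterrows():
            record = {
                "tokens": ast.literal_eval(row["tokens"]),  # "['ist', 'lebt', ...]" -> list
                "tags": ast.literal_eval(row["tags"]),
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


csv_to_jsonl("train.csv", "train.jsonl")
csv_to_jsonl("validate.csv", "validate.jsonl")
```

The resulting files then load with load_dataset("json", data_files={"train": "train.jsonl", "validation": "validate.jsonl"}), and the tokens and tags arrive as real lists with no extra parsing step.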


2. Loading the Dataset

If you’re using Hugging Face’s datasets library, ensure the dataset is correctly loaded. For a CSV:

```python
from datasets import load_dataset

data_files = {"train": "train.csv", "validation": "validate.csv"}
dataset = load_dataset("csv", data_files=data_files)
```
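Before going further, it is worth inspecting what was actually loaded. A quick check along these lines (assuming the split and column names above) shows whether the list columns came back as plain strings:

```python
print(dataset)                          # splits and number of rows
print(dataset["train"].column_names)    # should include "tokens" and "tags"
first = dataset["train"][0]
print(type(first["tokens"]), first["tokens"])  # a str here means the lists were read as text
```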

If your tokens and tags are strings, you may need to parse them:

```python
import ast


def preprocess_data(example):
    # Convert the stringified lists back into Python lists
    # (ast.literal_eval is a safer alternative to eval here)
    example["tokens"] = ast.literal_eval(example["tokens"])
    example["tags"] = ast.literal_eval(example["tags"])
    return example


dataset = dataset.map(preprocess_data)
```
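After the map call, both fields should be real Python lists; a quick sanity check (purely illustrative):

```python
sample = dataset["train"][0]
print(type(sample["tokens"]), sample["tokens"][:3])  # expect <class 'list'>
print(type(sample["tags"]), sample["tags"][:3])
```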