Token Classification, KeyError: 'text' I've tried every combination of data, .csv, .jsonl you can imagine


Problem, context, root cause

  • You are training a Token Classification (NER/POS) project in AutoTrain and the UI throws KeyError: 'text' during dataset preparation. The traceback shows text_column = self.column_mapping["text"] inside dataset.py, so the UI/route is trying to read a text key even though token tagging data uses tokens/tags. This is a known/replicated bug in the UI path; it has a public issue with the exact stack and multiple confirmations. (GitHub)
  • Correct Token Classification data ≠ Text Classification data. Token Classification expects per-token supervision: two aligned columns, lists of tokens and lists of tags, one list per example. In CSV those lists must be stringified; in JSONL they are regular JSON arrays. The official task page shows both CSV and JSONL templates and explicitly states that CSV lists must be stringified. (Hugging Face)
  • In the UI you must still provide a “column mapping” dictionary. For Token Classification that mapping is {"text": "tokens", "label": "tags"}—the “text” key in the mapping points to your tokens column, and “label” points to your tags column. The docs say exactly this and warn that token/tag lists must be lists of strings (stringified in CSV). The UI bug arises because some paths still dereference column_mapping["text"] assuming a raw text column. (Hugging Face)

What to do now (quick wins, in order)

  1. Use the correct data template (CSV or JSONL)
  • CSV (stringified lists):

    tokens,tags
    "['John','lives','in','Berlin']","['B-PER','O','O','B-LOC']"
    "['Acme','Corp','hired','Mary']","['B-ORG','I-ORG','O','B-PER']"
    
  • JSONL (arrays):

    {"tokens": ["John","lives","in","Berlin"], "tags": ["B-PER","O","O","B-LOC"]}
    {"tokens": ["Acme","Corp","hired","Mary"], "tags": ["B-ORG","I-ORG","O","B-PER"]}
    

The official page shows the same shapes and examples, including the CSV stringification requirement. (Hugging Face)
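If you generate the CSV programmatically, the standard library's csv module plus json.dumps handles the quoting for you (a minimal sketch; the file name and rows are illustrative, taken from the template above):

```python
import csv
import json

rows = [
    (["John", "lives", "in", "Berlin"], ["B-PER", "O", "O", "B-LOC"]),
    (["Acme", "Corp", "hired", "Mary"], ["B-ORG", "I-ORG", "O", "B-PER"]),
]

with open("train.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["tokens", "tags"])
    for tokens, tags in rows:
        # json.dumps produces a valid stringified list; csv.writer adds the
        # outer quoting, so brackets and commas survive the round trip
        writer.writerow([json.dumps(tokens), json.dumps(tags)])
```

Reading a cell back with json.loads recovers the original list, which is a quick way to verify the file before uploading.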

  2. Map columns correctly in the UI
  • Set the column mapping to exactly:

    {"text": "tokens", "label": "tags"}
    

This is the documented Token Classification mapping. It tells AutoTrain “my dataset column named tokens should be treated as the task’s logical text (token list), and tags is the task’s logical label.” (Hugging Face)

  3. If the UI still raises KeyError: 'text'
  • Minimal rename workaround that unblocks the buggy route: rename your token column to text and keep the tags column as tags; then set the mapping {"text": "text", "label": "tags"}. This appeases the hardcoded ["text"] lookup while preserving token-tag semantics. The public issue and thread match your stack and confirm it’s an interface bug, not a data problem. (GitHub)
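For JSONL data the rename workaround is a one-key rewrite per record. A sketch using only the standard library (the function name and file paths are illustrative):

```python
import json

def rename_tokens_column(src_path: str, dst_path: str) -> None:
    """Rewrite a token-classification JSONL file, renaming 'tokens' -> 'text'.

    This appeases the UI route that hardcodes column_mapping["text"];
    the 'tags' column is left untouched.
    """
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            rec = json.loads(line)
            rec["text"] = rec.pop("tokens")
            dst.write(json.dumps(rec) + "\n")
```

Run it once per split, then set the mapping {"text": "text", "label": "tags"} as described above.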
  4. Robust alternative: use the CLI (bypasses the problematic UI path)
  • Prefer the documented subcommand for this task plus explicit column flags:

    # install/update
    pip install -U autotrain-advanced  # docs/pypi: see params + quickstart
    
    # train a token-classification project from local data
    autotrain token-classification --train \
      --project-name "my-ner" \
      --data-path ./data \
      --train-split train \
      --valid-split valid \
      --tokens-column tokens \
      --tags-column tags \
      --model bert-base-uncased
    

The quickstart lists the token-classification subcommand; the parameter reference defines --tokens-column and --tags-column (defaults tokens / tags). The forum reproductions show the same params working with the Python API. (Hugging Face)

Background you actually need (why the format matters)

  • Token Classification assigns a label to each token, typically in BIO/IOB2 schemes (e.g., B-PER, I-PER, O). Your tokens and tags lists must be aligned 1:1 for every row. The Transformers task guide explains the labeling scheme and the usual preprocessing caveats (subword alignment, -100 for ignored tokens). (Hugging Face)
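The subword-alignment caveat can be sketched without pulling in transformers: given the word index of each subword (what a fast tokenizer exposes as word_ids()), the word's label is kept on its first subword and every other position gets -100 so the loss ignores it. The word_ids list below is hypothetical; in practice it comes from the tokenizer.

```python
IGNORE = -100  # label index that cross-entropy loss implementations skip

def align_labels(word_ids, word_labels):
    """Map word-level labels onto subword positions.

    word_ids: per-subword word index, None for special tokens
    (mirrors transformers' BatchEncoding.word_ids(), but this
    function itself is plain Python).
    """
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None:            # [CLS]/[SEP]/padding
            aligned.append(IGNORE)
        elif wid != prev:          # first subword of a word: keep its label
            aligned.append(word_labels[wid])
        else:                      # continuation subwords: ignored in the loss
            aligned.append(IGNORE)
        prev = wid
    return aligned

# Word 3 ("Berlin", say) split into two subwords -> second piece ignored
print(align_labels([None, 0, 1, 2, 3, 3, None], [1, 0, 0, 2]))
```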

Fast self-checks that catch most failures

  • Shape and parity: every row must satisfy len(tokens) == len(tags). (Hugging Face)
  • CSV quoting: in CSV, each cell is a single stringified list (use json.dumps to avoid quoting mistakes). (Hugging Face)
  • Mapping: in UI, {"text": "tokens", "label": "tags"}; for the rename workaround, {"text": "text", "label": "tags"}. (Hugging Face)
  • Try JSONL when in doubt: it avoids CSV quoting pitfalls; the official task page shows JSONL with arrays. (Hugging Face)

Minimal validation/convert snippets (safe defaults)

  • Validate and convert JSONL → CSV (stringified lists) before uploading to the UI:

    # refs:
    # - Token Classification data format & CSV stringification: https://huggingface.co/docs/autotrain/en/tasks/token_classification
    # - Column mapping for token classification: https://huggingface.co/docs/autotrain/col_map
    import json, pandas as pd
    
    # read jsonl
    df = pd.read_json("train.jsonl", lines=True)
    
    # basic checks
    assert {"tokens","tags"} <= set(df.columns)
    assert (df["tokens"].apply(len) == df["tags"].apply(len)).all()
    
    # stringify lists for CSV
    df_out = df.copy()
    df_out["tokens"] = df_out["tokens"].apply(json.dumps)
    df_out["tags"]   = df_out["tags"].apply(json.dumps)
    df_out.to_csv("train.csv", index=False)
    

    Use the same process for your validation split; keep column names identical across files. The official page stresses consistent column names and shows chunking large CSVs with pandas. (Hugging Face)

Common pitfalls and how to avoid them

  • Mismatched column mapping in the UI (e.g., mapping text → a raw sentence column for a token task). Use the token/tag mapping as documented. (Hugging Face)
  • CSV lists not actually stringified, or quotes/brackets mangled by Excel. Prefer JSONL or stringify with json.dumps. (Hugging Face)
  • Unequal list lengths between tokens and tags; even one bad row will break training. (Hugging Face)
  • Expecting integer tag IDs; AutoTrain’s CSV examples use string tag names. Keep tag lists as strings unless you know the trainer expects IDs. The Transformers recipe shows how label IDs are typically resolved from names. (Hugging Face)
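If you do need integer IDs later (e.g., training with transformers directly), the usual recipe is to derive label2id from the sorted set of tag names rather than hand-writing it, so the mapping is deterministic across splits. A sketch using the tag names from the templates above:

```python
import itertools

tag_rows = [
    ["B-PER", "O", "O", "B-LOC"],
    ["B-ORG", "I-ORG", "O", "B-PER"],
]

# Sort for a stable, reproducible mapping across runs and splits
labels = sorted(set(itertools.chain.from_iterable(tag_rows)))
label2id = {tag: i for i, tag in enumerate(labels)}
id2label = {i: tag for tag, i in label2id.items()}

print(label2id)
```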

Working end-to-end recipes (concise)
A) UI-only path (fastest when it works)

  1. Upload train.csv/valid.csv using the exact CSV template above.
  2. Set mapping {"text": "tokens", "label": "tags"}.
  3. If you still hit KeyError: 'text', quickly rename tokens → text in your files and set {"text": "text", "label": "tags"}. (Hugging Face)

B) CLI path (avoids the UI bug)

  1. Keep JSONL or CSV as shown.

  2. Run:

    autotrain token-classification --train \
      --data-path ./data \
      --train-split train \
      --valid-split valid \
      --tokens-column tokens \
      --tags-column tags
    

    Subcommand, splits, and column flags are all documented. (Hugging Face)

Short, curated references (why each is useful)

  • Data format + parameters (authoritative):
    • AutoTrain Token Classification task page: CSV/JSONL templates, stringified-lists requirement, and tokens_column/tags_column parameters. (Hugging Face)
    • Column Mapping guide: exact mapping for token tasks {"text":"tokens","label":"tags"}. (Hugging Face)
  • Bug confirmation (UI route reads text):
    • GitHub issue with your exact KeyError: 'text' stack. (GitHub)
    • HF Forums thread showing same error and a working Python/params workaround. (Hugging Face Forums)
  • Background on NER/token-labeling:
    • Transformers Token Classification guide: BIO tags, alignment, preprocessing. (Hugging Face)

TL;DR

  • Token Classification data must be two aligned columns/lists: tokens and tags. In CSV they must be stringified; in JSONL they are arrays. Map columns in the UI as {"text":"tokens", "label":"tags"}. If the UI still explodes with KeyError: 'text', either (a) temporarily rename your token column to text and map {"text":"text","label":"tags"}, or (b) train via the CLI and pass --tokens-column/--tags-column. All of this is straight from the official docs and the public bug report that matches your traceback. (Hugging Face)