This case?
Problem, context, root cause
- You are training a Token Classification (NER/POS) project in AutoTrain and the UI throws `KeyError: 'text'` during dataset preparation. The traceback shows `text_column = self.column_mapping["text"]` inside `dataset.py`, so the UI/route is trying to read a `text` key even though token tagging data uses `tokens`/`tags`. This is a known, replicated bug in the UI path; it has a public issue with the exact stack and multiple confirmations. (GitHub)
- Correct Token Classification data ≠ Text Classification data. Token Classification expects per-token supervision: two aligned columns, lists of tokens and lists of tags, one list per example. In CSV those lists must be stringified; in JSONL they are regular JSON arrays. The official task page shows both CSV and JSONL templates and explicitly states that CSV lists must be stringified. (Hugging Face)
- In the UI you must still provide a "column mapping" dictionary. For Token Classification that mapping is `{"text": "tokens", "label": "tags"}`: the "text" key in the mapping points to your `tokens` column, and "label" points to your `tags` column. The docs say exactly this and warn that token/tag lists must be lists of strings (stringified in CSV). The UI bug arises because some paths still dereference `column_mapping["text"]` assuming a raw `text` column. (Hugging Face)
What to do now (quick wins, in order)
- Use the correct data template (CSV or JSONL).

  CSV (stringified lists):

  ```csv
  tokens,tags
  "['John','lives','in','Berlin']","['B-PER','O','O','B-LOC']"
  "['Acme','Corp','hired','Mary']","['B-ORG','I-ORG','O','B-PER']"
  ```

  JSONL (arrays):

  ```jsonl
  {"tokens": ["John","lives","in","Berlin"], "tags": ["B-PER","O","O","B-LOC"]}
  {"tokens": ["Acme","Corp","hired","Mary"], "tags": ["B-ORG","I-ORG","O","B-PER"]}
  ```

  The official page shows the same shapes and examples, including the CSV stringification requirement. (Hugging Face)
- Map columns correctly in the UI.

  Set the column mapping to exactly:

  ```json
  {"text": "tokens", "label": "tags"}
  ```

  This is the documented Token Classification mapping. It tells AutoTrain "my dataset column named `tokens` should be treated as the task's logical text (token list), and `tags` is the task's logical label." (Hugging Face)
- If the UI still raises `KeyError: 'text'`
  - Minimal rename workaround that unblocks the buggy route: rename your token column to `text` and keep the tags column as `tags`; then set the mapping `{"text": "text", "label": "tags"}`. This appeases the hardcoded `["text"]` lookup while preserving token-tag semantics. The public issue and thread match your stack and confirm it's an interface bug, not a data problem. (GitHub)
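A throwaway sketch of that rename with pandas; the toy file written here stands in for your real `train.csv` in the stringified-list format:

```python
import pandas as pd

# toy stand-in for your real train.csv (stringified-list format from above)
pd.DataFrame({
    "tokens": ['["John","lives","in","Berlin"]'],
    "tags":   ['["B-PER","O","O","B-LOC"]'],
}).to_csv("train.csv", index=False)

# the workaround itself: rename only the token column, keep tags as-is
df = pd.read_csv("train.csv").rename(columns={"tokens": "text"})
df.to_csv("train.csv", index=False)

print(list(pd.read_csv("train.csv").columns))  # → ['text', 'tags']
```

After this, the UI mapping becomes `{"text": "text", "label": "tags"}` as described above.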
- Robust alternative: use the CLI (bypasses the problematic UI path).

  Prefer the documented subcommand for this task plus explicit column flags:

  ```bash
  # install/update
  pip install -U autotrain-advanced  # docs/pypi: see params + quickstart

  # train a token-classification project from local data
  autotrain token-classification --train \
    --project-name "my-ner" \
    --data-path ./data \
    --train-split train \
    --valid-split valid \
    --tokens-column tokens \
    --tags-column tags \
    --model bert-base-uncased
  ```

  The quickstart lists the `token-classification` subcommand; the parameter reference defines `--tokens-column` and `--tags-column` (defaults `tokens`/`tags`). The forum reproductions show the same params working with the Python API. (Hugging Face)
Background you actually need (why the format matters)
- Token Classification assigns a label to each token, typically in BIO/IOB2 schemes (e.g., `B-PER`, `I-PER`, `O`). Your `tokens` and `tags` lists must be aligned 1:1 for every row. The Transformers task guide explains the labeling scheme and the usual preprocessing caveats (subword alignment, `-100` for ignored tokens). (Hugging Face)
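To make the subword-alignment caveat concrete, here is a minimal sketch of the usual label-alignment logic, written against a plain `word_ids` list (the shape a fast tokenizer's `word_ids()` returns) so it runs without a tokenizer; the function name is illustrative:

```python
def align_labels(word_ids, tags):
    """Map per-word tags onto subword positions.

    Special tokens (word_id None) and the 2nd+ subword of a word
    get -100 so the loss function ignores them.
    """
    aligned, previous = [], None
    for wid in word_ids:
        if wid is None:
            aligned.append(-100)       # [CLS], [SEP], padding
        elif wid != previous:
            aligned.append(tags[wid])  # first subword keeps the tag
        else:
            aligned.append(-100)       # later subwords are ignored
        previous = wid
    return aligned

# "Berlin" split into two subwords; special tokens at both ends
word_ids = [None, 0, 1, 2, 3, 3, None]
tags = ["B-PER", "O", "O", "B-LOC"]
print(align_labels(word_ids, tags))
# → [-100, 'B-PER', 'O', 'O', 'B-LOC', -100, -100]
```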
Fast self-checks that catch most failures
- Shape and parity: every row must satisfy `len(tokens) == len(tags)`. (Hugging Face)
- CSV quoting: in CSV, each cell is a single stringified list (use `json.dumps` to avoid quoting mistakes). (Hugging Face)
- Mapping: in UI, `{"text": "tokens", "label": "tags"}`; for the rename workaround, `{"text": "text", "label": "tags"}`. (Hugging Face)
- Try JSONL when in doubt: it avoids CSV quoting pitfalls; the official task page shows JSONL with arrays. (Hugging Face)
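The parity check is more useful when it reports *which* rows are broken; a small sketch over JSONL (the toy file and function name are illustrative):

```python
import json

def find_bad_rows(path):
    """Return (line_no, reason) for each JSONL row failing the checks."""
    bad = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            row = json.loads(line)
            if not {"tokens", "tags"} <= row.keys():
                bad.append((i, "missing tokens/tags column"))
            elif len(row["tokens"]) != len(row["tags"]):
                bad.append((i, f'{len(row["tokens"])} tokens vs {len(row["tags"])} tags'))
    return bad

# toy file: row 2 is deliberately misaligned
with open("check_me.jsonl", "w", encoding="utf-8") as f:
    f.write('{"tokens": ["John", "lives"], "tags": ["B-PER", "O"]}\n')
    f.write('{"tokens": ["Acme", "Corp"], "tags": ["B-ORG"]}\n')

print(find_bad_rows("check_me.jsonl"))  # → [(2, '2 tokens vs 1 tags')]
```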
Minimal validation/convert snippets (safe defaults)
- Validate and convert JSONL → CSV (stringified lists) before uploading to the UI:

  ```python
  # refs:
  # - Token Classification data format & CSV stringification:
  #   https://huggingface.co/docs/autotrain/en/tasks/token_classification
  # - Column mapping for token classification:
  #   https://huggingface.co/docs/autotrain/col_map
  import json
  import pandas as pd

  # read jsonl
  df = pd.read_json("train.jsonl", lines=True)

  # basic checks
  assert {"tokens", "tags"} <= set(df.columns)
  assert (df["tokens"].apply(len) == df["tags"].apply(len)).all()

  # stringify lists for CSV
  df_out = df.copy()
  df_out["tokens"] = df_out["tokens"].apply(json.dumps)
  df_out["tags"] = df_out["tags"].apply(json.dumps)
  df_out.to_csv("train.csv", index=False)
  ```

  Use the same process for your validation split; keep column names identical across files. The official page stresses consistent column names and shows chunking large CSVs with pandas. (Hugging Face)
Common pitfalls and how to avoid them
- Mismatched column mapping in the UI (e.g., mapping `text` → a raw sentence column for a token task). Use the token/tag mapping as documented. (Hugging Face)
- CSV lists not actually stringified, or quotes/brackets mangled by Excel. Prefer JSONL or stringify with `json.dumps`. (Hugging Face)
- Unequal list lengths between `tokens` and `tags`; even one bad row will break training. (Hugging Face)
- Expecting integer tag IDs; AutoTrain's CSV examples use string tag names. Keep tag lists as strings unless you know the trainer expects IDs. The Transformers recipe shows how label IDs are typically resolved from names. (Hugging Face)
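On the last point: if you ever do need integer IDs (e.g., for a Transformers-style `label2id`/`id2label` config), the usual recipe derives them from the string tags rather than hand-numbering. A sketch with toy rows:

```python
# collect every distinct tag across the dataset (toy rows for illustration)
rows = [
    {"tokens": ["John", "lives", "in", "Berlin"], "tags": ["B-PER", "O", "O", "B-LOC"]},
    {"tokens": ["Acme", "Corp", "hired", "Mary"], "tags": ["B-ORG", "I-ORG", "O", "B-PER"]},
]
labels = sorted({t for r in rows for t in r["tags"]})
label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}

print(label2id)
# → {'B-LOC': 0, 'B-ORG': 1, 'B-PER': 2, 'I-ORG': 3, 'O': 4}

# keep the training data as strings; convert only where IDs are required
print([[label2id[t] for t in r["tags"]] for r in rows])
# → [[2, 4, 4, 0], [1, 3, 4, 2]]
```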
Working end-to-end recipes (concise)
A) UI-only path (fastest when it works)
- Upload `train.csv`/`valid.csv` using the exact CSV template above.
- Set mapping `{"text": "tokens", "label": "tags"}`.
- If you still hit `KeyError: 'text'`, quickly rename `tokens` → `text` in your files and set `{"text": "text", "label": "tags"}`. (Hugging Face)
B) CLI path (avoids the UI bug)
- Keep JSONL or CSV as shown.
- Run:

  ```bash
  autotrain token-classification --train \
    --data-path ./data \
    --train-split train \
    --valid-split valid \
    --tokens-column tokens \
    --tags-column tags
  ```

  Subcommand, splits, and column flags are all documented. (Hugging Face)
Short, curated references (why each is useful)
- Data format + parameters (authoritative):
  • AutoTrain Token Classification task page: CSV/JSONL templates, stringified-lists requirement, and `tokens_column`/`tags_column` parameters. (Hugging Face)
  • Column Mapping guide: exact mapping for token tasks, `{"text": "tokens", "label": "tags"}`. (Hugging Face)
- Bug confirmation (UI route reads `text`):
  • GitHub issue with your exact `KeyError: 'text'` stack. (GitHub)
  • HF Forums thread showing the same error and a working Python/params workaround. (Hugging Face Forums)
- Background on NER/token-labeling:
  • Transformers Token Classification guide: BIO tags, alignment, preprocessing. (Hugging Face)
TL;DR
- Token Classification data must be two aligned columns/lists: `tokens` and `tags`. In CSV they must be stringified; in JSONL they are arrays. Map columns in the UI as `{"text": "tokens", "label": "tags"}`. If the UI still explodes with `KeyError: 'text'`, either (a) temporarily rename your token column to `text` and map `{"text": "text", "label": "tags"}`, or (b) train via the CLI and pass `--tokens-column`/`--tags-column`. All of this is straight from the official docs and the public bug report that matches your traceback. (Hugging Face)