I am training on the ncbi_disease dataset using the Transformers Trainer. Here are the features of the dataset:
DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 5433
    })
    validation: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 924
    })
    test: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 941
    })
})
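For reference, this is roughly how the dataset is loaded (a minimal sketch, assuming the standard ncbi_disease dataset on the Hugging Face Hub):

from datasets import load_dataset

# Load the NCBI Disease NER dataset from the Hugging Face Hub
dataset = load_dataset("ncbi_disease")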
This is the output for a sample from the training split:
{'id': '20',
 'ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 2, 0, 0, 0],
 'tokens': ['For', 'both', 'sexes', 'combined', ',', 'the', 'penetrances', 'at',
            'age', '60', 'years', 'for', 'all', 'cancers', 'and', 'for',
            'colorectal', 'cancer', 'were', '0', '.']}
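The integer tags map to label names; assuming the usual Sequence-of-ClassLabel feature layout, they can be inspected like this:

# Map the integer ner_tags back to their label names
# (assumes ner_tags is a Sequence of ClassLabel, as in the standard dataset)
label_names = dataset["train"].features["ner_tags"].feature.names
print(label_names)  # expected to be something like ['O', 'B-Disease', 'I-Disease']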
Here is the tokenization function; when I run it I get this error:

ArrowInvalid: Column 1 named id expected length 512 but got length 1000
def tokenize_text(examples):
    return tokenizer(str(examples["tokens"]), truncation=True, max_length=512)

dataset = dataset.map(tokenize_text, batched=True)
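For completeness, the tokenizer is set up roughly like this (the exact checkpoint name is only an example and should not matter for the error):

from transformers import AutoTokenizer

# Fast tokenizer for a BERT-style checkpoint; the model name here is illustrative
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")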
Any clue how to solve this problem?