KeyError: 'Invalid key. Only three types of key are available: (1) string, (2) integers for backend Encoding, and (3) slices for data subsetting.'

train_encoded_inputs = tokenizer(x_train["TextData"].tolist(), padding=True, truncation=True, max_length=512, return_tensors='pt')
test_encoded_inputs = tokenizer(x_test["TextData"].tolist(), padding=True, truncation=True, max_length=512, return_tensors='pt')
train_encoded_inputs
training_args = TrainingArguments(
    output_dir='./results',            # model output directory
    learning_rate=2e-5,
    num_train_epochs=3,                # number of epochs
    per_device_train_batch_size=4,     # training batch size
    per_device_eval_batch_size=4,      # evaluation batch size
    warmup_steps=500,                  # number of warmup steps
    weight_decay=0.01,                 # weight decay rate
    logging_steps=10,
    fp16=True
)
trainer = Trainer(
    model=model,                            # the model to train
    args=training_args,                     # training arguments
    train_dataset=train_encoded_inputs,     # training data
    eval_dataset=test_encoded_inputs,       # test data
    tokenizer=tokenizer,
)

I receive this error:

KeyError                                  Traceback (most recent call last)
in <cell line: 1>()
----> 1 trainer.train()

7 frames
/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py in __getitem__(self, item)
    258             return {key: self.data[key][item] for key in self.data.keys()}
    259         else:
--> 260             raise KeyError(
    261                 "Invalid key. Only three types of key are available: "
    262                 "(1) string, (2) integers for backend Encoding, and (3) slices for data subsetting."

KeyError: 'Invalid key. Only three types of key are available: (1) string, (2) integers for backend Encoding, and (3) slices for data subsetting.'

Help me please!

Hi,

It looks like you're passing the inputs prepared by the tokenizer directly to the Trainer. That doesn't work: the tokenizer returns a dictionary (actually a BatchEncoding), whereas the Trainer only accepts a Hugging Face dataset or a PyTorch dataset.
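For example, here is a minimal sketch (assuming y_train and y_test hold integer class labels; the class name EHMSDataset is just an illustrative choice) that wraps the tokenized inputs in a small PyTorch dataset the Trainer can index:

import torch

class EHMSDataset(torch.utils.data.Dataset):
    # Wraps a BatchEncoding plus labels so the Trainer can fetch one example at a time.
    def __init__(self, encodings, labels):
        self.encodings = encodings          # BatchEncoding returned by the tokenizer
        self.labels = list(labels)          # assumed to be integer class ids

    def __getitem__(self, idx):
        # Return one example as a dict of tensors, which is what the Trainer expects.
        item = {key: val[idx] for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = EHMSDataset(train_encoded_inputs, y_train)
test_dataset = EHMSDataset(test_encoded_inputs, y_test)

You would then pass train_dataset and test_dataset to the Trainer instead of the raw encodings.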

See the example script here: transformers/examples/pytorch/text-classification/run_glue.py at e4628434d854ddfb5c002a6cc00b4eb4f22b7df2 · huggingface/transformers · GitHub

For my part, I work with the WUSTL EHMS dataset: WUSTL EHMS 2020 Dataset for Internet of Medical Things (IoMT) Cybersecurity Research.

x=dataset.iloc[:,0:38]
y=dataset.iloc[:,38]
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2, random_state=42)
x_train['TextData'] = x_train.apply(lambda row: f"Sport: {row['Sport']} Dport: {row['Dport']} SrcBytes: {row['SrcBytes']} DstBytes: {row['DstBytes']} SrcLoad: {row['SrcLoad']} DstLoad: {row['DstLoad']} SrcGap: {row['SrcGap']} DstGap: {row['DstGap']} SIntPkt: {row['SIntPkt']} DIntPkt: {row['DIntPkt']} SIntPktAct: {row['SIntPktAct']} DIntPktAct: {row['DIntPktAct']} SrcJitter: {row['SrcJitter']} DstJitter: {row['DstJitter']} sMaxPktSz: {row['sMaxPktSz']} dMaxPktSz: {row['dMaxPktSz']} sMinPktSz: {row['sMinPktSz']} dMinPktSz: {row['dMinPktSz']} Dur: {row['Dur']} Trans: {row['Trans']} TotPkts: {row['TotPkts']} TotBytes: {row['TotBytes']} Load: {row['Load']} Loss: {row['Loss']} pLoss: {row['pLoss']} pSrcLoss: {row['pSrcLoss']} pDstLoss: {row['pDstLoss']} Rate: {row['Rate']} Packet_num: {row['Packet_num']} Temp: {row['Temp']} Temp: {row['Temp']} SpO2: {row['SpO2']} Pulse_Rate: {row['Pulse_Rate']} SYS: {row['SYS']} DIA: {row['DIA']} Heart_rate: {row['Heart_rate']} Resp_Rate: {row['Resp_Rate']} ST: {row['ST']} Attack Category: {row['Attack Category']}", axis=1)
x_test['TextData'] = x_test.apply(lambda row: f"Sport: {row['Sport']} Dport: {row['Dport']} SrcBytes: {row['SrcBytes']} DstBytes: {row['DstBytes']} SrcLoad: {row['SrcLoad']} DstLoad: {row['DstLoad']} SrcGap: {row['SrcGap']} DstGap: {row['DstGap']} SIntPkt: {row['SIntPkt']} DIntPkt: {row['DIntPkt']} SIntPktAct: {row['SIntPktAct']} DIntPktAct: {row['DIntPktAct']} SrcJitter: {row['SrcJitter']} DstJitter: {row['DstJitter']} sMaxPktSz: {row['sMaxPktSz']} dMaxPktSz: {row['dMaxPktSz']} sMinPktSz: {row['sMinPktSz']} dMinPktSz: {row['dMinPktSz']} Dur: {row['Dur']} Trans: {row['Trans']} TotPkts: {row['TotPkts']} TotBytes: {row['TotBytes']} Load: {row['Load']} Loss: {row['Loss']} pLoss: {row['pLoss']} pSrcLoss: {row['pSrcLoss']} pDstLoss: {row['pDstLoss']} Rate: {row['Rate']} Packet_num: {row['Packet_num']} Temp: {row['Temp']} Temp: {row['Temp']} SpO2: {row['SpO2']} Pulse_Rate: {row['Pulse_Rate']} SYS: {row['SYS']} DIA: {row['DIA']} Heart_rate: {row['Heart_rate']} Resp_Rate: {row['Resp_Rate']} ST: {row['ST']} Attack Category: {row['Attack Category']}", axis=1)

train_encoded_inputs = tokenizer(x_train["TextData"].tolist(), padding=True, truncation=True, max_length=512, return_tensors='pt')

test_encoded_inputs = tokenizer(x_test["TextData"].tolist(), padding=True, truncation=True, max_length=512, return_tensors='pt')

train_encoded_inputs

So I don't have to apply a tokenizer?

Can you help me please?

You do have to apply the tokenizer, but the train_dataset argument needs to be a PyTorch dataset or a Hugging Face dataset, not the BatchEncoding the tokenizer returns.
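As a minimal sketch with your DataFrames (assuming y_train and y_test already hold integer class labels; the "label" column name and the tokenize function below are just illustrative), you could build Hugging Face datasets and let the Trainer pad each batch dynamically, since you pass it the tokenizer:

from datasets import Dataset

# Build datasets from the "TextData" column plus an integer "label" column (assumed).
train_ds = Dataset.from_pandas(x_train[["TextData"]].assign(label=y_train.values))
test_ds = Dataset.from_pandas(x_test[["TextData"]].assign(label=y_test.values))

def tokenize(batch):
    # Truncate here; padding is handled per batch by the Trainer's default data collator.
    return tokenizer(batch["TextData"], truncation=True, max_length=512)

train_ds = train_ds.map(tokenize, batched=True)
test_ds = test_ds.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    tokenizer=tokenizer,
)
trainer.train()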

But I would like to work with this dataset (WUSTL EHMS 2020 Dataset for Internet of Medical Things (IoMT) Cybersecurity Research). How can I do that, please?