Trying to build a Q&A bot, got stuck at trainer.train()

Hi All,

Let me first explain what I am trying to accomplish: I want to build a helper bot for our users that can answer questions like: “How do I connect to the system remotely?” or “How am I billed for my HPC jobs?”. The bot should be able to access local knowledge sources, such as our wiki or microservices that can query data about a user’s storage usage, and provide factually accurate answers.

OK, so I am trying to really understand how to do this, not just copy and paste code, so I am working through the Hugging Face tutorials, and so far I have come up with the code that you can see below. However, when I run it, I get the following error (when it executes trainer.train()):

  File "/cs/system/oshani/PycharmProjects/huggingface/venv/lib/python3.9/site-packages/transformers/models/distilbert/modeling_distilbert.py", line 135, in forward
    embeddings = input_embeds + position_embeddings  # (bs, max_seq_length, dim)
RuntimeError: The size of tensor a (1528) must match the size of tensor b (512) at non-singleton dimension 1
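
If I am reading the error correctly, at least one of my examples tokenizes to 1528 tokens, while distilbert-base-uncased only has position embeddings for 512 tokens, so anything longer cannot go through the model. My guess from the tokenizer docs is that the map call in my code below needs truncation, something like this sketch (just my guess, and I am not sure truncation is even the right fix, since it throws away most of the page):

    # (same raw_datasets / tokenizer as in my full code below)
    # my guess: cap every example at the model's 512-token limit
    dataset = raw_datasets.map(
        lambda examples: tokenizer(examples["text"], truncation=True, max_length=512),
        batched=True,
    )

But even if that silences the error, I am not sure my overall training setup makes sense.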

So, am I on the right track at all? Can you tell me what I am missing?

Many thanks,

Oren

----------- My code is below ----

from datasets import load_dataset
from transformers import (
    AutoModelForQuestionAnswering,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    pipeline,
)

# Load the wiki page as a plain-text dataset (the "text" loader yields one example per line)
raw_datasets = load_dataset("text", data_files="https://wiki.cs.huji.ac.il/wiki/Connecting_Remotely")

print(raw_datasets)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Tokenize every line of the page; note that nothing here truncates,
# so long lines keep their full token length
dataset = raw_datasets.map(lambda examples: tokenizer(examples["text"]), batched=True)
dataset_train = dataset["train"]

# Tokenize the whole page again into padded NumPy arrays (not actually used below)
tokenized_data = tokenizer(dataset_train["text"], return_tensors="np", padding=True)
tokenized_dict = dict(tokenized_data)

# DistilBERT with a freshly initialized question-answering head
model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")
training_args = TrainingArguments(
    output_dir="my_awesome_qa_model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset_train,
    tokenizer=tokenizer,
)
trainer.train()  # <-- this is the line that raises the size-mismatch error

# When passing a model object (rather than a model name), the pipeline also needs the tokenizer
qa_pipeline = pipeline("question-answering", model=model, tokenizer=tokenizer)
question = "How do I use a jump server to ssh to the system from home?"

# Use the entire wiki page, joined into one string, as the context
answer = qa_pipeline(question=question, context='\n'.join(dataset_train["text"]))

print(answer)
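
P.S. One more thing I am unsure about: from the tutorials it looks like fine-tuning a question-answering model expects question/context pairs with labeled answer spans, not raw text like I am passing in. This is my understanding of what one SQuAD-style training record would look like (my own assumption, with made-up content, not something taken from our wiki):

    # hypothetical SQuAD-style record, as I understand the expected format
    example = {
        "question": "How do I connect to the system remotely?",
        "context": "To connect from outside the network, ssh to the jump server first and then to the cluster.",
        "answers": {
            "text": ["ssh to the jump server first"],
            "answer_start": [37],  # character offset of the answer within the context
        },
    }

Is preparing the data like this the piece I am missing, or is there some way to train directly on raw wiki text?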