Trying to build a Q&A bot, got stuck at trainer.train()

Hi All,

Let me first explain what I am trying to accomplish: I want to build a helper bot for our users that can answer questions like: “How do I connect to the system remotely?” or “How am I billed for my HPC jobs?”. The bot should be able to access local knowledge sources, such as our wiki or microservices that can query data about a user’s storage usage, and provide factually accurate answers.

OK, so I am trying to really understand how to do this, not just copy and paste code, so I am working through the Hugging Face tutorials, and so far I have come up with the code that you can see below. However, when I run it, I get the following error (when it executes trainer.train()):

  File "/cs/system/oshani/PycharmProjects/huggingface/venv/lib/python3.9/site-packages/transformers/models/distilbert/modeling_distilbert.py", line 135, in forward
    embeddings = input_embeds + position_embeddings  # (bs, max_seq_length, dim)
RuntimeError: The size of tensor a (1528) must match the size of tensor b (512) at non-singleton dimension 1
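
If I am reading the error correctly, at least one of my examples tokenizes to 1528 tokens, while distilbert-base-uncased only has position embeddings for 512 tokens, so anything longer cannot go through the model. My guess from the tokenizer docs is that the map call in my code below needs truncation, something like this sketch (just my guess, and I am not sure truncation is even the right fix, since it throws away most of the page):

    # (same raw_datasets / tokenizer as in my full code below)
    # my guess: cap every example at the model's 512-token limit
    dataset = raw_datasets.map(
        lambda examples: tokenizer(examples["text"], truncation=True, max_length=512),
        batched=True,
    )

But even if that silences the error, I am not sure my overall training setup makes sense.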

So, am I on the right track at all? Can you tell me what I am missing?

Many thanks,

Oren

----------- My code is below ----

from datasets import load_dataset
from transformers import (
    AutoModelForQuestionAnswering,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    pipeline,
)

# Load the wiki page as a plain-text dataset (the "text" loader yields one example per line)
raw_datasets = load_dataset("text", data_files="https://wiki.cs.huji.ac.il/wiki/Connecting_Remotely")

print(raw_datasets)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Tokenize every line of the page; note that nothing here truncates,
# so long lines keep their full token length
dataset = raw_datasets.map(lambda examples: tokenizer(examples["text"]), batched=True)
dataset_train = dataset["train"]

# Tokenize the whole page again into padded NumPy arrays (not actually used below)
tokenized_data = tokenizer(dataset_train["text"], return_tensors="np", padding=True)
tokenized_dict = dict(tokenized_data)

# DistilBERT with a freshly initialized question-answering head
model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")
training_args = TrainingArguments(
    output_dir="my_awesome_qa_model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset_train,
    tokenizer=tokenizer,
)
trainer.train()  # <-- this is the line that raises the size-mismatch error

# When passing a model object (rather than a model name), the pipeline also needs the tokenizer
qa_pipeline = pipeline("question-answering", model=model, tokenizer=tokenizer)
question = "How do I use a jump server to ssh to the system from home?"

# Use the entire wiki page, joined into one string, as the context
answer = qa_pipeline(question=question, context='\n'.join(dataset_train["text"]))

print(answer)
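
P.S. One more thing I am unsure about: from the tutorials it looks like fine-tuning a question-answering model expects question/context pairs with labeled answer spans, not raw text like I am passing in. This is my understanding of what one SQuAD-style training record would look like (my own assumption, with made-up content, not something taken from our wiki):

    # hypothetical SQuAD-style record, as I understand the expected format
    example = {
        "question": "How do I connect to the system remotely?",
        "context": "To connect from outside the network, ssh to the jump server first and then to the cluster.",
        "answers": {
            "text": ["ssh to the jump server first"],
            "answer_start": [37],  # character offset of the answer within the context
        },
    }

Is preparing the data like this the piece I am missing, or is there some way to train directly on raw wiki text?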