Let me first explain what I am trying to accomplish: I am trying to build a helper bot for our users that can answer questions like "How do I connect to the system remotely?" or "How am I billed for my HPC jobs?". The bot should be able to access local knowledge sources, such as our Wiki or some microservices that can query data about a user's storage usage, etc., and provide factually accurate answers.
OK, so I am trying to really understand how to do this, not just copy and paste code, so I am going through the Hugging Face training sessions, and so far I have come up with the code you can see below. However, when I run it, I get the following error (when it executes trainer.train()):
File "/cs/system/oshani/PycharmProjects/huggingface/venv/lib/python3.9/site-packages/transformers/models/distilbert/modeling_distilbert.py", line 135, in forward
    embeddings = input_embeds + position_embeddings  # (bs, max_seq_length, dim)
RuntimeError: The size of tensor a (1528) must match the size of tensor b (512) at non-singleton dimension 1
So, am I on the right track at all? Can you tell me what I am missing?
----------- My code is below ----
from datasets import load_dataset
from transformers import AutoTokenizer
from transformers import pipeline
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer
from torch.optim import AdamW

# Load the wiki page as a plain-text dataset
raw_datasets = load_dataset("text", data_files="https://wiki.cs.huji.ac.il/wiki/Connecting_Remotely")
print(raw_datasets)

# Tokenize every line of the page
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
dataset = raw_datasets.map(lambda examples: tokenizer(examples["text"]), batched=True)
dataset_train = dataset["train"]
tokenized_data = tokenizer(dataset_train["text"], return_tensors="np", padding=True)
tokenized_dict = dict(tokenized_data)

model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")

training_args = TrainingArguments(
    output_dir="my_awesome_qa_model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset_train,
    tokenizer=tokenizer,
)
trainer.train()

# Ask a question against the whole page as context
qa_pipeline = pipeline("question-answering", model=model)
question = "How do I use a jump server to ssh to the system from home?"
answer = qa_pipeline(question=question, context='\n'.join(dataset_train["text"]))
print(answer)
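For what it's worth, the size mismatch (1528 vs. 512) suggests one tokenized input is longer than DistilBERT's position-embedding table, which only has 512 rows. Below is a minimal sketch of how truncation caps sequence length at tokenization time; the repeated-word text is just a stand-in for a long wiki page, and this isn't tested against my actual data.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Stand-in for a long document: well over 512 tokens once tokenized
long_text = "word " * 2000

# truncation=True with max_length=512 keeps every sequence within the
# model's position-embedding limit, avoiding the dimension-1 mismatch
enc = tokenizer(long_text, truncation=True, max_length=512)
print(len(enc["input_ids"]))  # capped at 512
```

Note that truncation silently drops everything past the limit, so for whole documents the text usually has to be split into chunks instead (the tokenizer's return_overflowing_tokens option, for example) rather than thrown away.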