I’m looking for help improving my understanding of how to build a custom Q&A answer finder without using pipelines.
Using “distilbert-base-cased-distilled-squad”, which I understand is already fine-tuned on SQuAD.
The corpus is ~80 sentences (newline-delimited) of naturally written “help” info.
(Code condensed for brevity.)
Start up the model/tokenizer:
import math
import torch
from transformers import AutoConfig, AutoModelForQuestionAnswering, AutoTokenizer

model_name = "distilbert-base-cased-distilled-squad"
device = "cuda" if torch.cuda.is_available() else "cpu"
config = AutoConfig.from_pretrained(model_name)
config.num_labels = 2  # start/end span logits
model = AutoModelForQuestionAnswering.from_pretrained(model_name, config=config).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)
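(Side note: the tokenizer exposes its special token ids, which is safer than hardcoding them; this is how I’d double-check the [CLS]/[SEP] values used further down.)

print(tokenizer.cls_token_id, tokenizer.sep_token_id)  # expect 101 and 102 for this vocab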
Chunk the corpus for use by the model, with overlapping windows so possible answers aren’t skipped at chunk boundaries:
chunks = []
tokens = tokenizer.encode(context, add_special_tokens=False)  # don't overwrite the tokenizer variable with the ids
for i in range(0, len(tokens), chunk_size - overlap_size):  # chunk_size/overlap_size are set elsewhere
    chunks.append(tokens[i : min(i + chunk_size, len(tokens))])
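Aside: I believe the tokenizer can do this windowing itself and place the special tokens at the same time. A minimal sketch of what I think that looks like, assuming a plain-text question/context pair; the max_length/stride values here are just guesses:

# Sketch: tokenizer-managed windowing; window size and stride are assumptions.
enc = tokenizer(
    question,
    context,
    truncation="only_second",        # window only the context, never the question
    max_length=384,                  # assumed window size
    stride=128,                      # assumed overlap between windows
    return_overflowing_tokens=True,  # one row per window
    padding="max_length",
    return_tensors="pt",
)
# enc["input_ids"] is then (num_windows, max_length) with [CLS]/[SEP] already placed.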
Fetch an answer for the question from each chunk.
I don’t like constructing a BatchEncoding by hand, nor manually appending the special tokens; I want to do this better/right (see the prepare_for_model sketch after the code).
answers = []  # n.b. "as" is a Python keyword, so it can't be the variable name
question_tokens = tokenizer.encode(question, add_special_tokens=False)
for chunk in chunks:
    # [CLS] question [SEP] chunk [SEP] — for this vocab [CLS] is 101 and [SEP] is 102 (not 120)
    input_ids = [tokenizer.cls_token_id] + question_tokens + [tokenizer.sep_token_id] + chunk + [tokenizer.sep_token_id]
    inputs = {
        "input_ids": torch.tensor([input_ids], device=device),
        "attention_mask": torch.ones(1, len(input_ids), dtype=torch.long, device=device),
    }
    with torch.no_grad():
        output = model(**inputs)
    start_idx = int(torch.argmax(output.start_logits))
    end_idx = int(torch.argmax(output.end_logits))
    answer = tokenizer.decode(inputs["input_ids"][0, start_idx : end_idx + 1])
    answers.append({
        "answer": answer,
        "score": float(torch.max(torch.softmax(output.start_logits, dim=1))),
        "log_score": 0,
    })
for a in answers:
    a["log_score"] = round(math.log(a["score"], 4), 5)  # log base 4, rounded to 5 places
sorted_answers = sorted(answers, key=lambda x: x["log_score"], reverse=True)
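Re the BatchEncoding gripe above: I think tokenizer.prepare_for_model() is the intended way to pair two id lists and insert the special tokens, returning a ready BatchEncoding. A sketch of what I mean (argument names taken from the HF docs, but I haven’t verified every one):

# Sketch: replace the hand-built BatchEncoding with prepare_for_model().
inputs = tokenizer.prepare_for_model(
    question_tokens,             # ids of the question
    chunk,                       # ids of the context window
    add_special_tokens=True,     # inserts [CLS]/[SEP] for me
    return_attention_mask=True,
    return_tensors="pt",
    prepend_batch_axis=True,     # adds the batch dimension of 1
).to(device)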
This “works.” However, the answers the sort bubbles to the top are often incorrect, and the “more correct” answer ends up 3rd or 4th, or isn’t found at all.
I feel like I’m missing a step or a filter, or I just don’t understand something.
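One suspicion about why the sort misleads: my “score” is a softmax over start logits only, normalized within each chunk, so it isn’t comparable across chunks; it also ignores the end logit entirely and can pick a start inside the question, or a start past the end. Is something like this span scoring on raw logits, restricted to the chunk tokens, the right direction? (A sketch; the max_answer_len cutoff is my guess:)

# Sketch: score spans by start_logit + end_logit over context tokens only.
def best_span(start_logits, end_logits, ctx_start, ctx_end, max_answer_len=30):
    best = (float("-inf"), ctx_start, ctx_start)  # (score, start, end)
    for s in range(ctx_start, ctx_end):
        for e in range(s, min(s + max_answer_len, ctx_end)):
            score = float(start_logits[0, s] + end_logits[0, e])
            if score > best[0]:
                best = (score, s, e)
    return best

# Per chunk: context tokens start after [CLS] + question + [SEP].
# ctx_start = 1 + len(question_tokens) + 1
# ctx_end = ctx_start + len(chunk)
# score, s, e = best_span(output.start_logits, output.end_logits, ctx_start, ctx_end)
# Raw logit sums stay comparable across chunks, so sort on score directly.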