Amateur data scientist here. I’ve worked with nltk/spacy/gensim, mainly to compute soft cosine similarity between text documents. I’m working on a new project where I’m trying to identify questions and answers in a six-month chat history with multiple participants and multiple threads of conversation (some linked, some not).
I’ve gotten question identification somewhat down, but am struggling to figure out a good approach to identifying answers. One approach I’ve thought about is through entity recognition clustering, though that seems a little too narrow/rigid.
I saw the adversarial_qa dataset on Hugging Face and was wondering if it might be applicable to my use case. If so, any suggestions on how to use transfer learning to adapt it to the domain of my corpus (which is mostly code and niche-topic text)?
Thanks in advance, all!
Hi @ilemi, if you’ve already got a set of questions nailed down, couldn’t you use a retriever like TF-IDF / BM25 / DPR to return candidate answers (i.e. passages of text that are indexed in some fashion)?
If yes, there’s a nice question-answering library called haystack that provides various retrievers for you to play with: https://haystack.deepset.ai/docs/latest/retrievermd
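To make the retriever idea concrete, here is a minimal sketch of BM25 ranking in plain Python. It is not haystack's API, just the scoring idea behind a sparse retriever. The `tokenize` and `bm25_rank` helpers, the `k1`/`b` defaults, and the sample messages are all assumptions made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphanumeric runs; code-heavy chats may need
    # a tokenizer that preserves identifiers and symbols.
    return re.findall(r"[a-z0-9_]+", text.lower())


def bm25_rank(query, passages, k1=1.5, b=0.75):
    """Return (score, passage) pairs sorted by BM25 score, best first."""
    docs = [tokenize(p) for p in passages]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs if n_docs else 0.0
    # Document frequency: number of passages that contain each term.
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scored = []
    for passage, doc in zip(passages, docs):
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalisation penalises long passages.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scored.append((score, passage))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical chat messages, invented for illustration.
messages = [
    "You can install the CLI with pip on Windows and Linux.",
    "Has anyone seen the new release notes?",
    "Yes it runs on Windows, but you need WSL for the GPU build.",
]
ranked = bm25_rank("Does it run on Windows?", messages)
for score, passage in ranked:
    print(f"{score:.3f}  {passage}")
```

A dense retriever like DPR would replace the term-overlap score with embedding similarity, which may cope better with paraphrased answers but needs a trained encoder.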
Thanks for the reply @lewtun. I’ve never heard of a retriever before, and will do some digging on the resources you have shared.
My main concern is that most of the answers would be specific to that developer product, i.e. only found in the product’s developer docs and Discord/Slack chat history. The answers are in the chat history (which I have), but I haven’t found an efficient way to identify answers or create question-answer pairs yet. I realize this may be less of an NLP problem and more of a matching problem, or something that needs manual labelling to train a separate classification model. Another approach is to use named entity recognition and find Q&A pairs through shared entities. Any suggestions anyone may have on that are welcome too.
Ah I see, so the challenge is that you can have a question like “Does it run on Windows?” which could apply to multiple products (I’m assuming you can’t filter the passages by some product ID).
If you happen to know the sequence of questions in a given chat, I wonder whether you could do something simple like treating all responses between two questions as potential answers, which you can then rank with something like TF-IDF.
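A rough sketch of that windowing idea, in plain Python: group each question with the messages that follow it up to the next question, then rank those messages by TF-IDF cosine similarity to the question. The helper names, the trailing-"?" question heuristic, and the sample chat are assumptions for illustration only; your own question detector would replace the heuristic.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9_]+", text.lower())


def candidate_windows(messages, is_question):
    """Pair each question with the messages before the next question."""
    windows, current = [], None
    for msg in messages:
        if is_question(msg):
            current = (msg, [])
            windows.append(current)
        elif current is not None:
            current[1].append(msg)
    return windows


def tfidf_vectors(texts):
    docs = [Counter(tokenize(t)) for t in texts]
    n = len(docs)
    df = Counter(term for d in docs for term in d)
    # Smoothed IDF so terms present in every text keep a nonzero weight.
    return [
        {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in d.items()}
        for d in docs
    ]


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_answers(question, replies):
    """Return (similarity, reply) pairs, most similar first."""
    vectors = tfidf_vectors([question] + replies)
    scored = [(cosine(vectors[0], vec), r) for vec, r in zip(vectors[1:], replies)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical chat excerpt, invented for illustration.
chat = [
    "How do I set the API key for the client?",
    "lol same question",
    "Set the API key with client.configure(key=MY_KEY) before the first call.",
    "Is there a rate limit?",
    "Rate limit is 60 requests per minute I think.",
]
windows = candidate_windows(chat, lambda m: m.strip().endswith("?"))
best = {}
for question, replies in windows:
    ranked = rank_answers(question, replies)
    best[question] = ranked[0][1] if ranked else None
    print(question, "->", best[question])
```

With interleaved threads, an answer can land in the wrong window, so it may be worth extending each window by a few messages or restricting candidates to participants who replied to the asker.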
If not, then an alternative would be to label some data with e.g. “question” and “non-question” and train a simple classifier. NER could certainly help and for that there’s a nice tutorial here: https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/token_classification.ipynb#scrollTo=545PP3o8IrJV
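As a sketch of the “simple classifier” option, here is a tiny multinomial Naive Bayes in plain Python with add-one smoothing. The `NaiveBayes` class, the labels, and the eight training messages are made up for illustration; in practice you would label a few hundred real messages and probably reach for an existing library implementation.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Keep "?" as its own token because it is a strong question signal.
    return re.findall(r"[a-z0-9_]+|\?", text.lower())


class NaiveBayes:
    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.term_counts[label].update(tokenize(text))
        self.vocab = {t for c in self.term_counts.values() for t in c}
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        terms = [t for t in tokenize(text) if t in self.vocab]
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus add-one smoothed log likelihoods.
            score = math.log(count / total)
            score += sum(math.log((counts[t] + 1) / denom) for t in terms)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labelled messages, invented for illustration.
train = [
    ("How do I install the package?", "question"),
    ("Does it run on Windows?", "question"),
    ("What version fixes this bug?", "question"),
    ("Can anyone share the config file?", "question"),
    ("Thanks, that worked.", "non-question"),
    ("I pushed the fix to main.", "non-question"),
    ("It runs fine on my machine.", "non-question"),
    ("The docs cover that in the setup section.", "non-question"),
]
model = NaiveBayes().fit([t for t, _ in train], [y for _, y in train])
print(model.predict("Does the fix work on Linux?"))
print(model.predict("I pushed the docs."))
```

The same setup could in principle be reused with “answer” / “non-answer” labels on question-reply pairs, though that is an extension of the suggestion above rather than something tested here.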
I like both of those approaches! I will try them out and let you know if they work.
Thanks again @lewtun!
Hi @ilemi, I was wondering if we could chat about the problem you are solving. I am working on a similar project, creating an FAQ from chat history :)