Matching Questions with Answers in Chat Text with multiple threads

ilemi · March 11, 2021, 4:34pm

Hey everyone!

Amateur data scientist here. I’ve worked with nltk/spacy/gensim, mostly to compute soft cosine similarity between text documents. I’m working on a new project where I’m trying to identify questions and answers in a 6-month chat history with multiple participants and multiple threads of conversation (some linked, some not).

I’ve gotten question identification somewhat down, but am struggling to figure out a good approach to identifying answers. One approach I’ve thought about is through entity recognition clustering, though that seems a little too narrow/rigid.

I saw the adversarial_qa dataset on Hugging Face and was wondering if it might be applicable to my use case. If so, any suggestions on how to use transfer learning to adapt it to the domain of my corpus (which is mostly code and niche, topic-specific text)?

Thanks in advance, all!

lewtun · March 11, 2021, 8:15pm

Hi @ilemi, if you’ve already got a set of questions nailed down, couldn’t you use a retriever like TF-IDF / BM25 / DPR to return candidate answers (i.e. passages of text that are indexed in some fashion)?

If yes, there’s a nice question-answering library called haystack that provides various retrievers for you to play with: https://haystack.deepset.ai/docs/latest/retrievermd
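To make the retriever idea concrete, here is a minimal pure-Python sketch of BM25 ranking over chat messages. It is not haystack's API; the function names (`tokenize`, `bm25_rank`), the toy messages, and the parameter defaults `k1=1.5` and `b=0.75` are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into alphanumeric tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z0-9_]+", text.lower())

def bm25_rank(question, passages, k1=1.5, b=0.75):
    """Return (score, passage) pairs sorted from most to least relevant."""
    docs = [tokenize(p) for p in passages]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many passages each term appears.
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(question))
    scored = []
    for passage, doc in zip(passages, docs):
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalisation: long passages are penalised relative to the average.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scored.append((score, passage))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Hypothetical chat messages acting as the indexed passages.
messages = [
    "you need to set the API key in the config file before calling deploy",
    "lol same here",
    "has anyone tried the new release on Windows?",
    "the deploy command reads the config file from the project root",
]
ranked = bm25_rank("where does deploy read the config file from?", messages)
for score, passage in ranked:
    print(f"{score:.3f}  {passage}")
```

A library retriever would add proper tokenization, stemming and an index, but the ranking principle is the same: messages that share rare terms with the question float to the top.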

ilemi · March 11, 2021, 8:30pm

Thanks for the reply @lewtun. I’ve never heard of a retriever before, and will do some digging on the resources you have shared.

My main concern is that most of the answers would be specific to that developer product, i.e. only found in product developer docs and discord/slack chat history. The answers are in the chat history (which I have), but I haven’t found an efficient way to identify answers or create question-answer pairs yet. I realize this may be less of an NLP problem and more a matter of finding a matching algorithm, or of manually labelling data for a separate classification model. Another approach is to try named entity recognition and find Q&A pairs through that. Any suggestions on that are welcome too.

lewtun · March 11, 2021, 9:37pm

Ah I see, so the challenge is that you can have a question like “Does it run on Windows?” which could apply to multiple products (I’m assuming you can’t filter the passages by some product ID).

If you happen to know the sequence of questions in a given chat, I wonder whether you could do something simple like treating all responses between two questions as potential answers, which you can then rank with something like TF-IDF.
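A rough sketch of that windowing idea, assuming the question positions are already known: every message between one question and the next is treated as a candidate and ranked by TF-IDF cosine similarity to the question. The helper names, the smoothed IDF formula and the toy chat are assumptions, and with interleaved threads the window will also contain unrelated messages, so this is only a first filter.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9_]+", text.lower())

def tfidf_vectors(texts):
    """Build sparse TF-IDF vectors (dicts) with a smoothed IDF over the given texts."""
    docs = [Counter(tokenize(t)) for t in texts]
    n = len(docs)
    df = Counter(term for d in docs for term in d)
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in d.items()} for d in docs]

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def candidate_answers(messages, question_idx):
    """Map each question index to its following messages, ranked by similarity."""
    results = {}
    boundaries = sorted(question_idx) + [len(messages)]
    for start, end in zip(boundaries, boundaries[1:]):
        window = messages[start + 1:end]  # everything up to the next question
        if not window:
            results[start] = []
            continue
        vecs = tfidf_vectors([messages[start]] + window)
        scores = [cosine(vecs[0], v) for v in vecs[1:]]
        results[start] = sorted(zip(scores, window), key=lambda p: p[0], reverse=True)
    return results

# Hypothetical chat; indices 0 and 3 are assumed to be the detected questions.
chat = [
    "how do I rotate the api token?",
    "brb lunch",
    "run the rotate command and the api token is regenerated",
    "does the cli run on windows?",
    "yes the cli works on windows 10 and 11",
]
pairs = candidate_answers(chat, [0, 3])
for q, ranked in pairs.items():
    print(chat[q], "->", ranked[0][1] if ranked else None)
```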

If not, then an alternative would be to label some data with e.g. “question” and “non-question” and train a simple classifier. NER could certainly help and for that there’s a nice tutorial here: https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/token_classification.ipynb#scrollTo=545PP3o8IrJV
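As a sketch of what “a simple classifier” could look like, here is a tiny multinomial Naive Bayes written in plain Python. The class name, the tokenizer (which keeps “?” as a token) and the six labelled examples are all made up for illustration; a real run would need far more labelled messages, and a library implementation would be the more usual choice.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    # Keep "?" as its own token, since it is a strong cue for questions.
    return re.findall(r"[a-z0-9_]+|\?", text.lower())

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)  # log prior
            for tok in tokenize(text):
                if tok in self.vocab:  # tokens never seen in training are ignored
                    score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical hand-labelled messages.
train = [
    ("how do I install the sdk?", "question"),
    ("does anyone know why the build fails?", "question"),
    ("what version should I use?", "question"),
    ("you can install it with the package manager", "non-question"),
    ("the build fails when the cache is stale", "non-question"),
    ("thanks, that worked", "non-question"),
]
clf = NaiveBayes().fit([t for t, _ in train], [y for _, y in train])
print(clf.predict("why does the deploy fail?"))
print(clf.predict("the cache is stale so the build fails"))
```

The same setup works for other label sets, for example “answer” vs “non-answer”, once some messages have been labelled that way.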

ilemi · March 11, 2021, 9:44pm

I like both of those approaches! I will try them out and let you know if they work.

Thanks again @lewtun!

vnovosad · June 8, 2022, 12:04pm

Hi @ilemi, I was wondering if we could chat about the problem you are solving. I am working on a similar project, creating an FAQ from chat history :)
