I’m working on an NER project using chat data. I was wondering if there are any thoughts from the community on:
- dealing with frequently occurring (read: repeated, automated) sentences from helpdesk employees that contain entities, such as “My name is John Doe”, “My name is Peter”, etc. Is it necessary to augment with names that occur rarely (or not at all) in the data, or is that not really needed? My worry is that with so many of these standard sentences the model could simply learn the position of the name: it gets too easy for the model if different helpdesk employees use exactly the same syntax. Maybe undersample? (A rough sketch of what I mean is after this list.)
- given that these standard sentences (n-grams) exist, you might also get overlap between the train, test, and validation sets if you do not (completely or partly) remove them. Does this need to be handled explicitly? (See the overlap check sketched below.)
- dealing with standard sentences like “Our opening hours are from …” or “Your client number consists of 9 digits”. Would it make sense to delete these standard (and very frequently occurring) sentences from the chats before training a model? It would of course alter the chats themselves.
- would it make more sense to train on whole chats or on individual sentences? Our thinking is chat level, since it provides more context than relatively short sentences, or sometimes even answers that contain only a postal code or an address. (A sketch of what I mean by chat-level examples is below.)
- since, next to the chat data from our chatbot / agent interactions, we also want to include WhatsApp chats and Facebook chats, and these data sources differ in size, how would one decide on the division within the total dataset? Distribute proportionally to the absolute size of each source, or give every source an equal share, e.g. 1/3, 1/3, 1/3? (See the mixing sketch below.)
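
To make the undersampling idea concrete, here is a rough sketch of what I had in mind (the masking rule and the cap of 5 copies per template are just placeholder choices, nothing we have validated): normalize each sentence by masking digits and capitalized words, count how often each resulting template occurs, and keep at most a few copies of each in the training set.

```python
import random
import re
from collections import defaultdict

def template_key(sentence: str) -> str:
    """Crude normalization: mask digits and capitalized words, so
    'My name is John' and 'My name is Peter' end up with the same key."""
    masked = re.sub(r"\d+", "<NUM>", sentence)
    masked = re.sub(r"\b[A-Z][a-z]+\b", "<CAP>", masked)
    return masked.lower()

def undersample_templates(sentences, max_per_template=5, seed=42):
    """Keep at most `max_per_template` copies of each template for training."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for s in sentences:
        buckets[template_key(s)].append(s)
    kept = []
    for group in buckets.values():
        rng.shuffle(group)
        kept.extend(group[:max_per_template])
    return kept

sentences = ["My name is John Doe."] * 200 + ["My name is Peter."] * 200
print(len(undersample_templates(sentences)))  # 10 kept instead of 400
```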
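
For the train/test overlap point, this is the kind of check I was thinking of, assuming each chat is simply a list of utterance strings and the split is done per chat rather than per sentence: measure how many test sentences also occur verbatim in the training set.

```python
import random

def split_by_chat(chats, test_frac=0.1, seed=0):
    """Split at the chat level so sentences from one conversation never
    end up in both train and test."""
    rng = random.Random(seed)
    chats = list(chats)
    rng.shuffle(chats)
    n_test = max(1, int(len(chats) * test_frac))
    return chats[n_test:], chats[:n_test]

def verbatim_overlap(train_chats, test_chats):
    """Fraction of test sentences that also appear word-for-word in train,
    a rough proxy for leakage caused by standard/templated sentences."""
    train_sents = {s.strip().lower() for chat in train_chats for s in chat}
    test_sents = [s.strip().lower() for chat in test_chats for s in chat]
    hits = sum(s in train_sents for s in test_sents)
    return hits / max(1, len(test_sents))

# e.g. an overlap of 0.4 would mean 40% of test sentences are exact copies of train sentences
```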
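
By “chat level” I mean roughly the following: one (or a few) training example(s) per conversation instead of one per sentence, so that short answers like a bare postal code keep their surrounding turns. The speaker prefixes and the word budget here are just assumptions for illustration.

```python
def chat_level_examples(chat, max_words=300):
    """Turn one chat (a list of (speaker, utterance) tuples) into one or more
    training examples, instead of one example per sentence.

    Long chats are split into chunks so examples stay under a rough length
    budget; within a chunk the model still sees the surrounding turns."""
    examples, current, length = [], [], 0
    for speaker, utterance in chat:
        n_words = len(utterance.split())
        if current and length + n_words > max_words:
            examples.append(" ".join(current))
            current, length = [], 0
        current.append(f"{speaker}: {utterance}")
        length += n_words
    if current:
        examples.append(" ".join(current))
    return examples

chat = [("agent", "My name is John Doe, how can I help?"),
        ("customer", "My postal code is 1234 AB.")]
print(chat_level_examples(chat))
```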
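
And for the source mixing question, these are the two options I’m weighing (proportional to absolute size vs a fixed share per source), assuming each source is just a list of chats; sampling with replacement when a source is smaller than its quota is my own assumption.

```python
import random

def mix_sources(sources, weights=None, total=30_000, seed=0):
    """Build a training pool from several sources.

    sources: dict like {"chatbot": [...], "whatsapp": [...], "facebook": [...]}
    weights: dict of target fractions per source; None means proportional
             to the absolute size of each source."""
    rng = random.Random(seed)
    if weights is None:  # proportional to absolute size
        n_all = sum(len(v) for v in sources.values())
        weights = {name: len(v) / n_all for name, v in sources.items()}
    pool = []
    for name, chats in sources.items():
        n_take = round(total * weights[name])
        if n_take <= len(chats):
            pool.extend(rng.sample(chats, n_take))
        else:
            # source smaller than its quota: sample with replacement
            pool.extend(rng.choices(chats, k=n_take))
    rng.shuffle(pool)
    return pool

# equal-share option:
# mix_sources(srcs, weights={"chatbot": 1/3, "whatsapp": 1/3, "facebook": 1/3})
```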
I am eager to hear thoughts, feedback on these decisions, or the thought process / argumentation / experience that comes with them. Normally you don’t see a lot of discussion about setup and learnings / best practices.