NER chat project setup

I’m working on a NER project using chat data. I was wondering if there are any thoughts from the community on:

  • Dealing with frequently occurring (read: repeated, canned) sentences from helpdesk employees that contain entities, such as "My name is John Doe", "My name is Peter", etc. Is it necessary to augment with rarely occurring (or unseen) names, or is that not really needed? With so many of these standard sentences, the model could simply learn the position of the name; the task becomes too easy if different helpdesk employees use exactly the same syntax. Should we undersample them?
  • Given that these standard sentences (n-grams) exist, you might also get overlap between the train, test, and validation sets if the n-grams are not (completely or partly) removed. How should we handle this?
  • Dealing with standard sentences like "Our opening hours are from …" or "Your client number consists of 9 digits". Would it make sense to delete these standard (and very frequent) sentences from the chats before training a model? It would of course alter the chats themselves.
  • Would it make more sense to train on whole chats or on individual sentences? Our thinking is chat level, since that provides more context than relatively short sentences, or even answers that contain only a postal code or an address.
  • Besides the chat data from our chatbot and agent interactions, we also want to include WhatsApp and Facebook chats, and these data sources differ in size. How would one decide on the division of the total dataset: proportions based on absolute size, or an equal split per source (1/3, 1/3, 1/3)?
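To make the undersampling idea from the first bullet concrete, here's a rough sketch of what I have in mind: cap how many copies of each exact sentence survive into the training set. The `max_repeats` value is a hypothetical knob to tune, not a recommendation.

```python
from collections import Counter

def undersample_templates(sentences, max_repeats=5):
    """Keep at most `max_repeats` copies of each exact sentence.

    A crude way to stop the model from memorising helpdesk boilerplate
    such as "My name is John Doe". Exact string match only; near-
    duplicates with different names would need fuzzier matching.
    """
    seen = Counter()
    kept = []
    for s in sentences:
        seen[s] += 1
        if seen[s] <= max_repeats:
            kept.append(s)
    return kept

sentences = (["My name is John Doe"] * 20
             + ["Your client number consists of 9 digits"] * 3)
print(len(undersample_templates(sentences, max_repeats=5)))  # 8
```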
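For the train/test overlap question, one direction I'm considering is a group-aware split: assign whole chats to a split based on a shared template key, so that chats containing the same canned sentence never straddle train and test. The grouping key below (lowercased first sentence) is just a placeholder; a real setup would probably group on character n-gram similarity.

```python
import random

def split_by_template(chats, train=0.8, seed=0):
    """Split a list of chats (each a list of sentences) into train/test
    so that chats sharing the same normalised opening sentence always
    land in the same split, avoiding template leakage across sets."""
    groups = {}
    for chat in chats:
        key = chat[0].lower().strip()  # placeholder grouping key
        groups.setdefault(key, []).append(chat)
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    cut = int(len(keys) * train)
    train_set = [c for k in keys[:cut] for c in groups[k]]
    test_set = [c for k in keys[cut:] for c in groups[k]]
    return train_set, test_set
```

The same idea extends to a three-way train/validation/test split by cutting the shuffled key list twice.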
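And for the last bullet, the two options (proportional vs. equal per source) can both be expressed as mixture weights when sampling the training set; a minimal sketch, with the source names and weights purely illustrative:

```python
import random

def sample_mixture(sources, weights, n, seed=0):
    """Draw n training examples from several sources according to
    mixture weights, e.g. equal thirds or proportional to source size.

    `sources` maps a source name to its list of examples; `weights`
    maps the same names to relative sampling weights (with replacement).
    """
    rng = random.Random(seed)
    names = list(sources)
    w = [weights[name] for name in names]
    out = []
    for _ in range(n):
        name = rng.choices(names, weights=w)[0]
        out.append(rng.choice(sources[name]))
    return out

# Equal split per source (1/3 each), regardless of source size:
equal = {"chatbot": 1, "whatsapp": 1, "facebook": 1}
# Proportional to absolute size instead:
# proportional = {name: len(data) for name, data in sources.items()}
```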

I am eager to hear thoughts, feedback, or experience on these decisions and the reasoning behind them. Normally you don't see many discussions on setup, learnings, and best practices.