Seeking Guidance on Creating and Training a Model with a Specific Dataset

Hello Hugging Face community,

I’m currently working on a project that involves creating and training a machine learning model using a unique dataset. I would greatly appreciate your expertise and guidance on how to tackle this task effectively.

Dataset Description:

  • I have a dataset that contains comments and segmented text. Each comment appears to be related to a specific topic or experience, and the segmented text seems to be a breakdown of the comment’s content.

Objective:

  • My main objective is to leverage this dataset for various natural language processing (NLP) tasks. However, I’m uncertain about the best approach and would love to hear your suggestions.

Specific Questions and Challenges:

  1. How can I preprocess and clean this dataset effectively for NLP tasks such as sentiment analysis or text segmentation?
  2. What models or architectures would you recommend for tasks like sentiment analysis or text segmentation?
  3. Are there any specific libraries or tools within the Hugging Face ecosystem that I should consider using for this project?
  4. Any best practices or tips for training on datasets with this kind of structure?
  5. What should I keep in mind while fine-tuning a model on this dataset?

Hey,
Let’s tackle these one by one.

For preprocessing the dataset there are a few things you could do, like text normalisation, feature extraction, and various other steps. The preprocessing methods vary between NLP tasks, so be sure you know what suits your purpose best.
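To make the normalisation part concrete, here is a minimal cleaning sketch in Python. The file name `comments.csv` and the column names `comment` / `segmented_text` are only assumptions about your layout, so adjust them to whatever your dataset actually uses:

```python
import re

import pandas as pd

# Hypothetical file/column names -- replace with your actual dataset layout.
df = pd.read_csv("comments.csv")  # assumed columns: "comment", "segmented_text"


def normalise(text: str) -> str:
    """Light normalisation: collapse whitespace and strip control characters."""
    text = str(text)
    text = re.sub(r"[\r\n\t]+", " ", text)             # newlines/tabs -> space
    text = re.sub(r"\s{2,}", " ", text)                # collapse repeated spaces
    text = re.sub(r"[\x00-\x08\x0b-\x1f]", "", text)   # drop control characters
    return text.strip()


df["comment"] = df["comment"].map(normalise)
df = df.dropna(subset=["comment"]).drop_duplicates(subset="comment")
df.to_csv("comments_clean.csv", index=False)
```

How aggressive you want to be (lowercasing, stripping characters, etc.) really depends on the downstream task, which is why I would keep the cleaning function small and task-specific.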

For sentiment analysis, I would recommend using BERT or DistilBERT, and for text segmentation you can use a BiLSTM or transformer-based models like GPT-2.
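If you want a quick feel for how a pretrained sentiment model behaves on your comments before fine-tuning anything, a pipeline call is enough. The checkpoint below is the standard SST-2 DistilBERT model; swap in your own fine-tuned model later:

```python
from transformers import pipeline

# Off-the-shelf DistilBERT fine-tuned on SST-2; fine-tune your own model
# if you need domain-specific sentiment labels.
sentiment = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

print(sentiment("This comment breaks down the experience really clearly."))
# -> [{'label': 'POSITIVE', 'score': 0.99...}]
```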

From what I can see, the dataset seems comprehensive in terms of structure, but for an effective training regime you should consider removing unnecessary punctuation, extra spaces, and other anomalies. It would also be really helpful to tokenize the dataset before training.
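Tokenising with the Hugging Face `datasets` + `transformers` stack is usually just a `map` call. Again, the file name and the `comment` column are assumptions carried over from the cleaning sketch above:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Assumed file/column names from the earlier sketch -- adjust to your data.
dataset = load_dataset("csv", data_files="comments_clean.csv")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")


def tokenize(batch):
    # Truncate long comments so every example fits the model's context window.
    return tokenizer(batch["comment"], truncation=True, max_length=128)


tokenized = dataset.map(tokenize, batched=True)
print(tokenized["train"][0].keys())  # input_ids, attention_mask, plus original columns
```

At training time, dynamic padding with a `DataCollatorWithPadding` is usually preferable to padding everything to a fixed length up front.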

All the best!

But for text segmentation, I think punctuation and other characters are needed.

“removing unnecessary punctuation”

The middle word is the important bit.

Could you elaborate on this further?