Seeking Guidance on Creating and Training a Model with a Specific Dataset

Hello Hugging Face community,

I’m currently working on a project that involves creating and training a machine learning model using a unique dataset. I would greatly appreciate your expertise and guidance on how to tackle this task effectively.

Dataset Description:

  • I have a dataset that contains comments and segmented text. Each comment appears to be related to a specific topic or experience, and the segmented text seems to be a breakdown of the comment’s content.

Objective:

  • My main objective is to leverage this dataset for various natural language processing (NLP) tasks. However, I’m uncertain about the best approach and would love to hear your suggestions.

Specific Questions and Challenges:

  1. How can I preprocess and clean this dataset effectively for NLP tasks such as sentiment analysis or text segmentation?
  2. What models or architectures would you recommend for tasks like sentiment analysis or text segmentation?
  3. Are there any specific libraries or tools within the Hugging Face ecosystem that I should consider using for this project?
  4. Any best practices or tips for training on datasets with this kind of structure?
  5. What should I keep in mind while fine-tuning a model on this dataset?

Hey,
Let’s tackle these one by one.

For preprocessing the dataset there are a few things you could do, like text normalisation, feature extraction, and various other steps. The preprocessing methods vary between NLP tasks, so be sure you know what suits your purpose best.
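To make the normalisation part concrete, here is a minimal cleaning sketch in Python. The file name `comments.csv` and the column names `comment` / `segmented_text` are only assumptions about your layout, so adjust them to whatever your dataset actually uses:

```python
import re

import pandas as pd

# Hypothetical file/column names -- replace with your actual dataset layout.
df = pd.read_csv("comments.csv")  # assumed columns: "comment", "segmented_text"


def normalise(text: str) -> str:
    """Light normalisation: collapse whitespace and strip control characters."""
    text = str(text)
    text = re.sub(r"[\r\n\t]+", " ", text)             # newlines/tabs -> space
    text = re.sub(r"\s{2,}", " ", text)                # collapse repeated spaces
    text = re.sub(r"[\x00-\x08\x0b-\x1f]", "", text)   # drop control characters
    return text.strip()


df["comment"] = df["comment"].map(normalise)
df = df.dropna(subset=["comment"]).drop_duplicates(subset="comment")
df.to_csv("comments_clean.csv", index=False)
```

How aggressive you want to be (lowercasing, stripping characters, etc.) really depends on the downstream task, which is why I would keep the cleaning function small and task-specific.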

For sentiment analysis, I would recommend using BERT or DistilBERT, and for text segmentation you can use a BiLSTM or transformer-based models like GPT-2.
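If you want a quick feel for how a pretrained sentiment model behaves on your comments before fine-tuning anything, a pipeline call is enough. The checkpoint below is the standard SST-2 DistilBERT model; swap in your own fine-tuned model later:

```python
from transformers import pipeline

# Off-the-shelf DistilBERT fine-tuned on SST-2; fine-tune your own model
# if you need domain-specific sentiment labels.
sentiment = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

print(sentiment("This comment breaks down the experience really clearly."))
# -> [{'label': 'POSITIVE', 'score': 0.99...}]
```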

From what I can see, the dataset seems comprehensive in terms of structure, but for an effective training regime you should consider removing unnecessary punctuation, extra spaces, and other anomalies. It would also be really helpful to tokenize the dataset before training.
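Tokenising with the Hugging Face `datasets` + `transformers` stack is usually just a `map` call. Again, the file name and the `comment` column are assumptions carried over from the cleaning sketch above:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Assumed file/column names from the earlier sketch -- adjust to your data.
dataset = load_dataset("csv", data_files="comments_clean.csv")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")


def tokenize(batch):
    # Truncate long comments so every example fits the model's context window.
    return tokenizer(batch["comment"], truncation=True, max_length=128)


tokenized = dataset.map(tokenize, batched=True)
print(tokenized["train"][0].keys())  # input_ids, attention_mask, plus original columns
```

At training time, dynamic padding with a `DataCollatorWithPadding` is usually preferable to padding everything to a fixed length up front.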

All the best!

But for text segmentation, I think punctuation and other characters are needed.

“removing unnecessary punctuation”

The middle word is the important bit.

Could you elaborate on this further?