Training a model from scratch on my own dataset

Hi folks,
I’m trying to train a couple of models on a sequence classification task.
I’ve been following the tutorial, but I still have some questions because my data is not in text format.

First, my data is already numeric: sequences of varying lengths in the following format: `{inputs: [0,1,1,0,1,1,1,0,1,1,0,1,1,0], labels: 0}`.

How should I use the tokenizer in this case? Do I just pad the sequences?
Any MWE would be appreciated :pray:
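For context, here is roughly what I mean by "just padding" (a pure-Python sketch, no tokenizer; `PAD_ID = 2` is my assumption, since with a binary vocabulary of 0/1 a third id would need to be reserved for padding):

```python
# Pad each sequence of 0/1 token ids to the length of the longest
# sequence in the batch, and build an attention mask so the model
# can ignore the padding positions.
PAD_ID = 2  # assumed padding id, not part of the 0/1 vocabulary

def pad_batch(sequences):
    max_len = max(len(s) for s in sequences)
    input_ids, attention_mask = [], []
    for s in sequences:
        pad = max_len - len(s)
        input_ids.append(list(s) + [PAD_ID] * pad)
        attention_mask.append([1] * len(s) + [0] * pad)
    return input_ids, attention_mask

batch = [[0, 1, 1, 0, 1], [1, 0, 1]]
ids, mask = pad_batch(batch)
# ids  -> [[0, 1, 1, 0, 1], [1, 0, 1, 2, 2]]
# mask -> [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

Is this the right idea, or should I still go through a tokenizer/data collator?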

I’ve read in the docs that BERT-style models are better suited to this kind of task. Could anyone list a couple of models to try here?

Note that I want to train from scratch, not fine-tune a pretrained model.
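To make the from-scratch part concrete, this is the kind of thing I have in mind: instantiating a model from a fresh config instead of `from_pretrained` (a sketch; all the sizes below are guesses I'd need to tune, and `vocab_size=3` assumes token ids 0, 1 plus a padding id 2):

```python
from transformers import BertConfig, BertForSequenceClassification

# Build a small BERT config from scratch; every size here is an
# assumption for illustration, not a recommended setting.
config = BertConfig(
    vocab_size=3,                # ids 0 and 1, plus an assumed pad id 2
    hidden_size=128,
    num_hidden_layers=4,
    num_attention_heads=4,
    intermediate_size=256,
    max_position_embeddings=64,  # longest sequence I expect
    pad_token_id=2,
    num_labels=2,                # binary sequence classification
)

# Randomly initialized weights -- no pretrained checkpoint involved.
model = BertForSequenceClassification(config)
```

Does this make sense, or is there a model family better suited to short binary sequences?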