Numerical sequence classification

I have a dataset of numerical sequences for a classification task. My dataset, in CSV format, looks like this:
first row: [[0,1,0],[5,0,1],[4,1,1]] => target = 5
second row: [[1,2,0],…,[5,0,1]] => target = 2
third row: [[5,1,0],[5,0,2],[6,0,0],[1,2,0]] => target = 3
…
When I train a transformer on this dataset, accuracy doesn’t rise above 48%, and I don’t know what the problem is. I have been using common tokenizers such as a word-level tokenizer, the BERT tokenizer, etc.

I’m not sure whether I’m using the right tokenizer, or maybe I need to do some preprocessing on my data.
Please guide me on this.

You’re dealing with numerical sequences, so typical NLP tokenizers like a word-level tokenizer or the BERT tokenizer are not well suited here. Beyond the tokenizer, a few things to check:

Normalization: Your numerical data should be on a similar scale, typically between 0 and 1. This will help your model learn more efficiently. You can use Min-Max scaling or Z-score normalization for this.
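
For example, a minimal Min-Max scaling sketch, assuming the sequences are loaded as nested Python lists like the rows above (`min_max_scale` is just an illustrative helper, not part of any library):

```python
import numpy as np

def min_max_scale(sequences):
    """Scale every feature to [0, 1] using the global min/max computed across all sequences."""
    flat = np.concatenate([np.asarray(s, dtype=np.float32) for s in sequences])
    lo, hi = flat.min(axis=0), flat.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero for constant features
    return [(np.asarray(s, dtype=np.float32) - lo) / span for s in sequences]

# Example with the shapes from the question
seqs = [[[0, 1, 0], [5, 0, 1], [4, 1, 1]],
        [[5, 1, 0], [5, 0, 2], [6, 0, 0], [1, 2, 0]]]
scaled = min_max_scale(seqs)
```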

Class Imbalance: If some target classes have significantly more instances than others, the model may become biased towards the majority class. You can try over-sampling the minority class or under-sampling the majority class to address this.
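
If you go the over-sampling route, a minimal sketch could look like this (`sequences` and `targets` are placeholder names for however you load the CSV):

```python
import numpy as np

def oversample(sequences, targets, seed=0):
    """Duplicate examples of rarer classes until every class matches the largest class size."""
    rng = np.random.default_rng(seed)
    targets = np.asarray(targets)
    classes, counts = np.unique(targets, return_counts=True)
    max_count = counts.max()
    indices = []
    for cls, count in zip(classes, counts):
        cls_idx = np.flatnonzero(targets == cls)
        extra = rng.choice(cls_idx, size=max_count - count, replace=True)  # duplicates
        indices.extend(cls_idx.tolist() + extra.tolist())
    rng.shuffle(indices)
    return [sequences[i] for i in indices], targets[indices].tolist()
```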

Sequence Length Uniformity: Since your sequences have varying lengths, padding or truncation will help to standardize them. This will ensure that your transformer model can handle the input effectively.
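
A padding/truncation sketch that also builds the attention mask the transformer needs in order to ignore the padded steps (`max_len` and the function name are assumptions, not from any particular library):

```python
import numpy as np

def pad_sequences(sequences, max_len, n_features, pad_value=0.0):
    """Pad (or truncate) each sequence to max_len steps and build a 0/1 attention mask."""
    batch = np.full((len(sequences), max_len, n_features), pad_value, dtype=np.float32)
    mask = np.zeros((len(sequences), max_len), dtype=np.int64)
    for i, seq in enumerate(sequences):
        seq = np.asarray(seq, dtype=np.float32)[:max_len]  # truncate if too long
        batch[i, :len(seq)] = seq
        mask[i, :len(seq)] = 1  # 1 = real timestep, 0 = padding
    return batch, mask
```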

For numerical sequences you don’t need these tokenizers at all: after normalization, the numbers can be fed directly into the transformer, e.g. through a linear layer that projects each timestep’s feature vector to the model dimension.
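
As a rough sketch of what that can look like in PyTorch (the model, layer sizes, and the 6-class output are all assumptions based on the rows above; positional encodings are omitted for brevity, but you would normally add them):

```python
import torch
import torch.nn as nn

class NumericSequenceClassifier(nn.Module):
    """Project each numeric timestep into d_model, encode with a transformer
    encoder, then classify from the masked mean of the encoded timesteps."""

    def __init__(self, n_features=3, d_model=64, n_heads=4, n_layers=2, n_classes=6):
        super().__init__()
        self.input_proj = nn.Linear(n_features, d_model)  # takes the place of the tokenizer/embedding
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, x, mask):
        # x: (batch, seq_len, n_features), mask: (batch, seq_len) with 1 = real step, 0 = padding
        h = self.encoder(self.input_proj(x), src_key_padding_mask=(mask == 0))
        # masked mean pooling over the real timesteps only
        h = (h * mask.unsqueeze(-1)).sum(dim=1) / mask.sum(dim=1, keepdim=True)
        return self.classifier(h)

model = NumericSequenceClassifier()
x = torch.randn(8, 10, 3)    # batch of 8 padded sequences, 10 steps, 3 features each
mask = torch.ones(8, 10)     # all steps real in this toy example
logits = model(x, mask)      # shape (8, n_classes)
```

The `nn.Linear` projection plays the role the embedding layer plays in NLP models, so no tokenizer is involved anywhere in the pipeline.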
