I have a dataset of numerical sequences for a classification task. My dataset, in CSV format, looks like this:
first row: [[0,1,0],[5,0,1],[4,1,1]] => target=5
second row: [[1,2,0],…,[5,0,1]] => target=2
third row: [[5,1,0],[5,0,2],[6,0,0],[1,2,0]] => target=3
…
When I train a transformer on this dataset, accuracy doesn't rise above 48%, and I don't know what the problem is. I have been using common tokenizers such as a word tokenizer, the BERT tokenizer, etc.
I'm not sure whether I'm using the right tokenizer, or whether I need to do some preprocessing on my data.
Please guide me.
You're dealing with numerical sequences, so typical NLP tokenizers such as a word tokenizer or the BERT tokenizer are not well suited here. A few things to check:
Normalization: Your numerical data should be on a similar scale, typically between 0 and 1. This will help your model learn more efficiently. You can use Min-Max scaling or Z-score normalization for this.

Class imbalance: If some target classes have significantly more instances than others, the model may become biased towards the majority class. You can try over-sampling the minority class or under-sampling the majority class to address this.
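For the normalization step, here is a minimal sketch of global Min-Max scaling over variable-length sequences, assuming NumPy is available (`min_max_scale` is a hypothetical helper, not part of any library):

```python
import numpy as np

def min_max_scale(sequences):
    """Scale every value to [0, 1] using one global min/max across all sequences,
    so that values from different rows stay comparable."""
    flat = np.concatenate([np.asarray(s, dtype=float).ravel() for s in sequences])
    lo, hi = flat.min(), flat.max()
    if hi == lo:  # avoid division by zero if all values are identical
        return [np.zeros_like(np.asarray(s, dtype=float)) for s in sequences]
    return [(np.asarray(s, dtype=float) - lo) / (hi - lo) for s in sequences]

# Example rows shaped like the ones in the question
rows = [
    [[0, 1, 0], [5, 0, 1], [4, 1, 1]],
    [[5, 1, 0], [5, 0, 2], [6, 0, 0], [1, 2, 0]],
]
scaled = min_max_scale(rows)
```

If the feature columns have very different ranges, scaling per feature (column-wise) instead of globally may work better; scikit-learn's `MinMaxScaler` does that out of the box on 2-D data.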
Sequence Length Uniformity: Since your sequences have varying lengths, padding or truncation will help to standardize them. This will ensure that your transformer model can handle the input effectively.
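Padding and truncation can be sketched like this (a NumPy-based example; `pad_or_truncate` and the `max_len` choice are assumptions for illustration). The boolean mask records which timesteps are real, which transformer implementations typically accept as a padding mask:

```python
import numpy as np

def pad_or_truncate(sequences, max_len, n_features, pad_value=0.0):
    """Pad shorter sequences with pad_value and cut longer ones at max_len.
    Returns a dense batch plus a mask marking real (non-padded) timesteps."""
    batch = np.full((len(sequences), max_len, n_features), pad_value, dtype=float)
    mask = np.zeros((len(sequences), max_len), dtype=bool)
    for i, seq in enumerate(sequences):
        seq = np.asarray(seq, dtype=float)[:max_len]  # truncate if too long
        batch[i, : len(seq)] = seq                    # pad the rest with pad_value
        mask[i, : len(seq)] = True
    return batch, mask

rows = [
    [[0, 1, 0], [5, 0, 1], [4, 1, 1]],
    [[5, 1, 0], [5, 0, 2], [6, 0, 0], [1, 2, 0]],
]
batch, mask = pad_or_truncate(rows, max_len=5, n_features=3)
```

Pass the inverted mask to the attention layers (e.g. `src_key_padding_mask` in PyTorch's `nn.TransformerEncoder`) so padded positions don't influence the output.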
For numerical sequences, you don't need a tokenizer at all: after normalization, each timestep vector can be fed directly into the transformer, typically through a linear projection into the model dimension instead of an embedding lookup.
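Concretely, the rows shown in the question can be parsed straight into numeric arrays with the standard library, with no tokenizer involved. This sketch assumes each CSV row stores the sequence and target exactly in the text format shown above:

```python
import ast

# A raw row as it might appear in the file (format assumed from the question).
raw = "[[0,1,0],[5,0,1],[4,1,1]]=>target= 5"

seq_text, target_text = raw.split("=>target=")
sequence = ast.literal_eval(seq_text)  # nested list literal -> list of lists of ints
target = int(target_text)              # int() tolerates the surrounding whitespace

# From here: normalize the values, pad/truncate to a fixed length, and feed the
# float vectors directly into the model, e.g. via a linear input projection.
```

`ast.literal_eval` only evaluates Python literals, so it is safe to use on untrusted strings, unlike `eval`.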