How to deal with Data Imbalance

I want to fine-tune a pre-trained RoBERTa or ELECTRA model for multiclass sentiment classification on an imbalanced dataset. How should I handle this problem?

For class imbalance, one aspect to consider is making sure each batch carries enough signal to cover all the classes, including the rare ones. If some classes rarely appear in a batch, training can degenerate toward predicting only the majority classes.
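One simple way to get that coverage is to oversample the minority classes before shuffling, so random batches are likely to contain every class. This is a minimal sketch using `sklearn.utils.resample`; the label array `y` and placeholder features `X` are made up for illustration:

```python
import numpy as np
from sklearn.utils import resample

# Hypothetical labels for a 3-class sentiment task; class 2 is rare.
y = np.array([0] * 80 + [1] * 15 + [2] * 5)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

# Upsample every class to the majority-class count so that shuffled
# batches are likely to contain all classes.
majority = np.bincount(y).max()
X_parts, y_parts = [], []
for c in np.unique(y):
    Xc, yc = resample(X[y == c], y[y == c], replace=True,
                      n_samples=majority, random_state=0)
    X_parts.append(Xc)
    y_parts.append(yc)

X_bal = np.vstack(X_parts)
y_bal = np.concatenate(y_parts)
# Each class now has `majority` examples; shuffle before batching.
```

Oversampling duplicates minority examples, so pair it with shuffling and watch for overfitting on the repeated samples; class weights (shown below in this thread) are an alternative that leaves the data untouched.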

When evaluating test performance, though, keep the real class proportions you would observe in the real world, so your metrics reflect deployment conditions.
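On such a test set, plain accuracy can hide a class the model never predicts, so per-class metrics are worth checking. A small sketch with made-up predictions (the arrays are illustrative, not from a real model):

```python
import numpy as np
from sklearn.metrics import classification_report, f1_score

# Hypothetical predictions on an imbalanced test set; the model
# collapses the minority class 1 into class 0.
y_true = np.array([0] * 50 + [1] * 10 + [2] * 5)
y_pred = np.array([0] * 60 + [2] * 5)

acc = (y_true == y_pred).mean()                      # looks fine (~0.85)
macro_f1 = f1_score(y_true, y_pred, average='macro',
                    zero_division=0)                 # exposes the miss
print(classification_report(y_true, y_pred, zero_division=0))
```

Macro-averaged F1 weights every class equally, which is usually what you want to monitor on imbalanced sentiment data.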

I use a quick snippet to compute class weights from the label distribution and pass them to Keras's `fit`:

import numpy as np
from sklearn.utils import class_weight

# `outputs` holds the integer training labels
class_weights = dict(enumerate(class_weight.compute_class_weight(
    'balanced',
    classes=np.unique(outputs),
    y=outputs)))


history = nlp_model.fit(x_train, y_train,
                        batch_size=self.batch_size,
                        epochs=epochs,
                        class_weight=class_weights,
                        callbacks=self.callbacks,
                        shuffle=True,
                        validation_data=(x_test, y_test))