ASTFeatureExtractor

Hi,

I’m working on a Master’s dissertation on predicting music popularity using the AST model.

I’m now looking at the ASTFeatureExtractor, documented here: Audio Spectrogram Transformer. It converts raw audio files to Mel spectrograms.

It looks like the default value of the ‘max_length’ parameter of ASTFeatureExtractor is 1024. To me, 1024 means that only the first 10.24 seconds of each song will be fed to the model. Can anyone confirm that?

Regards


I think that’s about right. Changing the hop would probably change how much audio 1024 frames cover.
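As a rough check of that arithmetic, here is a minimal sketch. It assumes a 10 ms hop between frames and that the extractor keeps only the first max_length frames of longer audio. The helper name frames_to_seconds is hypothetical and is not part of any library.

```python
def frames_to_seconds(num_frames: int, hop_ms: float = 10.0) -> float:
    """Approximate audio duration covered by `num_frames` spectrogram frames.

    Assumption: consecutive frames are spaced `hop_ms` milliseconds apart.
    The analysis window is slightly longer than the hop, so the true span
    is a few milliseconds more; this is ignored here.
    """
    return num_frames * hop_ms / 1000.0


# Default max_length of 1024 frames with an assumed 10 ms hop
print(frames_to_seconds(1024))

# The same number of frames with a hypothetical 20 ms hop covers twice as much audio
print(frames_to_seconds(1024, hop_ms=20.0))
```

Under these assumptions a full-length song would be cut to roughly its first ten seconds, so splitting songs into chunks before extraction may be worth considering.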

In /your_dataset/run.sh, you need to specify the data json file path. You need to set dataset_mean and dataset_std; if you don’t know them, you can use our AudioSet stats (mean=-4.27, std=4.57). You need to set audio_length, which should be the number of frames (e.g., with a 10ms hop, 10-second audio=1000 frames). You need to set the metrics to one of [acc,mAP] and the loss to one of [CE,BCE]. You need to set the initial learning rate lr and the learning rate scheduler lrscheduler_{start,step,decay}. You also need to set the SpecAug parameters (freqm and timem; we recommend masking 48 frequency bins out of 128, and 20% of your time frames), the mixup rate (i.e., how many samples are mixup samples), batch size, etc. While it seems like a lot, it is easy if you start with one of our recipes: ast/egs/[audioset,esc50,speechcommands]/run.sh.
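To make the numbers in that recipe concrete, here is a small sketch that derives audio_length and timem from a clip duration, using the 10 ms hop, the 20% time-mask recommendation and the AudioSet stats quoted above. The function suggest_settings and its argument names are hypothetical; only the returned keys mirror the settings named in the text.

```python
def suggest_settings(clip_seconds: float,
                     hop_ms: float = 10.0,
                     time_mask_fraction: float = 0.20,
                     freqm: int = 48) -> dict:
    """Derive frame-based settings from a clip duration (illustrative only)."""
    # Number of frames: duration divided by the hop (10 s at 10 ms -> 1000)
    audio_length = round(clip_seconds * 1000.0 / hop_ms)
    # Time mask: recommended share of the time frames
    timem = round(audio_length * time_mask_fraction)
    return {
        "audio_length": audio_length,
        "freqm": freqm,          # 48 of 128 frequency bins, as recommended
        "timem": timem,
        "dataset_mean": -4.27,   # AudioSet stats quoted above, as a fallback
        "dataset_std": 4.57,
    }


for name, value in suggest_settings(10.0).items():
    print(f"{name}={value}")
```

The AudioSet mean and std are only a fallback; computing them on your own dataset is likely the better choice when that is feasible.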