I’m working on a Master’s dissertation on predicting music popularity using the AST model.
I’m now looking at the ASTFeatureExtractor (here: Audio Spectrogram Transformer), which converts raw audio files to Mel spectrograms.
It looks like the default value of the ‘max_length’ parameter of ASTFeatureExtractor is 1024. To me, 1024 means that only the first 10.24 seconds of each song will be fed into the model. Can anyone confirm that?
I think that’s probably about right. Changing the hop length might make a difference.
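For what it's worth, here is a rough sketch of the arithmetic behind that estimate. It assumes a 10 ms hop between frames (the figure used in the README excerpt quoted below) and ignores window-edge effects, so treat the result as approximate. The helper names `frames_to_seconds` and `seconds_to_frames` are made up for illustration and are not part of any library.

```python
def frames_to_seconds(num_frames: int, hop_ms: float = 10.0) -> float:
    """Approximate duration (in seconds) covered by num_frames spectrogram frames.

    hop_ms is an assumption: 10 ms between consecutive frames.
    """
    return num_frames * hop_ms / 1000.0


def seconds_to_frames(duration_s: float, hop_ms: float = 10.0) -> int:
    """Approximate number of frames needed to cover duration_s seconds."""
    return int(round(duration_s * 1000.0 / hop_ms))


if __name__ == "__main__":
    max_length = 1024  # the default mentioned in the question
    print(frames_to_seconds(max_length))  # duration covered by 1024 frames
    print(seconds_to_frames(30.0))        # frames needed for a 30-second excerpt
```

If that assumption holds, covering more of each song would mean raising the number of frames (or splitting the song into several windows) rather than relying on the default.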
In /your_dataset/run.sh, you need to specify the data JSON file path. You need to set dataset_mean and dataset_std; if you don’t know them, you can use our AudioSet stats (mean=-4.27, std=4.57). You need to set audio_length, which should be the number of frames (e.g., with a 10ms hop, 10-second audio=1000 frames). You need to set the metrics in [acc,mAP] and the loss in [CE,BCE]. You need to set the initial learning rate lr and the learning rate scheduler lrscheduler_{start,step,decay}. You also need to set the SpecAug parameters (freqm and timem; we recommend masking 48 frequency bins out of 128, and 20% of your time frames), the mixup rate (i.e., how many samples are mixup samples), batch size, etc. While it seems like a lot, it is easy if you start with one of our recipes: ast/egs/[audioset,esc50,speechcommands]/run.sh.
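To make the quoted recommendations concrete, here is a minimal sketch that derives audio_length and timem from a clip duration. It only encodes the numbers stated in the quote (10 ms hop, 48 of 128 frequency bins, 20% of time frames, AudioSet mean and std). The class `AstRunSettings` and the function `suggest_settings` are hypothetical names for illustration; they are not part of the AST repository, and the values would still have to be copied into run.sh by hand.

```python
from dataclasses import dataclass


@dataclass
class AstRunSettings:
    """Hypothetical container for a few of the values run.sh asks for."""
    audio_length: int    # number of spectrogram frames
    freqm: int           # SpecAug frequency mask size (bins)
    timem: int           # SpecAug time mask size (frames)
    dataset_mean: float
    dataset_std: float


def suggest_settings(
    clip_seconds: float,
    hop_ms: float = 10.0,              # assumption: 10 ms hop, as in the quoted example
    time_mask_fraction: float = 0.20,  # "20% of your time frames"
    freqm: int = 48,                   # "48 frequency bins out of 128"
    dataset_mean: float = -4.27,       # AudioSet stats from the quote
    dataset_std: float = 4.57,
) -> AstRunSettings:
    # Frames needed to cover the clip at the given hop.
    audio_length = int(round(clip_seconds * 1000.0 / hop_ms))
    # Time mask as a fraction of the total number of frames.
    timem = int(round(audio_length * time_mask_fraction))
    return AstRunSettings(audio_length, freqm, timem, dataset_mean, dataset_std)


if __name__ == "__main__":
    print(suggest_settings(10.0))
```

For your own music dataset you would probably want to compute dataset_mean and dataset_std yourself instead of reusing the AudioSet values, since the quote offers those only as a fallback.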