Wav2vec 2.0 is a state-of-the-art (SoTA) self-supervised model for ASR. It would be interesting to explore it, or other transformer architectures, for Music AI applications. Depending on bandwidth, we can focus on one of the following:
Vocal separation or instrument segmentation
One thing I am curious about:
If we train a music-to-lyrics model, what features are learnt by each of its layers? Could we then fine-tune such a model on other downstream tasks (e.g. instrument segmentation)?
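One standard way to answer the "what do the layers learn" question is linear probing: freeze the model, take the hidden states from each layer, and train a small linear classifier per layer on a downstream label; the layer whose probe scores highest is where that information is most linearly accessible. Below is a minimal, self-contained sketch of that idea. All data here is synthetic (I stand in for real per-layer hidden states with random features where deeper layers leak more label signal, purely as a toy assumption); with a real model you would substitute actual frame-level hidden states, e.g. from each transformer layer of the trained music-to-lyrics model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for real data: n_frames audio frames, each with a
# dim-dimensional feature per layer and a binary downstream label
# (e.g. vocals present / absent). Shapes and names are illustrative.
n_frames, dim, n_layers = 2000, 16, 4
labels = rng.integers(0, 2, size=n_frames)

def synthetic_layer_features(layer, labels):
    """Toy assumption: deeper layers encode the label more strongly."""
    signal = (layer / (n_layers - 1)) * 2.0          # strength 0 .. 2
    feats = rng.normal(size=(len(labels), dim))
    feats[:, 0] += signal * (2 * labels - 1)         # label leaks into dim 0
    return feats

def probe_accuracy(feats, labels, epochs=200, lr=0.1):
    """Train a logistic-regression probe on frozen features (plain GD)."""
    w = np.zeros(feats.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))   # sigmoid
        grad = p - labels                            # dLoss/dlogit per frame
        w -= lr * feats.T @ grad / len(labels)
        b -= lr * grad.mean()
    preds = (feats @ w + b) > 0
    return (preds == labels).mean()

accs = [probe_accuracy(synthetic_layer_features(layer, labels), labels)
        for layer in range(n_layers)]
for layer, acc in enumerate(accs):
    print(f"layer {layer}: probe accuracy = {acc:.2f}")
```

In this toy setup, probe accuracy should rise with depth (near chance at layer 0, high at the last layer), mirroring how a real probing study would localize task-relevant features before deciding which layers to keep frozen versus fine-tune for a downstream task.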