Wav2vec For Music Applications (generation, captioning, instrument classification)

Wav2vec2 is state-of-the-art (SoTA) for ASR. It would be interesting to explore this model, or other transformer architectures, for Music AI applications. We could focus on one of the following, depending on bandwidth:
- Instrument classification
- Vocal separation or instrument segmentation
- Emotion/rhythm/pitch analysis
- Pitch shift
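For the first task, a minimal sketch of what instrument classification on top of wav2vec2 could look like: a linear head over mean-pooled frame features. All shapes and names here are assumptions (768-dim features as in wav2vec2-base, a hypothetical 10-instrument label set); in practice the features would come from something like `Wav2Vec2Model(...).last_hidden_state`, but a random tensor stands in so the sketch runs without downloading a checkpoint.

```python
import torch
import torch.nn as nn

NUM_INSTRUMENTS = 10  # hypothetical label set (piano, guitar, ...)

class InstrumentClassifier(nn.Module):
    """Linear classification head over frozen wav2vec2-style features."""

    def __init__(self, feature_dim=768, num_classes=NUM_INSTRUMENTS):
        super().__init__()
        self.head = nn.Linear(feature_dim, num_classes)

    def forward(self, features):
        # features: (batch, frames, feature_dim) from the wav2vec2 encoder
        pooled = features.mean(dim=1)  # mean-pool over time frames
        return self.head(pooled)

clf = InstrumentClassifier()
# Stand-in for encoder output: ~10 s of audio at ~50 frames/s
fake_features = torch.randn(2, 499, 768)
logits = clf(fake_features)
print(logits.shape)
```

Fine-tuning would then just mean unfreezing some or all of the encoder and training end-to-end on labeled clips.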

One thing I am curious about:
If we train a music-to-lyrics model, what features are being learnt by the layers? Can we fine-tune such a model on the other downstream tasks?
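One common way to answer the "what do the layers learn" question is layer-wise probing: fit a small linear probe on each transformer layer's hidden states and compare how well each layer predicts a downstream label. A hedged sketch of the loop, assuming wav2vec2-base dimensions (12 layers, 768-dim features); with a real model the hidden states would come from `Wav2Vec2Model(..., output_hidden_states=True)`, but random tensors stand in here so the sketch runs offline.

```python
import torch
import torch.nn as nn

NUM_LAYERS, FEATURE_DIM, NUM_CLASSES = 12, 768, 10  # assumed wav2vec2-base shapes

# Stand-in for per-layer encoder outputs: (batch, frames, feature_dim)
hidden_states = [torch.randn(4, 249, FEATURE_DIM) for _ in range(NUM_LAYERS)]
labels = torch.randint(0, NUM_CLASSES, (4,))  # hypothetical downstream labels

probe_accuracy = []
for layer_feats in hidden_states:
    probe = nn.Linear(FEATURE_DIM, NUM_CLASSES)  # one fresh linear probe per layer
    pooled = layer_feats.mean(dim=1)             # mean-pool over time frames
    logits = probe(pooled)
    # A real probe would be trained and evaluated on held-out data;
    # this just records the untrained score to keep the sketch short.
    acc = (logits.argmax(dim=-1) == labels).float().mean().item()
    probe_accuracy.append(acc)

print(len(probe_accuracy))  # one score per layer
```

Plotting accuracy per layer would show where task-relevant features (pitch, timbre, lyrics alignment) concentrate, which also suggests which layers to freeze when fine-tuning on the other tasks.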


Hey @tanmaylaud,

This paper: [2105.01051] SUPERB: Speech processing Universal PERformance Benchmark might be of interest. I’m also currently working on adding Wav2Vec2 in Flax, see: [WIP][Flax] Add wav2vec2 by patrickvonplaten · Pull Request #12271 · huggingface/transformers · GitHub. Hopefully this will be merged by next week.


Hi, I recently ran a semester project in musicology and tried to use wav2vec for music popularity prediction. Unfortunately it was without success, but maybe you'll find the code useful for other purposes: DH-401/milestone3-wav2vec2.ipynb at main · Glorf/DH-401 · GitHub (the project description is in the repo)