Adding Audio-MAE to Transformers

Hello Hugging Face Community,

I have trained an Audio-MAE model on 30M underwater sound samples from the Orcasound dataset, using 8 H100 GPUs for over a month. While exploring the Hugging Face Transformers library, I noticed that there is an implementation of the Vision Transformer MAE (ViTMAE), but none for Audio-MAE.

Given the growing interest in underwater audio-based models, I wanted to ask whether it would be possible to add Audio-MAE to the Transformers library, along with my checkpoint pretrained on the 30 million samples. I believe it could be valuable for a range of audio applications, especially in underwater acoustics.

Thank you in advance for your feedback!

Best regards,


Hello @Ahmed-Telili!

Thank you for sharing your incredible work and for bringing this to the community’s attention! Training an Audio-MAE model on 30 million underwater sound samples is an impressive achievement, and your suggestion to add the model to the Hugging Face Transformers library is both valuable and exciting.

Here’s how you could move forward:

  1. Open a Feature Request:

    • You can create a feature request on the Hugging Face GitHub repository (Transformers Issues).
    • In the request, include details about Audio-MAE, your training setup, dataset (Orcasound), and potential applications, emphasizing its value for underwater acoustics and other domains.
  2. Prepare a Pretrained Model for Sharing:

    • If you’re comfortable, consider uploading your pretrained model to the Hugging Face Model Hub. This would make it accessible to the community and encourage adoption (a minimal upload sketch follows this list).
    • Use a descriptive README for your model card, including details like:
      • Training dataset and methodology.
      • Potential use cases (e.g., marine research, underwater sound monitoring).
      • Limitations or biases in the dataset.
  3. Contribute an Implementation:

    • If you’re open to contributing code, you can fork the Transformers repository and create an implementation for Audio-MAE, taking inspiration from the existing ViTMAE implementation (see the prototype sketch after this list).
    • Add relevant documentation and tests to make the integration seamless.
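
On point 2, here is a minimal sketch of what pushing your checkpoint to the Hub could look like with the `huggingface_hub` library. The repo ID and local folder path below are placeholders, so adjust them to your setup:

```python
from huggingface_hub import HfApi

api = HfApi()

# Placeholder repo ID and checkpoint directory (replace with your own).
repo_id = "your-username/audio-mae-orcasound"
checkpoint_dir = "./audio-mae-checkpoint"

# Create the model repo on the Hub (no-op if it already exists).
api.create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)

# Upload the checkpoint folder; include a README.md model card describing
# the Orcasound training data, intended use cases, and limitations.
api.upload_folder(
    repo_id=repo_id,
    folder_path=checkpoint_dir,
    commit_message="Add Audio-MAE pretrained on 30M Orcasound samples",
)
```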
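On point 3, since Audio-MAE treats log-mel spectrograms essentially as single-channel images, one way to prototype before writing a full port is to configure the existing ViTMAE classes for spectrogram-shaped inputs. A rough sketch (the spectrogram dimensions and masking ratio here are illustrative, not the exact settings from the Audio-MAE paper):

```python
import torch
from transformers import ViTMAEConfig, ViTMAEForPreTraining

# Treat a 128-mel x 128-frame log-mel spectrogram as a single-channel
# "image". A square input keeps ViTMAE's 2D sin-cos position embeddings
# valid, since the implementation assumes a square patch grid.
config = ViTMAEConfig(
    image_size=128,
    patch_size=16,
    num_channels=1,
    mask_ratio=0.75,  # high masking ratio, as in MAE-style pretraining
)
model = ViTMAEForPreTraining(config)

# Dummy batch of spectrograms: (batch, channels, mel_bins, time_frames).
spectrograms = torch.randn(2, 1, 128, 128)
outputs = model(pixel_values=spectrograms)
print(outputs.loss)  # masked-patch reconstruction loss
```

Note this is only a prototype: a faithful Audio-MAE port would also need pieces ViTMAE doesn't have, such as the paper's local-attention decoder. For the actual contribution, the `transformers-cli add-new-model-like` command can scaffold a new model folder from ViTMAE, which you can then adapt.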

This initiative could significantly benefit the audio research community, especially in niche domains like underwater acoustics. Kudos to you for leading the way, and I’m excited to see where this goes!

Best regards,
Alan.
