How to use I-JEPA for image classification

Hello everyone,

I am currently trying to implement image classification using I-JEPA. The paper from Yann LeCun ([2111.06377] Masked Autoencoders Are Scalable Vision Learners) mentions that it can be applied to image classification, which piqued my interest in exploring it for this task. However, I'm a bit confused about the actual implementation.

From the repository provided on GitHub, I am finding it hard to understand how to modify the model to add a linear classifier. I am also unclear about how to re-train the model on my own data. Pre-trained models are available on their GitHub, but I must admit I am finding it difficult to grasp how to leverage them for my purpose.

Could anyone with experience with I-JEPA help me understand the process? Any guidance on how to adapt the model for image classification, and potentially how to use the pre-trained models, would be greatly appreciated.

Looking forward to your suggestions and guidance. Thanks in advance!


Hi Nathan,

Did you figure this out? I have been thinking about doing the same and have finally come around to actually attempting it. Have you tested this?


The paper you refer to (MAE, or masked autoencoders) is available as ViTMAEForImageClassification in the Transformers library. It adds a linear classifier on top of the base ViTMAEModel. There’s also the ViTMAEForPreTraining class which adds the decoder used for pre-training.

Refer to the docs: ViTMAE. You can fine-tune it easily on your custom dataset by following the image classification notebook or example scripts.
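As an illustration, here is a minimal sketch of attaching a linear classifier to the base `ViTMAEModel`. The tiny configuration below is made up purely for demonstration; a real run would instead load a pretrained checkpoint (e.g. `facebook/vit-mae-base`) and use its actual hidden size. `mask_ratio` is set to 0 so the encoder sees every patch during fine-tuning, unlike in pre-training:

```python
import torch
import torch.nn as nn
from transformers import ViTMAEConfig, ViTMAEModel

# Tiny random config for illustration only; in practice you would load a
# pretrained checkpoint, e.g. ViTMAEModel.from_pretrained("facebook/vit-mae-base").
config = ViTMAEConfig(
    hidden_size=64, num_hidden_layers=2, num_attention_heads=4,
    intermediate_size=128, image_size=32, patch_size=8,
    mask_ratio=0.0,  # disable random masking so the classifier sees all patches
)

class MAEClassifier(nn.Module):
    def __init__(self, config, num_classes):
        super().__init__()
        self.backbone = ViTMAEModel(config)
        self.head = nn.Linear(config.hidden_size, num_classes)

    def forward(self, pixel_values):
        outputs = self.backbone(pixel_values=pixel_values)
        cls_token = outputs.last_hidden_state[:, 0]  # [CLS] token embedding
        return self.head(cls_token)

model = MAEClassifier(config, num_classes=10)
logits = model(torch.randn(2, 3, 32, 32))
print(logits.shape)  # torch.Size([2, 10])
```

From there, training is ordinary supervised fine-tuning with cross-entropy on the logits, as in the image classification notebooks mentioned above.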

I-JEPA is a different paper: [2301.08243] Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture, but the architecture is very similar (namely, a Vision Transformer). One would need to add I-JEPA as a separate model in the Transformers library.

How would you go about adding it to the library? I’d love to do that. @nielsr

I have managed to add a classification layer on top of the I-JEPA encoder, but it does not seem very accurate and takes a while to train. I need to experiment with the hyperparameters a bit, but I still find it strange that the out-of-the-box classification abilities are so limited.
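For what it's worth, one detail that matters with I-JEPA is pooling: its ViT encoder has no [CLS] token, so a linear probe is usually trained on the average-pooled patch tokens, with the backbone frozen and a normalization layer before the head. A hedged PyTorch sketch, where `ToyEncoder` is a made-up stand-in for the pretrained encoder from the official repo:

```python
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Made-up stand-in for the pretrained I-JEPA ViT encoder: any module
    that maps images to a sequence of patch embeddings (B, N, D) works."""
    def __init__(self, dim=32, patch=8):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, x):
        return self.proj(x).flatten(2).transpose(1, 2)  # (B, N, dim)

class LinearProbe(nn.Module):
    def __init__(self, encoder, embed_dim, num_classes):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():  # freeze the backbone
            p.requires_grad = False
        self.norm = nn.LayerNorm(embed_dim)  # normalize pooled features
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, pixel_values):
        with torch.no_grad():
            tokens = self.encoder(pixel_values)  # (B, N, embed_dim)
        pooled = tokens.mean(dim=1)  # average-pool: I-JEPA has no [CLS] token
        return self.head(self.norm(pooled))

probe = LinearProbe(ToyEncoder(dim=32), embed_dim=32, num_classes=10)
logits = probe(torch.randn(2, 3, 32, 32))
print(logits.shape)  # torch.Size([2, 10])
```

Training then optimizes only the norm and head (e.g. pass `filter(lambda p: p.requires_grad, probe.parameters())` to the optimizer). In my experience, poor probe accuracy often comes down to pooling the wrong tokens, missing feature normalization, or a learning rate that is too low for a randomly initialized head.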