I am a beginner with Hugging Face, but I really enjoy the community. I have a question about how to use Hugging Face to train a combined model. For example, I want to train an image captioning model: I want to extract the SAM and CLIP features of an image and feed them into GPT-2 (or Llama 3.1) to generate a natural language description (or to perform a classification task). Can someone here tell me how to build this with Hugging Face?
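
To make the setup concrete, here is a minimal sketch of one possible way to wire it up, not a definitive implementation. It assumes the SAM and CLIP features have already been extracted by frozen encoders and arrive as tensors of shape `(batch, tokens, dim)`. The wrapper name `PrefixCaptioner`, the feature dimensions, and the tiny randomly initialized GPT-2 config are all hypothetical choices for illustration; random tensors stand in for real image features so that nothing needs to be downloaded.

```python
import torch
import torch.nn as nn
from transformers import GPT2Config, GPT2LMHeadModel


class PrefixCaptioner(nn.Module):
    """Hypothetical wrapper: projects precomputed image features into
    GPT-2's embedding space and prepends them to the caption tokens."""

    def __init__(self, gpt2, sam_dim, clip_dim):
        super().__init__()
        self.gpt2 = gpt2
        hidden = gpt2.config.n_embd
        # One linear projection per feature source (an assumption; an MLP
        # or a small transformer could be used instead).
        self.sam_proj = nn.Linear(sam_dim, hidden)
        self.clip_proj = nn.Linear(clip_dim, hidden)

    def forward(self, sam_feats, clip_feats, input_ids, labels=None):
        # Build the visual prefix: (batch, n_sam + n_clip, hidden).
        prefix = torch.cat(
            [self.sam_proj(sam_feats), self.clip_proj(clip_feats)], dim=1
        )
        # Embed the caption tokens with GPT-2's own embedding table.
        tok = self.gpt2.get_input_embeddings()(input_ids)
        inputs_embeds = torch.cat([prefix, tok], dim=1)
        if labels is not None:
            # -100 tells the loss to ignore the visual prefix positions.
            ignore = torch.full(
                prefix.shape[:2], -100, dtype=labels.dtype, device=labels.device
            )
            labels = torch.cat([ignore, labels], dim=1)
        return self.gpt2(inputs_embeds=inputs_embeds, labels=labels)


torch.manual_seed(0)
# Tiny random GPT-2 for illustration only; a real run would more likely
# start from GPT2LMHeadModel.from_pretrained("gpt2").
config = GPT2Config(vocab_size=100, n_positions=64, n_embd=64, n_layer=2, n_head=2)
model = PrefixCaptioner(GPT2LMHeadModel(config), sam_dim=256, clip_dim=768)

sam_feats = torch.randn(2, 4, 256)   # stand-in for pooled SAM features
clip_feats = torch.randn(2, 1, 768)  # stand-in for a CLIP image embedding
input_ids = torch.randint(0, 100, (2, 8))  # stand-in for tokenized captions

out = model(sam_feats, clip_feats, input_ids, labels=input_ids)
out.loss.backward()  # gradients reach both projections and GPT-2
print(out.logits.shape)
```

Because the wrapper is an ordinary `nn.Module` that returns a loss, it could be trained with a plain PyTorch loop. For a classification task, the language-model head could be swapped for a linear layer over the last hidden state, but that variant is not shown here.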