What is the proper way of handling multiple features in Huggingface?

I am working on a project that labels political videos based on several features, such as text-to-speech, text in the video, facial emotions, metadata, and so on.
I am new to Huggingface and ML in general, so my question is: what is the right way to combine all these features in Huggingface? I've tried two approaches: (1) storing each feature as its own text column in a dataset and updating the Features object accordingly, and (2) concatenating everything into a single text field, separated by commas or another separator. Surprisingly, the second approach achieved better accuracy. Why is that? And how am I supposed to feed integer or floating-point measures (e.g., a speaker's level of happiness) to my neural network?
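
To make the two approaches concrete, here is a minimal sketch of how the inputs are shaped in each case. The field names, the sample values, and the " [SEP] " separator string are placeholders made up for illustration, not my actual data or tokenizer settings.

```python
# Hypothetical example record; field names and values are illustrative only.
record = {
    "speech": "we will lower taxes",
    "on_screen_text": "Vote on Tuesday",
    "emotion": "happy",
    "happiness": 0.82,  # a floating-point measure, stored as text below
}


def as_columns(rec):
    """Approach 1: keep one text column per feature."""
    return {name: str(value) for name, value in rec.items()}


def as_single_text(rec, sep=" [SEP] "):
    """Approach 2: join all features into one string with a separator."""
    return sep.join(f"{name}: {value}" for name, value in rec.items())


columns = as_columns(record)
joined = as_single_text(record)

print(columns)
print(joined)
```

In both cases the happiness score ends up as the string "0.82", so the model only ever sees it as text, which is part of why I am asking how numeric features should be handled.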

Thank you!