How to get audio from an HF dataset into a format readable by whisper-timestamped

I am trying to use whisper-timestamped (GitHub - linto-ai/whisper-timestamped: Multilingual Automatic Speech Recognition with word-level timestamps and confidence) to get word-level timestamps for data from the Common Voice dataset, but I am struggling to get the audio into a format that the load_audio function accepts. When I pass the decoded audio, I get the following error: TypeError: expected str, bytes or os.PathLike object, not ndarray.
I tried converting the array to bytes, but then got: ValueError: embedded null byte. I feel there must be a better way, but I can't figure it out. Can anyone suggest a hint or a solution, please?
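
For context, both errors are consistent with load_audio expecting a file path that it hands to ffmpeg, rather than raw samples. One possible way around it, sketched here under assumptions, is to skip load_audio and pass the decoded array straight to transcribe, which accepts the same kind of 16 kHz mono float32 array that load_audio returns. The helper names hf_audio_to_whisper and transcribe_sample are hypothetical, and the sketch assumes each sample's "audio" entry exposes "array" and "sampling_rate". The numpy resampling is a crude stand-in; casting the column with Audio(sampling_rate=16000) from the datasets library is likely the cleaner option.

```python
import numpy as np

WHISPER_SAMPLE_RATE = 16_000  # load_audio() yields 16 kHz mono float32 audio


def hf_audio_to_whisper(audio):
    """Hypothetical helper: turn a decoded HF audio entry into a float32
    waveform at 16 kHz, i.e. the kind of array load_audio() would return."""
    samples = np.asarray(audio["array"], dtype=np.float32)
    rate = int(audio["sampling_rate"])
    if rate != WHISPER_SAMPLE_RATE:
        # Assumption: simple linear interpolation, for illustration only.
        # A proper resampler (for example via
        # dataset.cast_column("audio", Audio(sampling_rate=16000)))
        # is preferable for real use.
        n_out = int(round(len(samples) * WHISPER_SAMPLE_RATE / rate))
        old_t = np.arange(len(samples)) / rate
        new_t = np.arange(n_out) / WHISPER_SAMPLE_RATE
        samples = np.interp(new_t, old_t, samples).astype(np.float32)
    return samples


def transcribe_sample(model, sample, language=None):
    """Hypothetical wrapper: transcribe one dataset row without load_audio()."""
    import whisper_timestamped as whisper  # imported lazily; needs the package

    audio = hf_audio_to_whisper(sample["audio"])
    # transcribe() takes a file path or a waveform array, so no path is needed.
    return whisper.transcribe(model, audio, language=language)
```

With a model from whisper.load_model("tiny", device="cpu"), a call such as transcribe_sample(model, dataset[0]) should return the usual result dictionary. If a file path is really required, writing the array to a temporary WAV file and passing that path to load_audio is another option.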