Using the new "sesame/csm-1b" model and the CsmForConditionalGeneration class, I am attempting to stream the audio generation to minimize latency. I have successfully set up the Optional['BaseStreaming'] interface, which receives tokens as they are generated, but I am at a loss as to how to decode those tokens into audio tensors so I can stream them onward.
I tried to work out how to do this from the source code, but I was unable to find a solution.
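For context, here is a minimal sketch of the kind of streamer setup described above. It assumes the streamer interface only needs `put()` and `end()` methods, so it is duck-typed rather than subclassing anything from the library. The class name `ChunkedTokenStreamer`, the `frames_per_chunk` parameter, and the `decode_fn` callback are all hypothetical names introduced for illustration. `decode_fn` is a placeholder for the missing piece this question is about (turning buffered tokens into audio tensors). Exactly what the model passes to `put()` (shape, whether prompt tokens are included) is not verified here.

```python
from typing import Callable, List


class ChunkedTokenStreamer:
    """Hypothetical streamer: buffers incoming token frames and hands them
    to a decode callback in fixed-size chunks."""

    def __init__(self, decode_fn: Callable[[List[list]], None], frames_per_chunk: int = 4):
        # decode_fn is a placeholder for the (unknown) token-to-audio step.
        self.decode_fn = decode_fn
        self.frames_per_chunk = frames_per_chunk
        self._buffer: List[list] = []

    def put(self, value) -> None:
        # Accept a tensor-like object (anything with .tolist()) or a plain sequence.
        frame = value.tolist() if hasattr(value, "tolist") else list(value)
        self._buffer.append(frame)
        if len(self._buffer) >= self.frames_per_chunk:
            self._flush()

    def end(self) -> None:
        # Called once generation finishes: flush whatever is left over.
        if self._buffer:
            self._flush()

    def _flush(self) -> None:
        # Swap the buffer out before calling back, so the callback sees a stable list.
        chunk, self._buffer = self._buffer, []
        self.decode_fn(chunk)
```

An instance of this would be passed as the streamer argument to generation, with `decode_fn` replaced by whatever call converts a chunk of tokens into audio. Buffering several frames per chunk is a design choice on the assumption that decoding one frame at a time would be inefficient or produce audible seams, which is part of what I am unsure about.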