How to decode CSM tokens into audio tensors for streaming

Using the new 'sesame/csm-1b' model and the CsmForConditionalGeneration class, I am attempting to stream the audio generation to minimize latency. I have successfully set up the 'Optional["BaseStreamer"]' streamer interface, which receives tokens as they are generated, but I am at a loss as to how to decode those tokens into audio tensors so that I can stream them to an output.

I tried to work out how to do this from the source code, but I was unable to find a solution.


I found this.

Or with this function?
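
In case it helps, here is a minimal sketch of the buffering side of the problem: collect the frames your streamer receives and decode them in small chunks instead of waiting for the whole sequence. The class name FrameChunker, the chunk_frames parameter, and the decode_fn callable are all hypothetical names made up for this example, not part of transformers. The sketch assumes each item handed to put() is one frame, meaning the codebook tokens for a single time step; check what your streamer actually receives before relying on that.

```python
from typing import Any, Callable, List, Optional


class FrameChunker:
    """Collects codebook frames and decodes them in fixed-size chunks.

    decode_fn is a hypothetical callable supplied by the caller; it
    receives a list of frames and returns whatever audio object the
    caller wants to stream (for example a tensor of samples).
    """

    def __init__(self, decode_fn: Callable[[List[Any]], Any], chunk_frames: int = 10):
        if chunk_frames < 1:
            raise ValueError("chunk_frames must be at least 1")
        self.decode_fn = decode_fn
        self.chunk_frames = chunk_frames
        self._frames: List[Any] = []

    def put(self, frame: Any) -> Optional[Any]:
        """Add one frame; return decoded audio once a chunk is full, else None."""
        self._frames.append(frame)
        if len(self._frames) >= self.chunk_frames:
            return self._flush()
        return None

    def end(self) -> Optional[Any]:
        """Decode any frames left over when generation finishes."""
        if self._frames:
            return self._flush()
        return None

    def _flush(self) -> Any:
        # Hand the buffered frames to the decoder and start a new buffer.
        chunk, self._frames = self._frames, []
        return self.decode_fn(chunk)


# Stand-in decoder for illustration only: it just counts frames.
# With the real model, decode_fn would wrap the codec decoder instead.
chunker = FrameChunker(decode_fn=len, chunk_frames=3)
outputs = [chunker.put(frame) for frame in range(7)]
tail = chunker.end()
```

For decode_fn with the real model, my assumption (please verify against the source) is that the Mimi codec is exposed as model.codec_model and that its decode method takes codes shaped (batch, num_codebooks, frames) and returns an object with an audio_values field. If so, decode_fn would stack the buffered frames into that shape and return the audio_values. Decoding chunks independently may cause audible discontinuities at chunk boundaries, so a larger chunk_frames or some overlap between chunks might be needed.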