Understanding the Decision Transformer

I’m trying to implement my own Decision Transformer for a Reinforcement Learning application.
The Hugging Face tutorial helped me a lot in creating a first running example with my own data, but I’m stuck on the values returned by its data collator class (DecisionTransformerGymDataCollator).
Basically, the collator returns the following values:

        return {
            "states": s,
            "actions": a,
            "rewards": r,
            "returns_to_go": rtg,
            "timesteps": timesteps,
            "attention_mask": mask,
        }
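
For reference, this is roughly how I feed the collator output into the model. The batch size, sequence length and state/action dimensions below are dummy placeholders I made up for illustration, not my actual setup:

    import torch
    from transformers import DecisionTransformerConfig, DecisionTransformerModel

    # Dummy dimensions, just for illustration (not my real environment)
    batch_size, seq_len, state_dim, act_dim = 4, 20, 17, 6

    config = DecisionTransformerConfig(state_dim=state_dim, act_dim=act_dim)
    model = DecisionTransformerModel(config)

    # A batch shaped like the dict the collator returns above
    batch = {
        "states": torch.randn(batch_size, seq_len, state_dim),
        "actions": torch.randn(batch_size, seq_len, act_dim),
        "rewards": torch.randn(batch_size, seq_len, 1),
        "returns_to_go": torch.randn(batch_size, seq_len, 1),
        "timesteps": torch.arange(seq_len).unsqueeze(0).repeat(batch_size, 1),
        "attention_mask": torch.ones(batch_size, seq_len, dtype=torch.long),
    }

    outputs = model(**batch)
    print(outputs.action_preds.shape)  # (batch_size, seq_len, act_dim)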

But which “token” does the DT model use for predicting the next action?
In the case of NLP you have words, which are tokenized into integer values. To put it simply: 1 word → 1 token, so you have one unique integer value for each word.
But what is the equivalent in the case of the Decision Transformer? Considering the state, the reward and the return-to-go, you have at least three different values per timestep, which cannot be tokenized into a single integer as in NLP.
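
To make the comparison concrete, this is the mental model I have for NLP versus what a single Decision Transformer timestep contains (the vocabulary and numbers are made up, just to illustrate the question):

    # NLP: each word maps to exactly one integer id (toy vocabulary, made up)
    vocab = {"the": 0, "cat": 1, "sat": 2}
    tokens = [vocab[w] for w in ["the", "cat", "sat"]]  # -> [0, 1, 2]

    # Decision Transformer: one timestep is not a single integer but three
    # different quantities (return-to-go, state vector, action vector)
    dt_timestep = {
        "return_to_go": 3.7,               # scalar
        "state": [0.1, -0.4, 0.9, 0.0],    # continuous state vector
        "action": [0.5, -0.2],             # continuous action vector
    }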

Any help?
Thanks