You need a differentiable model to do the sampling for you
Let V be the set of words in the vocabulary. Some models define a reinforcement learning model with a state space vector x with dimension |V|, such that x_i can be any integer in V, and a discreet action space of all integers in V.
Someone linked a paper from salesforce which follows this general idea but adds a few useful bells and whistles.