Appropriate tokenizer for particular dataset?

[Reading the forum help, it seemed appropriate to post here rather than in beginners, even though I am a beginner, because the question is specific to Transformers…]

I have a dataset of approximately 10k records that looks like this:

| 2406 4713 8521 .
| 1309 3417 7211 8313 9403 .
| 1102 5403 .

That is, each row begins with a start character (|) and ends with a stop character (.), and rows vary in length (usually between 1 and 15 values).
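
To make the format concrete, here is a minimal sketch in plain Python of the most naive approach I can think of. It assumes (and this is only my assumption) that every whitespace-separated item, including | and ., should be one token. The helper names build_vocab, encode and decode are made up for illustration and do not come from any library.

```python
# Assumption: each whitespace-separated item ("|", "2406", ".") is one token.
rows = [
    "| 2406 4713 8521 .",
    "| 1309 3417 7211 8313 9403 .",
    "| 1102 5403 .",
]

def build_vocab(rows):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for row in rows:
        for token in row.split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def encode(row, vocab):
    """Turn one row into a list of token ids."""
    return [vocab[token] for token in row.split()]

def decode(ids, vocab):
    """Turn a list of token ids back into a space-separated row."""
    id_to_token = {i: token for token, i in vocab.items()}
    return " ".join(id_to_token[i] for i in ids)

vocab = build_vocab(rows)
ids = encode(rows[2], vocab)
print(ids)                 # ids for "| 1102 5403 ."
print(decode(ids, vocab))  # round-trips to the original row
```

With a scheme like this the vocabulary is just the set of distinct values in the dataset plus the two markers, so a value never seen in training has no id. I don't know whether that is acceptable here or whether a subword tokenizer would be a better fit.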

The data are a musical encoding of melodies I’m working on. I’d like to generate new complete sequences that have a strong relationship to the dataset.

Can anyone recommend an appropriate tokenizer for this task?

Many thanks, Michael