Appropriate tokenizer for particular dataset?

terekita · September 18, 2022, 6:05pm

[Reading the forum help, it seemed appropriate to post here rather than in beginners, even though I am a beginner, because the question is specific to Transformers…]

I have a dataset of approximately 10k records that looks like this:

| 2406 4713 8521 .
| 1309 3417 7211 8313 9403 .
| 1102 5403 .

That is, each row begins with a start character ("|") and ends with a stop character ("."), and rows vary in length (usually between 1 and 15 values).
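To make the format concrete, here is a minimal sketch in plain Python of what treating every whitespace-separated symbol as a single token might look like. The function names (`build_vocab`, `encode`, `decode`) are hypothetical and the three rows are just the samples above; this only illustrates the data layout and is not meant as the recommended tokenizer.

```python
# Sample rows copied from the post; all names below are illustrative assumptions.
rows = [
    "| 2406 4713 8521 .",
    "| 1309 3417 7211 8313 9403 .",
    "| 1102 5403 .",
]

def build_vocab(rows):
    """Map each distinct whitespace-separated symbol to an integer id."""
    symbols = sorted({sym for row in rows for sym in row.split()})
    return {sym: idx for idx, sym in enumerate(symbols)}

def encode(row, vocab):
    """Turn one row into a list of ids, keeping the start and stop symbols."""
    return [vocab[sym] for sym in row.split()]

def decode(ids, vocab):
    """Invert encode(): map ids back to symbols joined by single spaces."""
    inverse = {idx: sym for sym, idx in vocab.items()}
    return " ".join(inverse[i] for i in ids)

vocab = build_vocab(rows)
encoded = [encode(row, vocab) for row in rows]
```

In this sketch the start and stop characters get their own ids alongside the four-digit codes, so a generated sequence would be complete once the stop id is produced.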

The data are a musical encoding of melodies I’m working on. I’d like to generate new complete sequences that have a strong relationship to the dataset.

Can anyone recommend an appropriate tokenizer for this task?

many thanks, Michael
