How do I train a new tokenizer on a list of texts?

I have ~2000 PDFs, each ~1000 pages long. I will OCR all pages, ending up with a list of strings, each string holding the text of one page. Currently I have a subset of just 6 OCR’d pages that I use for testing the code.

My end goal is to obtain a CLS vector for each page, calculate the cosine similarity of these vectors for adjacent pages, and determine whether two adjacent pages belong to the same document (there are many documents in each of the 2000 PDFs).
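The comparison step itself is simple once I have the vectors; here is a minimal sketch of what I have in mind, with dummy CLS vectors (NumPy only — the real vectors would come from the model):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity of two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Dummy CLS vectors, one per page (768-dim, like BERT's hidden size).
rng = np.random.default_rng(0)
page_vectors = [rng.normal(size=768) for _ in range(6)]

# Compare each page with the next one; a low similarity would
# suggest a document boundary between the two pages.
for i in range(len(page_vectors) - 1):
    sim = cosine_similarity(page_vectors[i], page_vectors[i + 1])
    print(f"pages {i}/{i + 1}: {sim:.3f}")
```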

As the texts are in German and use medical vocabulary, I chose this checkpoint:

Nevertheless, I figured it would be a good idea to train the tokenizer on my data rather than use it in its vanilla form, because the documents may contain many proper nouns that might be important.

I read parts of the Hugging Face NLP Course, but I couldn’t find documentation on how to train the tokenizer using my own list of texts/strings. The example in that part of the tokenizer tutorial only shows how to train a tokenizer on data downloaded with the Datasets library.

So how do I go about training the tokenizer using my own text data? Could somebody more familiar with Hugging Face point me in the right direction? :slight_smile:

P.S. 1: To get good results, will I also need to modify/pretrain the model itself or only the tokenizer?
P.S. 2: I am also considering using cosine similarity on the top X pixels of the PDF pages as an image as one can often determine the type of a document based on logos or characteristic header layouts. Does this sound like a good idea for the task at hand?

You can create a dataset from any source, so just replace this step:

raw_datasets = load_dataset("code_search_net", "python")

If you want to read it from a local file:
raw_datasets = load_dataset('csv', data_files='...path to csv file')

See the Datasets “Load” guide for the different ways to load data.
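That said, you don’t strictly need a Dataset object here: `train_new_from_iterator` accepts any iterator of texts, so a plain Python list of page strings works directly. A sketch, using `bert-base-german-cased` as a stand-in for your medical checkpoint (swap in your own model name):

```python
from transformers import AutoTokenizer

# Your OCR'd pages — a plain list of strings is a valid iterator of texts.
pages = [
    "Befund: Der Patient zeigt keine Auffälligkeiten.",
    "Diagnose: Chronische Sinusitis, beidseitig.",
    # ... the rest of your pages
]

# Stand-in checkpoint; use your German medical checkpoint instead.
old_tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")

# Train a new tokenizer with the same algorithm and special tokens,
# but a vocabulary learned from your corpus.
new_tokenizer = old_tokenizer.train_new_from_iterator(pages, vocab_size=30000)

new_tokenizer.save_pretrained("my-medical-tokenizer")
print(new_tokenizer.tokenize("Chronische Sinusitis"))
```

Note that `train_new_from_iterator` only works with fast (Rust-backed) tokenizers. And regarding your P.S. 1: since the model’s embedding matrix is tied to the old vocabulary, a new tokenizer also means the model needs at least some retraining on your data (e.g. continued masked-language-model pretraining) before the CLS vectors become meaningful.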