I have ~2000 PDFs, each ~1000 pages long. I will OCR all pages, ending up with a list of strings, each string representing the text of one page. Currently I have a subset of just 6 OCR'd pages which I use for testing the code.
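For context, the per-page text list comes from something like the sketch below (pdf2image and pytesseract are just what I'm sketching with here; the exact OCR tooling isn't what my question is about):

```python
from pdf2image import convert_from_path  # requires poppler to be installed
import pytesseract

# Render one PDF and OCR each page; "deu" is the German Tesseract language pack
images = convert_from_path("some_scan.pdf", dpi=300)
page_texts = [pytesseract.image_to_string(img, lang="deu") for img in images]
# page_texts is the list of strings I refer to below, one entry per page
```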
My end goal is to obtain a CLS vector for each page, calculate the cosine similarity of these vectors for adjacent pages, and determine whether two adjacent pages belong to the same document (there are many documents in each of the 2000 PDFs).
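To make the similarity step concrete, here is a minimal sketch of what I picture; `bert-base-german-cased` is only a placeholder for the checkpoint I actually chose, and details like truncation are not settled yet:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Placeholder checkpoint name; the actual German/medical checkpoint is given below
checkpoint = "bert-base-german-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

pages = ["OCR text of page 1 ...", "OCR text of page 2 ...", "OCR text of page 3 ..."]

with torch.no_grad():
    enc = tokenizer(pages, padding=True, truncation=True, return_tensors="pt")
    cls_vectors = model(**enc).last_hidden_state[:, 0]  # [CLS] embedding per page

# Cosine similarity between each page and the next one
sims = torch.nn.functional.cosine_similarity(cls_vectors[:-1], cls_vectors[1:], dim=1)
print(sims)  # one score per adjacent page pair; how to threshold these is still open
```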
As the texts are in German and use medical vocabulary, I chose this checkpoint:
Still, I figured it would be a good idea to train the tokenizer on my own data rather than use it in its vanilla form, since the documents likely contain many proper nouns that matter for the task.
I read parts of the Hugging Face NLP Course, but I couldn't find documentation on how to train the tokenizer from my list of texts/strings. The example in this part of the Tokenizer tutorial only shows how to train a tokenizer on data downloaded with the Datasets library.
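My data, by contrast, is just a plain Python list of strings, and I imagine the call would look something like the sketch below, but I'm not sure whether `train_new_from_iterator` is even the right entry point or whether it accepts a plain list, which is part of what I'm asking:

```python
from transformers import AutoTokenizer

page_texts = ["OCR text of page 1 ...", "OCR text of page 2 ..."]  # my pages, plain strings

# Same placeholder checkpoint as above
old_tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")

# Is something like this the intended way, or do I need a batching generator / a Dataset?
new_tokenizer = old_tokenizer.train_new_from_iterator(page_texts, vocab_size=30_000)
```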
So how do I go about training the tokenizer using my own text data? Could somebody more familiar with Hugging Face point me in the right direction?
P.S. 1: To get good results, will I also need to adapt/pretrain the model itself, or is retraining only the tokenizer enough?
P.S. 2: I am also considering computing the similarity of the top X pixels of each PDF page rendered as an image, since one can often tell the type of a document from logos or characteristic header layouts. Does this sound like a good idea for the task at hand?
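Roughly what I have in mind is sketched below (just an idea, not validated; the header height of 300 px and the raw-pixel comparison are arbitrary choices):

```python
from pdf2image import convert_from_path
import numpy as np

def header_vector(pdf_path, page_number, header_px=300):
    """Render a single page and flatten its top strip (grayscale) into a vector."""
    img = convert_from_path(pdf_path, dpi=100,
                            first_page=page_number, last_page=page_number)[0]
    arr = np.asarray(img.convert("L"), dtype=np.float32)
    return arr[:header_px].ravel()

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# e.g. compare the headers of pages 4 and 5 of one PDF
# sim = cosine(header_vector("some_scan.pdf", 4), header_vector("some_scan.pdf", 5))
```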