I have a dataset of machine-generated sequences that are not natural language, but the order of the elements in the sequence is important. I want to create word embeddings using BERT to capture the sequential relationships between these elements. Can anyone provide guidance on how to preprocess and format the data for input into BERT, and how to fine-tune the model to generate useful embeddings for this type of data?
Note: The vocabulary in my data is not present in the pre-trained BERT model's vocabulary. How should I handle that to achieve my goal?
Example of my data (a list of sentences built from my vocabulary): ['ixeg6164 ox78dsf12 lx3cd875', 'duish7 oiu587 kj854j 987hdk', …]
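
To make the setup concrete, here is a minimal sketch of how sequences like the two in the example above could be mapped to integer IDs with a custom vocabulary. It assumes each whitespace-separated element is one token and borrows BERT's usual special tokens. The names `build_vocab` and `encode` and the `max_len` value are hypothetical, chosen only for illustration, and this is not a tested pipeline.

```python
# Sample data copied from the example above (the real list is longer).
sequences = [
    "ixeg6164 ox78dsf12 lx3cd875",
    "duish7 oiu587 kj854j 987hdk",
]

# BERT-style special tokens; adding them first keeps their IDs fixed.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]


def build_vocab(seqs):
    """Assign an integer ID to every whitespace-separated element (assumed tokenization)."""
    vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    for seq in seqs:
        for element in seq.split():
            if element not in vocab:
                vocab[element] = len(vocab)
    return vocab


def encode(seq, vocab, max_len=8):
    """Wrap a sequence in [CLS]/[SEP], map elements to IDs, and pad to max_len."""
    tokens = ["[CLS]"] + seq.split()[: max_len - 2] + ["[SEP]"]
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    # 1 marks real tokens, 0 marks padding positions.
    attention_mask = [1] * len(ids) + [0] * (max_len - len(ids))
    ids = ids + [vocab["[PAD]"]] * (max_len - len(ids))
    return ids, attention_mask


vocab = build_vocab(sequences)
input_ids, attention_mask = encode(sequences[0], vocab)
```

IDs and masks in this shape are what a BERT-style model would take as input. Because none of these elements appear in the pre-trained vocabulary, this sketch presumes the model would be trained from scratch (for example with masked-element prediction) rather than fine-tuned from existing weights.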