Text classification for an unknown language

I am a complete newbie trying to classify text from an unknown, arbitrary language. In my particular case, this is not even a natural language but JSON files in computer-speak that I’ve flattened into linear text (i.e., the “sentences”).

What is the best approach to this? I thought of using a preexisting text classifier (e.g., a “sentiment” classifier), but as far as I know all of those are trained on natural languages, which may be a far cry from JSON files.

Hi! In a JSON object, it’s possible to place even an encoded picture as a string. So, if you can, please provide more information about the data you want to work with. That will make it much easier to give you a precise answer.

Have a nice … !

I’d actually leave this more open-ended on purpose. Although I do happen to use JSON, what I meant to say is that my data looks as follows:

categorical_label, feature_1, …, feature_N,

where feature_k is any arbitrary sequence (be it JSON, the string of bits from an image, etc.). My point is that these sequences are not natural language, so a basic off-the-shelf sentence classifier wouldn’t work.
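To make the “flattened JSON as a sentence” idea concrete, here is a minimal sketch of one possible flattening. The `path=value` token format and the function names `flatten` and `to_sentence` are assumptions made for illustration, not necessarily the scheme used in this thread.

```python
import json


def flatten(obj, prefix=""):
    """Yield one 'path=value' token for every leaf of a parsed JSON value."""
    if isinstance(obj, dict):
        # Sort keys so the same object always produces the same token order.
        for key in sorted(obj):
            yield from flatten(obj[key], f"{prefix}{key}.")
    elif isinstance(obj, list):
        for index, item in enumerate(obj):
            yield from flatten(item, f"{prefix}{index}.")
    else:
        # Drop the trailing dot from the path and serialize the leaf as JSON.
        yield f"{prefix[:-1]}={json.dumps(obj)}"


def to_sentence(raw_json):
    """Turn a JSON string into a single space-separated 'sentence'."""
    return " ".join(flatten(json.loads(raw_json)))


example = to_sentence('{"b": [1, {"c": null}], "a": "x"}')
print(example)
```

Each such sentence can then serve as one feature_k in a row of the layout above. Empty objects and arrays produce no tokens in this sketch, which may or may not be what you want.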

  1. Load and preprocess the dataset:

https://huggingface.co/learn/nlp-course/chapter5/2

  2. Adapt the tokenizer to this task:

https://huggingface.co/learn/nlp-course/chapter6/2

  3. Find the pretrained model whose training data is most similar to yours. One example is GPT-2 adapted to code, but I assume that you’ve got some blockchain dump in this JSON, so it may also help to find a larger dataset that closely resembles your data.
  4. Finally, fine-tune this pretrained model on the classification task.

  5. Share this tool with me.
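The steps above could be sketched roughly as follows with the `datasets` and `transformers` libraries. This is only a sketch under stated assumptions: the file name `train.csv`, the column names `text` and `label`, the `gpt2` checkpoint, the vocabulary size, and the helper names `build_label_map`, `batch_iterator`, and `fine_tune` are all hypothetical choices, not something given in the thread.

```python
def build_label_map(labels):
    """Map each distinct categorical label to a stable integer id."""
    return {name: i for i, name in enumerate(sorted(set(labels)))}


def batch_iterator(texts, batch_size=1000):
    """Yield lists of texts, the shape train_new_from_iterator accepts."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


def fine_tune(train_file="train.csv", base_model="gpt2", vocab_size=8000):
    # Heavy imports stay local so the helpers above work without them.
    from datasets import load_dataset
    from transformers import (AutoModelForSequenceClassification,
                              AutoTokenizer, DataCollatorWithPadding,
                              Trainer, TrainingArguments)

    # Load rows; "text" and "label" are assumed column names.
    data = load_dataset("csv", data_files=train_file)["train"]
    texts = list(data["text"])
    label2id = build_label_map(list(data["label"]))

    # Retrain the base tokenizer's algorithm on the non-natural-language corpus.
    base_tokenizer = AutoTokenizer.from_pretrained(base_model)
    tokenizer = base_tokenizer.train_new_from_iterator(
        batch_iterator(texts), vocab_size)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token

    def encode(batch):
        encoded = tokenizer(batch["text"], truncation=True, max_length=512)
        encoded["labels"] = [label2id[name] for name in batch["label"]]
        return encoded

    data = data.map(encode, batched=True, remove_columns=data.column_names)

    # Load the pretrained model with a fresh classification head.
    model = AutoModelForSequenceClassification.from_pretrained(
        base_model, num_labels=len(label2id))
    # The new vocabulary no longer lines up with the pretrained embeddings,
    # so they are resized here and effectively relearned during fine-tuning.
    model.resize_token_embeddings(len(tokenizer))
    model.config.pad_token_id = tokenizer.pad_token_id

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="clf-out", num_train_epochs=3),
        train_dataset=data,
        data_collator=DataCollatorWithPadding(tokenizer),
    )
    trainer.train()
    return trainer
```

Calling `fine_tune()` would download the checkpoint and start training, so it is left uncalled here. One design caveat: swapping in a retrained tokenizer discards much of what the pretrained embeddings learned, which is why choosing a base model trained on data close to yours matters.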