I am a complete newbie trying to classify text from an unknown, arbitrary language. In my particular case, this is not even a natural language but JSON files in computer-speak that I’ve flattened into linear text (i.e., the “sentences”).
What is the best approach to this? I thought of using a preexisting text classifier (e.g., a “sentiment” classifier), but as far as I know all of those are trained on natural languages, which may be a far cry from JSON files.
Hi! A JSON object can hold almost anything, even an encoded picture as a string. So, if you can, please provide more information about the data you want to work with. That will make it much easier to give you a precise answer.
Have a nice … !
I’d actually leave this more open-ended on purpose. Although I do happen to use JSON, what I meant to say is that my data looks as follows:
categorical_label, feature_1, …, feature_N
where feature_k is any arbitrary sequence (be it JSON, the string of bits from an image, etc.). My point is that these sequences are not natural language, so a basic off-the-shelf sentence classifier wouldn’t work.
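
For readers following along, here is a minimal sketch of what "flattening JSON into linear text" and encoding the categorical label could look like. The `path=value` format and the helper names `flatten`, `to_sentence`, and `encode_labels` are assumptions made for illustration; the thread does not say how the original data was flattened.

```python
import json
from typing import Any, Dict, Iterator, List, Tuple


def flatten(value: Any, prefix: str = "") -> Iterator[Tuple[str, str]]:
    """Yield (path, leaf) pairs for a parsed JSON value (hypothetical helper)."""
    if isinstance(value, dict):
        # Sort keys so the same object always produces the same sequence.
        for key in sorted(value):
            child = f"{prefix}.{key}" if prefix else str(key)
            yield from flatten(value[key], child)
    elif isinstance(value, list):
        for index, item in enumerate(value):
            yield from flatten(item, f"{prefix}[{index}]")
    else:
        # Leaves are written back as JSON literals (true, 1, "x", null).
        yield prefix, json.dumps(value)


def to_sentence(raw_json: str) -> str:
    """Turn one JSON document into a single line of space-separated tokens."""
    pairs = flatten(json.loads(raw_json))
    return " ".join(f"{path}={leaf}" for path, leaf in pairs)


def encode_labels(labels: List[str]) -> Tuple[List[int], Dict[str, int]]:
    """Map categorical labels to integer ids, as classifiers usually expect."""
    mapping = {label: idx for idx, label in enumerate(sorted(set(labels)))}
    return [mapping[label] for label in labels], mapping
```

Each row then becomes an integer label plus one or more such "sentences", which is the shape most text-classification tooling expects.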
- Load & Preprocess dataset:
https://huggingface.co/learn/nlp-course/chapter5/2
- Adapt tokenizer to this task:
https://huggingface.co/learn/nlp-course/chapter6/2
- Find the pretrained model most similar to your data. Here’s an example with GPT2 turned into a code interpreter, but I assume that you’ve got some blockchain dump in this JSON, so maybe find some larger dataset very similar to your data.
- Finally, fine-tune this pretrained model on the classification task.
- Share this tool with me.
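
The tokenizer and model steps above could be sketched roughly as follows with the `transformers` library. This is only an outline under assumptions: `gpt2` as the base checkpoint follows the example mentioned above, while the vocabulary size of 8000 and the helper names `batched` and `build_tokenizer_and_model` are made up for illustration. The heavy import sits inside the function so that the small batching helper can be used on its own.

```python
from typing import Iterator, List


def batched(texts: List[str], batch_size: int = 1000) -> Iterator[List[str]]:
    """Yield the corpus in chunks of strings for tokenizer training."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


def build_tokenizer_and_model(
    texts: List[str],
    num_labels: int,
    base: str = "gpt2",       # assumed base checkpoint
    vocab_size: int = 8000,   # assumed vocabulary size
):
    """Retrain the base tokenizer on the corpus and attach a classifier head."""
    # Imported lazily: requires `pip install transformers torch`.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    base_tokenizer = AutoTokenizer.from_pretrained(base)
    # Learn a new vocabulary from the flattened sequences.
    tokenizer = base_tokenizer.train_new_from_iterator(batched(texts), vocab_size)
    # GPT2 has no padding token by default; reuse end-of-text for padding.
    tokenizer.pad_token = tokenizer.eos_token

    model = AutoModelForSequenceClassification.from_pretrained(
        base, num_labels=num_labels
    )
    # The vocabulary changed, so the embedding matrix must match its new size.
    model.resize_token_embeddings(len(tokenizer))
    model.config.pad_token_id = tokenizer.pad_token_id
    return tokenizer, model
```

One caveat with this approach: once the tokenizer is retrained, token ids no longer line up with the pretrained embeddings, so those are effectively relearned during fine-tuning. The returned pair can then be passed to the library's `Trainer` together with the tokenized dataset for the actual fine-tuning step.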