Text classification for an unknown language

housebunting · March 7, 2024, 2:36pm

I am a complete newbie who is trying to classify text from an unknown, arbitrary language. In my particular case, this is not even a natural language but JSON files in computer-speak that I’ve flattened into linear text (i.e., the “sentences”).
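For concreteness, here is a minimal sketch of what such flattening could look like. The helper names `flatten_json` and `json_to_sentence` and the `path=value` token format are assumptions made for illustration, not the exact code used on the real data.

```python
import json


def flatten_json(obj, prefix=""):
    """Yield one 'path=value' token for every leaf of a parsed JSON value."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            # Nested keys are joined with dots: {"a": {"b": 1}} -> a.b=1
            path = f"{prefix}.{key}" if prefix else str(key)
            yield from flatten_json(value, path)
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            # List positions become bracketed indices: a[0], a[1], ...
            yield from flatten_json(value, f"{prefix}[{index}]")
    else:
        # Leaves are re-serialized so strings, numbers, booleans and null stay distinguishable.
        yield f"{prefix}={json.dumps(obj)}"


def json_to_sentence(raw):
    """Turn one JSON document (as a string) into a single space-separated 'sentence'."""
    return " ".join(flatten_json(json.loads(raw)))


example = '{"user": {"id": 7, "tags": ["x", "y"]}, "active": true}'
print(json_to_sentence(example))
# user.id=7 user.tags[0]="x" user.tags[1]="y" active=true
```

Keeping the key path in every token is one possible design choice: it preserves the structure of the JSON in the linear text, at the cost of longer sequences.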

What is the best approach to this? I thought of using a preexisting text classifier (e.g., “sentiment” classifier), but as far as I know all of those are trained on natural languages, which may be a far cry from JSON files.

callmyname · March 7, 2024, 4:40pm

Hi! In a JSON object, it’s possible to place even an encoded picture as a string. So, if you can, provide more information about the data you want to work with. It will be much easier to give you a precise answer.

Have a nice … !

housebunting · March 7, 2024, 6:35pm

I’d actually leave this more open-ended on purpose. Although I do happen to use JSON, what I meant to say is that my data looks as follows:

categorical_label, feature_1, …, feature_N,

where feature_k is any arbitrary sequence (be it JSON, the string of bits from an image, etc.). My point is that these sequences are not natural language, so a basic off-the-shelf sentence classifier wouldn’t work.

callmyname · March 7, 2024, 10:20pm

Load and preprocess the dataset:

https://huggingface.co/learn/nlp-course/chapter5/2

Adapt the tokenizer to this task:

https://huggingface.co/learn/nlp-course/chapter6/2

Find the pretrained model that is most similar to your data. Here’s an example that adapts GPT-2 to code, but I assume that you’ve got some blockchain dump in this JSON, so maybe look for a larger dataset that is very similar to your data.

Finally, fine-tune this pretrained model on the classification task.
Share the tool with me.
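
To make the tokenizer step more concrete: adapting a tokenizer essentially means learning the vocabulary from your own sequences instead of from natural language. The linked course chapter does this with the library's own tokenizer training; the block below is only a dependency-free toy sketch of BPE-style pair merging, not the Hugging Face API. The names `learn_merges`, `apply_merge`, and `tokenize`, and the tiny corpus, are made up for illustration.

```python
from collections import Counter


def apply_merge(seq, pair):
    """Replace every adjacent occurrence of `pair` in `seq` with one merged token."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def learn_merges(corpus, num_merges):
    """Learn up to `num_merges` pair merges from a list of strings."""
    # Start from single characters so every symbol in the data is representable.
    sequences = [list(text) for text in corpus]
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs over the whole corpus.
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        if pair_counts[best] < 2:
            break  # nothing repeats any more, so further merges add no value
        merges.append(best)
        sequences = [apply_merge(seq, best) for seq in sequences]
    return merges


def tokenize(text, merges):
    """Tokenize `text` by replaying the learned merges in order."""
    seq = list(text)
    for pair in merges:
        seq = apply_merge(seq, pair)
    return seq


# Hypothetical flattened "sentences"; real data would be far larger.
corpus = ["id=1 ok=true", "id=2 ok=true", "id=3 ok=false"]
merges = learn_merges(corpus, num_merges=10)
print(tokenize("id=4 ok=true", merges))
```

Frequent fragments of the data end up as single tokens, which is the effect you want before fine-tuning a pretrained model on sequences that are not natural language.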
