Binary classification of text files in two directories

I am trying to do transfer learning on GPT-Neo to distinguish scam websites from normal websites based on their content, and I am completely confused about what I should do next. I have already scraped the websites' content and parsed it with bs4. Now only the website text is stored, as txt files, in separate directories. My directory structure looks like this: the two root folders are the two classes (“Scam” and “Normal”); inside each class there are subdirectories named after each website’s URL, and each of those contains the parsed HTML page as a txt file.

Scam/
    Website1/
        content.txt
    Website2/
        content.txt
    ...
Normal/
    Website1/
        content.txt
    Website2/
        content.txt
    ...

I have read a lot of documentation but I am not sure what to do next. Do I extract the text from each file, attach a 0/1 label, and build one big csv? What’s next? Tokenize the text column of the csv and feed it to the input layer of the transformer? I would appreciate any advice!

Yes, that’s a good approach. You can indeed build a single dataset with two columns: text and label.
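For that labeling step, here is a minimal sketch of how you could build such a csv from the folder layout above. It assumes the script runs from the directory containing Scam/ and Normal/, that every website subfolder holds a content.txt, and the output name my_file.csv is just a placeholder:

import csv
from pathlib import Path

label_map = {"Normal": 0, "Scam": 1}  # assumed label encoding

with open("my_file.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "label"])
    for class_dir, label in label_map.items():
        # Each website folder holds one parsed page: <class>/<website>/content.txt
        for txt_path in Path(class_dir).glob("*/content.txt"):
            text = txt_path.read_text(encoding="utf-8", errors="ignore").strip()
            if text:
                writer.writerow([text, label])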

Next, there are several options:

  • Either you create your dataset as a csv file, and then you turn it into a HuggingFace Dataset object, as follows:
from datasets import load_dataset
dataset = load_dataset('csv', data_files='my_file.csv')

… or if you have multiple files:

dataset = load_dataset('csv', data_files=['my_file_1.csv', 'my_file_2.csv', 'my_file_3.csv'])

… or if you already want to determine which ones are for training, which ones for testing:

dataset = load_dataset('csv', data_files={'train': ['my_train_file_1.csv', 'my_train_file_2.csv'], 'test': 'my_test_file.csv'})

The benefit of HuggingFace Datasets is that it lets you tokenize the entire dataset and prepare it for the model in one go, using the .map(function) functionality (see the tokenization sketch after this list).

  • Alternatively, you can implement a classic PyTorch dataset with the __getitem__ method. Each dataset item should then return the input_ids, attention_mask and label (see the sketch at the end).
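As an illustration of the .map approach, here is a rough sketch of tokenizing the loaded dataset. The checkpoint name EleutherAI/gpt-neo-125M and the max_length of 512 are assumptions; substitute the GPT-Neo variant and sequence length you actually use:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M")
tokenizer.pad_token = tokenizer.eos_token  # GPT-Neo has no padding token by default

def tokenize(batch):
    # Truncate/pad every page to a fixed length so examples can be batched
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

# batched=True passes chunks of examples to tokenize(), which is much faster
tokenized_dataset = dataset.map(tokenize, batched=True)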
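And for the second option, a bare-bones sketch of such a PyTorch dataset. The class name WebsiteDataset is made up, texts and labels are assumed to be plain Python lists, and tokenizer is the same GPT-Neo tokenizer as above:

import torch
from torch.utils.data import Dataset

class WebsiteDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=512):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Tokenize one page and return exactly the keys the model expects
        encoding = self.tokenizer(
            self.texts[idx],
            truncation=True,
            padding="max_length",
            max_length=self.max_length,
            return_tensors="pt",
        )
        return {
            "input_ids": encoding["input_ids"].squeeze(0),
            "attention_mask": encoding["attention_mask"].squeeze(0),
            "label": torch.tensor(self.labels[idx], dtype=torch.long),
        }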