Binary classification of text files in two directories

I am trying to do transfer learning on GPT-Neo to distinguish scam websites from normal websites based on their content, and I am completely confused about what I should do next. I have already scraped the websites' content and parsed it with bs4. Now only the website text is stored, as txt files, in separate directories. My directory structure looks like this: the two root folders are the two classes (“Scam” and “Normal”); inside each class there are subdirectories named after each website’s URL, and each of those contains the parsed HTML page as a txt file.

Scam/
    Website1/
        content.txt
    Website2/
        content.txt
    ...
Normal/
    Website1/
        content.txt
    Website2/
        content.txt
    ...

I have read a lot of documentation but I am not sure what to do next. Do I extract the text from each file, attach a 0/1 label, and build one big csv? What’s next? Tokenize the text column of the csv and feed it to the input layer of the transformer? I would appreciate any advice!

Yes, that’s a good approach. You can indeed build a single dataset with two columns: text and label.
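For that labeling step, here is a minimal sketch of how you could build such a csv from the folder layout above. It assumes the script runs from the directory containing Scam/ and Normal/, that every website subfolder holds a content.txt, and the output name my_file.csv is just a placeholder:

import csv
from pathlib import Path

label_map = {"Normal": 0, "Scam": 1}  # assumed label encoding

with open("my_file.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "label"])
    for class_dir, label in label_map.items():
        # Each website folder holds one parsed page: <class>/<website>/content.txt
        for txt_path in Path(class_dir).glob("*/content.txt"):
            text = txt_path.read_text(encoding="utf-8", errors="ignore").strip()
            if text:
                writer.writerow([text, label])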

Next, there are several options:

  • Either you create your dataset as a csv file, and then you turn it into a HuggingFace Dataset object, as follows:
from datasets import load_dataset
dataset = load_dataset('csv', data_files='my_file.csv')

… or if you have multiple files:

dataset = load_dataset('csv', data_files=['my_file_1.csv', 'my_file_2.csv', 'my_file_3.csv'])

… or if you already want to determine which ones are for training, which ones for testing:

dataset = load_dataset('csv', data_files={'train': ['my_train_file_1.csv', 'my_train_file_2.csv'], 'test': 'my_test_file.csv'})

The benefit of HuggingFace Datasets is that it lets you tokenize the entire dataset and prepare it for the model in one go, using the .map(function) functionality (see the tokenization sketch after this list).

  • Alternatively, you can implement a classic PyTorch dataset with the __getitem__ method. Each dataset item should then return the input_ids, attention_mask and label (see the sketch at the end).
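As an illustration of the .map approach, here is a rough sketch of tokenizing the loaded dataset. The checkpoint name EleutherAI/gpt-neo-125M and the max_length of 512 are assumptions; substitute the GPT-Neo variant and sequence length you actually use:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M")
tokenizer.pad_token = tokenizer.eos_token  # GPT-Neo has no padding token by default

def tokenize(batch):
    # Truncate/pad every page to a fixed length so examples can be batched
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

# batched=True passes chunks of examples to tokenize(), which is much faster
tokenized_dataset = dataset.map(tokenize, batched=True)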
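And for the second option, a bare-bones sketch of such a PyTorch dataset. The class name WebsiteDataset is made up, texts and labels are assumed to be plain Python lists, and tokenizer is the same GPT-Neo tokenizer as above:

import torch
from torch.utils.data import Dataset

class WebsiteDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=512):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Tokenize one page and return exactly the keys the model expects
        encoding = self.tokenizer(
            self.texts[idx],
            truncation=True,
            padding="max_length",
            max_length=self.max_length,
            return_tensors="pt",
        )
        return {
            "input_ids": encoding["input_ids"].squeeze(0),
            "attention_mask": encoding["attention_mask"].squeeze(0),
            "label": torch.tensor(self.labels[idx], dtype=torch.long),
        }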