I am trying to do transfer learning on GPT-Neo to distinguish scam websites from normal websites based on their content, and I am completely confused about what to do next. I have already scraped the websites' content and parsed it with bs4, so only the website text is now stored, as .txt files, in separate directories. My directory structure looks like this: the two root folders are the two classes ("Scam" and "Normal"); each class contains subdirectories named after the website's URL, and each of those holds the parsed HTML page as a .txt file.
Scam/
Website1/
content.txt
Website2/
content.txt
...
Normal/
Website1/
content.txt
Website2/
content.txt
...
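A straightforward next step would be to walk the two class folders and collect each content.txt into one labeled table. Here is a minimal sketch under that assumption (the folder names "Scam"/"Normal" and the content.txt layout come from the structure above; the function name and root path are just illustrative):

```python
import pathlib
import pandas as pd

def build_dataset(root: str) -> pd.DataFrame:
    """Collect every content.txt under root/Scam and root/Normal into a labeled DataFrame."""
    rows = []
    for class_name, label in [("Normal", 0), ("Scam", 1)]:
        for txt_path in sorted(pathlib.Path(root, class_name).glob("*/content.txt")):
            rows.append({
                "url": txt_path.parent.name,  # the subdirectory name is the website's URL
                "text": txt_path.read_text(encoding="utf-8", errors="ignore"),
                "label": label,
            })
    return pd.DataFrame(rows)

# df = build_dataset("data")
# df.to_csv("websites.csv", index=False)
```

Whether you actually write a CSV to disk or keep the DataFrame in memory doesn't matter much; the important part is ending up with one row per website carrying the text and a 0/1 label.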
I have read a lot of documentation but I am still not sure what to do next. Do I extract the text from each file, attach a 0/1 label, and build one big CSV? And then what? Tokenize the text column of the CSV and feed it to the transformer's input layer? I would appreciate any advice!