Hello.
I’m creating my first project using Hugging Face. I want to build a chatbot that specializes in firearms, and to do this I want to fine-tune GPT-2 on a CSV file containing information scraped from the internet.
The dataset comprises 2000 rows and includes details such as firearm names, manufacturers, cartridges, countries of origin, and years of production. Below is a snippet of the dataset structure:
Name | Manufacturer | Cartridges | Country | Year | Weapon Description | Manufacturer Description |
---|---|---|---|---|---|---|
9A-91 | KBP Instrument Design Bureau | 9×39mm | Soviet Union | 1994 | text… | text… |
AAC Honey Badger | Advanced Armament Corporation | .300 AAC Blackout | United States | 2011 | text… | text… |
ArmaLite AR-15 | ArmaLite, Colt | 5.56×45mm NATO, 223 Remington | United States | 1959 | text… | text… |
Desert Tech MDR | Desert Tech | .223 Remington, 5.56×45mm NATO, .223 Wylde, .308 Winchester, 7.62×51mm NATO | United States | 2014 | text… | text… |
I have a few questions regarding this dataset:
1 - Is it advisable to have multiple values within a single cell, as in the cartridges column for entries like ArmaLite AR-15 and Desert Tech MDR?
2 - How might the presence of multiple values in certain entries affect the quality of the chatbot’s responses after fine-tuning?
3 - Regarding the cartridges column, approximately 90% of the entries lack commas separating the cartridges. Would it be beneficial to add commas for consistency, or should I keep the data as is?
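To make questions 1 and 2 concrete, here is a minimal sketch of how a multi-value cell could be split and a row turned into one line of training text. It assumes the cartridges are comma-separated and that each row becomes a single sentence; `split_cartridges`, `join_items`, and `row_to_text` are hypothetical helper names, not part of any library, and the sentence template is only an illustration.

```python
def split_cartridges(cell: str) -> list[str]:
    """Split a comma-separated cell into a list of trimmed values."""
    return [part.strip() for part in cell.split(",") if part.strip()]


def join_items(items: list[str]) -> str:
    """Join values as 'a', 'a and b', or 'a, b and c'."""
    if not items:
        return ""
    if len(items) == 1:
        return items[0]
    return ", ".join(items[:-1]) + " and " + items[-1]


def row_to_text(row: dict[str, str]) -> str:
    """Render one CSV row as a plain sentence (template is an assumption)."""
    cartridges = join_items(split_cartridges(row["Cartridges"]))
    return (
        f"The {row['Name']} is made by {row['Manufacturer']} "
        f"({row['Country']}, {row['Year']}). "
        f"It is chambered in {cartridges}."
    )


# Example row copied from the snippet above (description columns omitted).
example = {
    "Name": "ArmaLite AR-15",
    "Manufacturer": "ArmaLite, Colt",
    "Cartridges": "5.56×45mm NATO, 223 Remington",
    "Country": "United States",
    "Year": "1959",
}
print(row_to_text(example))
```

With this kind of preprocessing, whether the cell uses commas mainly matters for the splitting step, so a consistent separator would make that step reliable.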
Any insights or advice on these questions would be greatly appreciated. Thank you for your time and assistance!