I have a data set of 25000 engineering projects, each labelled with one of about 5500 industry names. Here is a sample of 10 tuples from this data set:
[('Social Feasibility Strategy for the SSMA Creeks Program',
'Urban development and housing'),
('Banjul Sewerage and Drainage', 'Water supply and sanitation'),
('Wirgane Dam Project', 'Water and sanitation'),
('Nis Residental Energy Efficiency',
'Municipal and environmental infrastructure'),
('Northern Smallholder Livestock Commercialization Project: Rural Financial Services Programme',
'Credit and financial services'),
('Climate Change and Disaster-Resilient Water Resources Sector Project',
'Agriculture, natural resources and rural development'),
('Banque Rwandaise de Developpement Project (01)',
'(historic)ficial sector development'),
('Cashew & Coconut Treecrops Project',
'General fice sector; agricultural extension, research, and other support activities; crops'),
('Employment & Training System Project', 'Education'),
('Action Plan for C and D Countries', 'Other')]
As you can see, the first element of each tuple contains the project name, and the second is a “basic” sector description that staff have assigned on a “best effort” basis, without much deduplication. Hence the large number of distinct classifications (about 5500).
I need to map all of this onto a much narrower set of industry categories, defined by the GICS classification. This has 17 top-level sectors, subdivided progressively into 65, then 137, then 255 increasingly detailed sectors.
Here are 6 examples:
('Real Estate',
'Equity Real Estate Investment Trusts (REITs)',
'Residential REITs',
'Residential REITs'),
('Real Estate',
'Real Estate',
'Real Estate Management & Development',
'Retail REITs'),
('Industrials',
'Food, Beverage & Tobacco',
'Food Products',
'Agricultural Products'),
('Information Technology',
'Semiconductors & Semiconductor Equipment',
'Semiconductors & Semiconductor Equipment',
'Semiconductor Equipment'),
('Financials',
'Diversified Financials',
'Diversified Financial Services',
'Financial Exchanges & Data'),
('Financials',
'Diversified Financials',
'Thrifts & Mortgage Finance',
'Thrifts & Mortgage Finance')
I want to use as much information as possible from each of these category tuples to get a model to classify my (25000_project_names, 5500_sectors) tuples. Which Hugging Face model is most appropriate for this? And how do I use all the information in each input tuple and each GICS tuple?
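To make the question concrete, here is a minimal sketch of one possible setup, not a recommendation of a specific model. Each input tuple and each GICS tuple is flattened into a single string so that every field contributes, and each project is then matched to the most similar GICS path. The helper names (`project_text`, `gics_text`, `bag_of_words`, `classify`) are hypothetical, and the bag-of-words "embedding" is only a stand-in so the example runs without downloads. In practice `embed` and `sim` would presumably be replaced by a sentence-embedding model from the Hugging Face Hub and a matching similarity function.

```python
import math
import re
from collections import Counter


def project_text(name, sector):
    """Combine both fields of an input tuple into one string."""
    return f"{name}. {sector}"


def gics_text(path):
    """Join all GICS levels into one string, skipping repeated level names."""
    seen = []
    for level in path:
        if level not in seen:
            seen.append(level)
    return " > ".join(seen)


def bag_of_words(text):
    """Stand-in embedding (assumption): lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(projects, gics_paths, embed=bag_of_words, sim=cosine):
    """Assign each (name, sector) tuple the most similar GICS path."""
    label_vecs = [(path, embed(gics_text(path))) for path in gics_paths]
    results = []
    for name, sector in projects:
        vec = embed(project_text(name, sector))
        best_path, _ = max(label_vecs, key=lambda pv: sim(vec, pv[1]))
        results.append((name, best_path))
    return results


if __name__ == "__main__":
    projects = [
        ("Northern Smallholder Livestock Commercialization Project: "
         "Rural Financial Services Programme",
         "Credit and financial services"),
    ]
    gics_paths = [
        ("Financials", "Diversified Financials",
         "Diversified Financial Services", "Financial Exchanges & Data"),
        ("Information Technology",
         "Semiconductors & Semiconductor Equipment",
         "Semiconductors & Semiconductor Equipment",
         "Semiconductor Equipment"),
    ]
    for name, path in classify(projects, gics_paths):
        print(name, "->", gics_text(path))
```

Keeping `embed` and `sim` as parameters means the same pipeline could be rerun with different models, and the 5500 distinct sector strings would only need to be embedded once each rather than once per project.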