First off, I’m very new to data science, so go easy on me.
Say I have a dataset with a column called “name”. Each value in that column is made up of several strings concatenated with a delimiter (underscore). Each of the values that split out “should” come from a master data picklist.
So for example, let’s say we have cars.
HONDA_CIVIC_2014_BLUE
TESLA_MODEL3_2023_WHITE
The schema would be something like MAKE_MODEL_YEAR_COLOUR.
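When a name follows the schema, pulling the attributes out is just a split on the delimiter. A minimal sketch (the `SCHEMA` list and the `split_name` helper are names made up here for illustration):

```python
# Attribute order, taken from the schema described above.
SCHEMA = ["MAKE", "MODEL", "YEAR", "COLOUR"]

def split_name(name):
    """Split on the underscore and pair each piece with its schema position."""
    return dict(zip(SCHEMA, name.split("_")))

print(split_name("HONDA_CIVIC_2014_BLUE"))
# {'MAKE': 'HONDA', 'MODEL': 'CIVIC', 'YEAR': '2014', 'COLOUR': 'BLUE'}
```

This only works when every name has the pieces in the expected order.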
Now imagine the schema is not applied consistently, and the master data values are not always adhered to. So in my data I could have:
TESLE_2012_MODEL7_BLACK
I’m trying to take the dataset, identify which attribute each piece belongs to, parse out the proper values, and ideally attach some kind of “confidence” measure to each attribute/value. The goal would be to show these values to a human so they can correct any discrepancies.
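To make the desired output concrete, here is a rough sketch of the kind of result described above. It is not machine learning: it uses plain string similarity from Python's standard `difflib` module as a stand-in "confidence", and it ignores position so that out-of-order pieces can still be assigned. The `PICKLISTS` contents, the year pattern, and the function names are all assumptions made up for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical master data picklists; real ones would come from the master data.
PICKLISTS = {
    "MAKE": ["HONDA", "TESLA", "TOYOTA"],
    "MODEL": ["CIVIC", "MODEL3", "ACCORD"],
    "COLOUR": ["BLUE", "WHITE", "BLACK"],
}

def best_match(token):
    """Return (attribute, picklist value, similarity) for the closest entry."""
    # Assumption: a year is any four-digit number starting with 19 or 20.
    if re.fullmatch(r"(19|20)\d{2}", token):
        return ("YEAR", token, 1.0)
    best = (None, None, 0.0)
    for attr, values in PICKLISTS.items():
        for value in values:
            score = SequenceMatcher(None, token, value).ratio()
            if score > best[2]:
                best = (attr, value, score)
    return best

def parse_name(name):
    """Assign each piece to its most similar attribute, regardless of position."""
    result = {}
    for token in name.split("_"):
        attr, value, score = best_match(token)
        if attr is None:
            continue  # nothing in any picklist resembles this piece
        # If two pieces claim the same attribute, keep the more similar one.
        if attr not in result or score > result[attr]["confidence"]:
            result[attr] = {"raw": token, "value": value,
                            "confidence": round(score, 2)}
    return result

for attr, info in parse_name("TESLE_2012_MODEL7_BLACK").items():
    print(attr, info)
```

On the messy example, the misspelled make and the unknown model get a suggested picklist value with a similarity below 1.0, which is the sort of row a human reviewer would be asked to confirm.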
Is there something in the machine learning realm to help me build this out? I’m starting from a very basic level of knowledge here, but any pointers in the right direction would be great. For example, is this a KNN opportunity?
Thanks for helping a complete newbie!