Can I use ML for this?

First off, I’m very new to data science so go easy on me.

Picture that I have a dataset with a column called “name”. Each “name” value is made up of several strings concatenated with a delimiter (underscore). Each of the values you get when you split on the delimiter “should” come from a master data picklist.

So for example, let’s say we have cars.

HONDA_CIVIC_2014_BLUE

TESLA_MODEL3_2023_WHITE

The schema would be something like MAKE_MODEL_YEAR_COLOUR.

Now imagine the schema is not applied consistently, and the master data values are not always adhered to either. So in my data I could have

TESLE_2012_MODEL7_BLACK

I’m trying to take the data set, identify the attributes, parse out the values properly, and maybe get some kind of “confidence” measure for each attribute/value. The goal would be to show these values to a human so they can correct any discrepancies.

Is there something in the machine learning realm to help me build this out? I’m starting from a very basic level of knowledge here, but any pointers in the right direction would be great. For example, is this a KNN opportunity?

Thanks for helping a complete newbie!

What approaches have you tried besides machine learning so far? If you have a list of the possible words for each entity, e.g. {"color": ["black", "silver", "blue", …], "year": range(1950, 2024), …}, then maybe you could try fuzzy matching each piece of the delimited string against the possible makes/models/years/colors. The edit distance between the input piece and the closest reference string would serve as a confidence metric: the smaller the distance, the more confident the match.
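Here is a rough sketch of what that fuzzy matching could look like in Python, just to make the idea concrete. A few assumptions on my part: the picklists below are made up (yours would come from your master data), the function names are hypothetical, and I'm using the similarity ratio from the standard library's difflib as a stand-in for a true edit distance. Each piece is scored against every picklist, so pieces that appear in the wrong position can still land on the right attribute.

```python
from difflib import SequenceMatcher

# Hypothetical picklists; the real ones would come from the master data.
PICKLISTS = {
    "make": ["HONDA", "TESLA", "TOYOTA"],
    "model": ["CIVIC", "MODEL3", "COROLLA"],
    "year": [str(y) for y in range(1950, 2024)],
    "colour": ["BLACK", "BLUE", "WHITE", "SILVER"],
}


def best_match(token, candidates):
    """Return (closest candidate, similarity between 0 and 1)."""
    scored = [(SequenceMatcher(None, token, c).ratio(), c) for c in candidates]
    score, value = max(scored)
    return value, score


def parse_name(name):
    """Assign each underscore-delimited piece to its best-scoring attribute."""
    tokens = name.upper().split("_")
    # Score every (piece, attribute) pair, ignoring the position of the piece.
    pairs = []
    for i, token in enumerate(tokens):
        for attr, candidates in PICKLISTS.items():
            value, score = best_match(token, candidates)
            pairs.append((score, i, attr, value))
    # Greedy assignment: take the strongest pairs first, and use each
    # piece and each attribute at most once.
    pairs.sort(key=lambda p: p[0], reverse=True)
    result, used_tokens = {}, set()
    for score, i, attr, value in pairs:
        if attr in result or i in used_tokens:
            continue
        result[attr] = {
            "raw": tokens[i],
            "value": value,
            "confidence": round(score, 2),
        }
        used_tokens.add(i)
    return result


if __name__ == "__main__":
    for attr, match in parse_name("TESLE_2012_MODEL7_BLACK").items():
        print(attr, match)
```

Anything with a confidence below some threshold you choose could then be flagged for the human reviewer. The greedy assignment is the simplest thing that works for a sketch; it is not guaranteed to find the best overall assignment when several pieces are ambiguous.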

The most relevant topic in machine learning to your request would probably be named entity recognition, but I’d avoid using ML if there are simpler heuristics available.