Can I use ML for this?

garnerp · March 10, 2024, 1:46am

First off, I’m very new to data science so go easy on me.

Picture that I have a dataset with a column called “name”. The “name” column is made up of several strings that are concatenated together via a delimiter (underscore). Each of the values you get by splitting on that delimiter “should” come from master data picklists.

So for example, let’s say we have cars.

HONDA_CIVIC_2014_BLUE

TESLA_MODEL3_2023_WHITE

The schema would be something like MAKE_MODEL_YEAR_COLOUR.

Now imagine the schema is not applied consistently, and the master data values are not always adhered to either. So in my data I could have:

TESLE_2012_MODEL7_BLACK

I’m trying to take the data set, identify the attributes, and parse out the values properly, ideally with some kind of “confidence” measure for each attribute/value. The goal would be to show these values to a human so they can correct any discrepancies.

Is there something in the machine learning realm to help me build this out? I’m starting from a very basic level of knowledge here, but any pointers in the right direction would be great. For example, is this a KNN opportunity?

Thanks for helping a complete newbie!

davidhaas6 · March 10, 2024, 3:05am

What approaches have you tried besides machine learning so far? If you have a list of the possible words for each entity, e.g. {"color": ['black', 'silver', 'blue', ...], "year": range(1950, 2024), ...}, then maybe you could try fuzzy matching each piece of the delimited string to the possible makes/models/years/colors that string could be. The edit distance between the input string and the closest reference string would serve as a confidence metric.
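A minimal sketch of that fuzzy-matching idea, using only the Python standard library. The picklist contents and the function names `best_match` and `parse_name` are made up for illustration, and the score is the similarity ratio from `difflib.SequenceMatcher` (0 to 1, higher is closer) rather than a true edit distance, so treat it as a rough stand-in for the confidence metric:

```python
from difflib import SequenceMatcher

# Hypothetical master data picklists; replace with your real ones.
PICKLISTS = {
    "make": ["HONDA", "TESLA", "TOYOTA"],
    "model": ["CIVIC", "MODEL3", "COROLLA"],
    "year": [str(y) for y in range(1950, 2024)],
    "colour": ["BLACK", "BLUE", "WHITE", "SILVER"],
}


def best_match(token):
    """Return (attribute, value, score) for the closest picklist entry."""
    best = (None, None, 0.0)
    for attribute, values in PICKLISTS.items():
        for value in values:
            # ratio() is a similarity in [0, 1]; 1.0 means identical strings.
            score = SequenceMatcher(None, token.upper(), value).ratio()
            if score > best[2]:
                best = (attribute, value, score)
    return best


def parse_name(name):
    """Split on underscores and match each token on its own,
    so the result does not depend on the order of the pieces."""
    return [(token, *best_match(token)) for token in name.split("_")]


if __name__ == "__main__":
    for token, attribute, value, score in parse_name("TESLE_2012_MODEL7_BLACK"):
        print(f"{token:8} -> {attribute:7} {value:8} score={score:.2f}")
```

Each token is matched independently of its position, which is what lets the out-of-order example still resolve to a make, year, model and colour. Low scores, or two tokens landing on the same attribute, are the rows you would flag for a human to review.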

The most relevant topic in machine learning to your request would probably be named entity recognition, but I’d avoid using ML if there are simpler heuristics available.
