I’m working on building a system to predict column headers for CSV files based on the content of each column. The labels to find are First Name, Last Name, Company, Address 1, Address 2, City, State, and ZIP.
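For concreteness, here’s a made-up example of the kind of row I’m dealing with (all values are hypothetical), along with the labels I’d want predicted for each column:

```python
# A hypothetical input row; in practice the columns may appear in any order.
row = ["John", "Smith", "Acme Corp", "1234 Main St", "Suite 200",
       "San Francisco", "CA", "94105"]
# Desired labels, one per column:
# First Name, Last Name, Company, Address 1, Address 2, City, State, ZIP
```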
I’ve successfully fine-tuned a variant of BERT on my labeled training data, but still get poor predictions from it. I’ve also tried the Python package usaddress, which performs a similar task, but it was also unreliable.
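For anyone unfamiliar with usaddress: it tags tokens inside a single address string rather than classifying whole fields, roughly like this (a minimal sketch; the exact output labels depend on the input):

```python
import usaddress

# parse() splits a string into (token, label) pairs.
print(usaddress.parse("1234 Main St"))
# e.g. [('1234', 'AddressNumber'), ('Main', 'StreetName'), ('St', 'StreetNamePostType')]

# tag() groups consecutive tokens into components and guesses an address type.
components, address_type = usaddress.tag("1234 Main St, San Francisco, CA 94105")
print(address_type)  # e.g. 'Street Address'
print(components)    # OrderedDict mapping component names to text
```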
There are two aspects of this task that I believe explain why these don’t perform well for me. First, the CSV files I’ll be running predictions on won’t have a reliable column ordering; the columns can show up in any order. Second, these models rely heavily on context to make good predictions.
So if I pass an entire row from the CSV for a prediction and the ordering is scrambled, I get bad predictions because the disordered context confuses the model. The other option I found was making predictions on each individual field, i.e. asking it to predict a label for just a string like “1234 Main St” or “San Francisco” without including any other text from the row, but the models still don’t perform well on this: now they have no context at all, instead of incorrectly ordered context.
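To make the per-field option concrete, this is roughly what I mean (a minimal sketch assuming a fine-tuned sequence-classification checkpoint; the model path is a placeholder):

```python
from transformers import pipeline

# Placeholder path to my fine-tuned BERT variant; its labels would be the
# eight column headers listed above.
classifier = pipeline("text-classification", model="./my-finetuned-bert")

# Each cell is classified completely in isolation, so the model has no
# surrounding row context to attend to.
for field in ["1234 Main St", "San Francisco"]:
    print(field, "->", classifier(field))
```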
It seems that these types of models are overly complex for my task, and that the attention to context that normally helps them make better predictions is exactly what’s hurting mine: I’m either depriving the model of context or giving it disordered context.
I’m just looking for some insight into how to approach this kind of task, and/or whether there are better model choices than transformers, since those all rely on context. I’m still pretty new to machine learning, so apologies for any inaccuracies or if I didn’t explain this clearly enough.