I have a PyTorch model (a pretrained Hugging Face DistilBERT) fine-tuned on my own transaction data to classify transactions into one of 5 classes. I currently extract the embeddings, append some other extracted features (each document has a time and date as well as a brief description), and finally run the result through a final classification layer.
However, there is a lot of valuable information to be gained by clustering transactions per customer based on their descriptions (rather than treating them as isolated transactions), such as time between transactions and other descriptive statistics. I'm trying to work out how to approach this, given that the model currently classifies record by record, and how it should fit into the model architecture.
One thought I have is to represent each cluster with its mean embedding plus descriptive statistics, and then train a classifier at that level instead. That is, I would:
- get transformer embeddings for each transaction
- cluster similar transaction descriptions together in a post-processing step
- derive descriptive statistics for each cluster
- classify at the cluster level
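For concreteness, here is a rough sketch of those steps (random vectors stand in for the DistilBERT embeddings, and the cluster count, stats, and labels are just illustrative dummies):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for per-transaction DistilBERT embeddings (n_transactions x dim);
# in the real pipeline these come from the fine-tuned encoder.
n_tx, dim, n_clusters = 200, 32, 10
embeddings = rng.normal(size=(n_tx, dim))
# Stand-in extra feature: a timestamp (in seconds) per transaction
timestamps = np.sort(rng.integers(0, 1_000_000, size=n_tx))

# Step 2: cluster similar transactions (k chosen arbitrarily here)
cluster_ids = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embeddings)

# Step 3: per-cluster features = mean embedding + descriptive statistics
cluster_features = []
for c in range(n_clusters):
    mask = cluster_ids == c
    mean_emb = embeddings[mask].mean(axis=0)
    ts = np.sort(timestamps[mask])
    gaps = np.diff(ts) if ts.size > 1 else np.array([0.0])
    stats = [mask.sum(), gaps.mean(), gaps.std()]  # count, mean/std time between transactions
    cluster_features.append(np.concatenate([mean_emb, stats]))
cluster_features = np.vstack(cluster_features)

# Step 4: classify at the cluster level (dummy labels covering the 5 classes)
cluster_labels = np.arange(n_clusters) % 5
clf = LogisticRegression(max_iter=1000).fit(cluster_features, cluster_labels)
print(cluster_features.shape)  # (10, 35): 32-dim mean embedding + 3 stats
```

The point is just that steps 2-4 operate on plain arrays after the encoder runs, which is why they currently feel like a separate post-processing model.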
My main question is: do I have to split this into two models (an embedding-extraction model, then a post-processing clustering step, then a classifier model), or could I somehow achieve this in a single model? Any thoughts/input are welcome. Thanks!
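To show roughly what I mean by "a single model", here is a hypothetical PyTorch sketch where the grouping (e.g. by customer) is known up front, the per-transaction representations are mean-pooled per group inside the forward pass, and the group-level stats are concatenated before the classification head, so everything trains end to end. A linear layer stands in for the DistilBERT encoder, and all names/shapes are illustrative:

```python
import torch
import torch.nn as nn

class GroupClassifier(nn.Module):
    def __init__(self, emb_dim=32, stat_dim=3, n_classes=5):
        super().__init__()
        # Stand-in for the DistilBERT encoder in the real model
        self.encoder = nn.Linear(emb_dim, emb_dim)
        self.head = nn.Linear(emb_dim + stat_dim, n_classes)

    def forward(self, tx_emb, group_ids, group_stats):
        # tx_emb: (n_tx, emb_dim), group_ids: (n_tx,) in [0, n_groups),
        # group_stats: (n_groups, stat_dim) precomputed descriptive statistics
        h = self.encoder(tx_emb)
        n_groups = group_stats.shape[0]
        # Differentiable mean-pooling of transaction representations per group
        sums = torch.zeros(n_groups, h.shape[1]).index_add_(0, group_ids, h)
        counts = torch.zeros(n_groups).index_add_(0, group_ids, torch.ones(len(group_ids)))
        pooled = sums / counts.clamp(min=1).unsqueeze(1)
        # Classify each group from pooled embedding + its stats
        return self.head(torch.cat([pooled, group_stats], dim=1))

model = GroupClassifier()
logits = model(torch.randn(200, 32), torch.randint(0, 10, (200,)), torch.randn(10, 3))
print(logits.shape)  # torch.Size([10, 5])
```

The catch is that this assumes a fixed grouping key rather than learned clusters; if the clustering itself has to be discovered from the embeddings, that step is not naturally differentiable, which is what pushes me toward the two-stage design.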