So I dove in pretty deep and made many edits to the pipeline to work out the core issue and get something that worked. Afterwards I took a step back and wrote something simpler that just wraps the output of calls to the pipeline (see below).
Notes:
- Set `aggregation_strategy=None` in the pipeline
- `x` is the output of the Hugging Face token classification pipeline
- `desired_tokens` is the list of tokens that you want to aggregate to, e.g. in my example above `["Price:", " $4,290,000", "Number"…]`
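To make the two inputs concrete, here is a rough sketch of their shapes. The entity names, scores and offsets are invented for illustration and are not real model output, and `covers_same_text` is a hypothetical helper, not part of transformers. It checks the one precondition that string-based aggregation relies on: the sub-tokens, with any leading `##` stripped, must concatenate to exactly the same string as the desired tokens. The comparison is exact, so a desired token carrying extra whitespace would not match.

```python
# Illustrative stand-in for pipeline output with aggregation_strategy=None,
# for the text "Price: $4,290,000". All values are made up for this example.
x = [
    {"entity": "B-KEY", "score": 0.99, "index": 1, "word": "Price", "start": 0, "end": 5},
    {"entity": "I-KEY", "score": 0.98, "index": 2, "word": ":", "start": 5, "end": 6},
    {"entity": "B-VALUE", "score": 0.97, "index": 3, "word": "$", "start": 7, "end": 8},
    {"entity": "I-VALUE", "score": 0.96, "index": 4, "word": "4", "start": 8, "end": 9},
    {"entity": "I-VALUE", "score": 0.96, "index": 5, "word": ",", "start": 9, "end": 10},
    {"entity": "I-VALUE", "score": 0.95, "index": 6, "word": "290", "start": 10, "end": 13},
    {"entity": "I-VALUE", "score": 0.95, "index": 7, "word": ",", "start": 13, "end": 14},
    {"entity": "I-VALUE", "score": 0.94, "index": 8, "word": "000", "start": 14, "end": 17},
]

# The words we want back, written without surrounding whitespace.
desired_tokens = ["Price:", "$4,290,000"]


def strip_marker(word):
    """Remove a leading '##' WordPiece continuation marker, if present."""
    return word[2:] if word.startswith("##") else word


def covers_same_text(x, desired_tokens):
    """Return True if the sub-tokens spell out exactly the desired tokens."""
    pieces = "".join(strip_marker(tok["word"]) for tok in x)
    targets = "".join(desired_tokens)
    return pieces == targets


print(covers_same_text(x, desired_tokens))  # True
```

If this check fails, string matching against `desired_tokens` cannot line up with the pipeline output, so it is a cheap thing to assert before aggregating.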
I have learnt a lot by diving in, but I am still confused about why `is_split_into_words` in the tokenizer doesn't behave the way I thought it would. Even though I explicitly pass `$4,290,000` as its own word, the tokenizer still splits it up into its components, but without the `##` prefix that would mark them as sub-tokens to be re-aggregated.
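def aggregate(x, desired_tokens):
    """Merge sub-token predictions in x into the words listed in desired_tokens."""
    joined_word = ''
    joined_group = []
    new_x = []
    for i in x:
        # Drop the leading '##' WordPiece marker before concatenating
        word = i['word']
        joined_word += word[2:] if word.startswith('##') else word
        joined_group.append(i)
        # Emit a merged entry once the pieces spell out the next desired token
        if desired_tokens and joined_word == desired_tokens[0]:
            first = joined_group[0]
            # entity, label and score are taken from the first sub-token;
            # .get is used because 'label' may be absent from the pipeline output
            new_i = {'entity': first['entity'], 'label': first.get('label'), 'word': joined_word,
                     'start': first['start'], 'end': joined_group[-1]['end'], 'score': first['score']}
            new_x.append(new_i)
            joined_word = ''
            joined_group = []
            desired_tokens = desired_tokens[1:]
    return new_x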
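On the `is_split_into_words` question, one possible explanation, offered as an assumption rather than something confirmed in this thread: the `##` prefix only marks WordPiece continuations inside a single pre-token, and BERT-style pre-tokenizers still split on punctuation even when the input is already split into words. That would explain why `$4,290,000` comes out as `$`, `4`, `,`, `290`, `,`, `000` with no `##`. Fast tokenizers do track which input word each token came from, exposed through `word_ids()` on the encoding. The sketch below assumes such a list is already available (the `word_ids` values here are written by hand, and `group_by_word_id` is a hypothetical helper) and shows how grouping on it could replace string matching.

```python
from itertools import groupby


def group_by_word_id(tokens, word_ids):
    """Merge consecutive tokens that share a word id.

    A word id of None marks special tokens such as [CLS] and [SEP],
    which are skipped.
    """
    merged = []
    pairs = zip(tokens, word_ids)
    for word_id, group in groupby(pairs, key=lambda pair: pair[1]):
        if word_id is None:
            continue
        # Strip the '##' continuation marker where present, then concatenate
        pieces = [tok[2:] if tok.startswith("##") else tok for tok, _ in group]
        merged.append("".join(pieces))
    return merged


# Hand-written example: tokens for the pre-split words ["Price:", "$4,290,000"]
tokens = ["[CLS]", "Price", ":", "$", "4", ",", "290", ",", "000", "[SEP]"]
word_ids = [None, 0, 0, 1, 1, 1, 1, 1, 1, None]

print(group_by_word_id(tokens, word_ids))  # ['Price:', '$4,290,000']
```

The same grouping key could be used to merge the prediction dicts instead of the bare token strings, which would remove the need to pass `desired_tokens` at all.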