BERT for NER outputs only '0'

Hi everyone! I’d really appreciate your help with an issue I’m having with BERT for NER in a highly specialized domain. I’m using BERT for token classification and used much of the format from here (Named entity recognition with Bert) as inspiration.

I had to generate my own B/I labeling for my dataset, which looks fine - some samples are below. Texts were also often longer than BERT’s 512-token maximum, so I had to split a given text into smaller chunks, placing a [CLS] token at the start of each chunk. I didn’t add [SEP] tokens to these chunks since they’re padded out to length 512 anyway. The labels, input IDs, padding and attention masks all appear fine, and I’ve pasted some samples below.

I’ve also checked the output numerous times - my predict function is below. I had to aggregate the subword-level predictions to come up with labels at the original word level, which I do by averaging the subword logits whenever a word was split by the tokenizer (this is done by the aggregate function). Another idiosyncratic part of my predict function is that I need the indices in the original text that correspond to each token, so the first part of the function just stores those values for the original ‘words’.

What’s weird is that if I train the BERT model on a handful of data samples (for instance, <10), I’ll get output with some non-‘0’ predictions, but training over a larger set seems to result in only ‘0’ predictions. I really don’t understand what’s going on and would appreciate any help - thank you so much!

[… ‘P’, ‘##H’, ‘##Y’, ‘##SI’, ‘##CI’, ‘##AN’, ‘IN’, ‘##TE’, ‘##RP’, ‘##RE’, ‘##TA’, ‘##TI’, ‘##ON’, ‘:’, ‘N’, ‘##eg’, ‘##ative’, ‘:’, ‘Tu’, ‘##mor’, ‘cells’, ‘showing’, ‘no’]

[…‘0’, ‘0’, ‘0’, ‘0’, ‘0’, ‘0’, ‘B-HER2’, ‘B-HER2’, ‘B-HER2’, ‘B-HER2’, ‘I-HER2’, ‘I-HER2’, ‘I-HER2’, ‘I-HER2’, ‘I-HER2’]
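
To make the chunking concrete, here is roughly the idea (a simplified sketch, not my actual chunkify_note, which also carries the labels and note id along; names like chunk_text_sketch and max_len are just for illustration):

    def chunk_text_sketch(text, tokenizer, max_len=512):
        """Simplified sketch of the chunking idea: greedily pack whole words into
        chunks so each chunk stays under the 512-token limit once the [CLS]
        token and padding are accounted for."""
        budget = max_len - 1  # reserve one position for [CLS]
        chunks, current_words, current_len = [], [], 0
        for word in text.split():
            n_pieces = len(tokenizer.tokenize(word))  # how many WordPieces this word becomes
            if current_words and current_len + n_pieces > budget:
                chunks.append(' '.join(current_words))
                current_words, current_len = [], 0
            current_words.append(word)
            current_len += n_pieces
        if current_words:
            chunks.append(' '.join(current_words))
        return chunks  # each chunk later gets [CLS] prepended and is padded out to max_len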

import re
import torch

def predict(test_texts, test_note_ids, model, tokenizer):
    model.eval()
    num_notes = 0
    for test_text, test_note_id in zip(test_texts, test_note_ids):
        num_notes += 1
        offset = 0  # character offset into the original note text, reset per note
        chunkified_tokenized_texts, _, _ = chunkify_note(test_text, None, test_note_id)
        predictions = []
        for split_text in chunkified_tokenized_texts:
            tokenized_test_text = tokenizer.encode(split_text)

            # Record each whitespace-delimited word with its start/end character
            # indices in the original note text.
            tok_list = []
            for m in re.finditer(r'\S+', split_text):
                token = m.group(0)
                sub_tokens = tokenizer.tokenize(token)  # subword pieces (not used below)
                tok_list.append((token, offset + m.start(), offset + m.end() - 1))
            offset += len(split_text)  # accumulate so indices stay relative to the full note

            # Run the chunk through the model; no gradients needed at inference time.
            if torch.cuda.is_available():
                input_ids = torch.tensor([tokenized_test_text]).cuda()
            else:
                input_ids = torch.tensor([tokenized_test_text])
            with torch.no_grad():
                output = model(input_ids)
            if torch.cuda.is_available():
                scores = output[0].detach().cpu().numpy()
            else:
                scores = output[0].detach().numpy()

            # Collapse subword-level logits back to word-level predictions.
            pred = aggregate(scores, tokenizer, tok_list)

            predictions.extend(pred)  # then do some extra processing to store this
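
For reference, the word-level aggregation I described works roughly like this (a simplified sketch of the averaging idea, not my exact aggregate function; names here are just for illustration):

    import numpy as np

    def aggregate_sketch(scores, tokenizer, tok_list):
        """Sketch: average the logits of a word's subword pieces and take the
        argmax as that word's predicted tag id. Assumes scores has shape
        (1, seq_len, num_labels) and that position 0 is the [CLS] token."""
        logits = scores[0]
        preds = []
        position = 1  # skip [CLS]
        for word, start_idx, end_idx in tok_list:
            pieces = tokenizer.tokenize(word)  # e.g. 'PHYSICIAN' -> ['P', '##H', '##Y', ...]
            word_logits = logits[position:position + len(pieces)]
            label_id = int(np.argmax(word_logits.mean(axis=0)))  # average over pieces, then argmax
            preds.append((word, start_idx, end_idx, label_id))
            position += len(pieces)
        return preds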

Input IDs for one input chunk:
[ 101 1239 1104 115 115 115 115 115 117 115 115 115
115 115 115 115 115 115 115 4183 2524 115 115 115
115 115 115 115 115 115 115 115 115 115 115 115
117 115 115 115 115 115 115 115 115 115 115 115
115 115 115 115 115 115 115 115 115 117 9292 117
7735 115 115 115 115 115 115 115 115 115 115 117
115 115 115 115 115 115 115 115 115 115 118 115
115 115 115 115 115 115 115 115 115 115 115 115
115 115 2524 11341 131 113 115 115 115 115 115 114
115 115 115 115 115 118 115 115 115 115 115 115
115 115 115 115 115 115 115 115 115 119 115 115
115 115 115 117 9292 117 7735 143 7897 131 113 115
115 115 115 115 114 115 115 115 115 115 118 115
115 115 115 115 9666 9970 18653 11922 131 115 115 115
115 115 137 115 115 115 115 115 119 115 115 115
115 115 115 115 115 115 115 115 115 115 115 115
115 115 115 115 115 117 7735 115 115 115 115 115
115 115 115 115 115 117 9292 117 7735 150 13901 8231
2591 10783 2069 8544 24162 13901 2346 2349 3663 155 16668 9565
1942 7195 9080 10208 131 115 115 115 115 115 117 115
115 115 115 115 115 115 115 115 115 108 131 115
115 115 115 115 118 115 115 115 115 115 2508 1181
119 11336 1665 119 108 131 115 115 115 115 115 159
26868 1204 108 131 115 115 115 115 115 2516 14265 131
1429 120 1407 120 1446 26316 131 1492 2162 9850 131 9714
11336 21437 131 1367 120 5037 120 1446 140 19526 131 115
115 115 115 115 141 2346 2064 131 1429 120 1429 120
2679 113 4936 131 4062 114 7642 6834 27989 1389 113 188
114 131 115 115 115 115 115 115 115 115 115 115
119 115 115 115 115 115 113 113 115 115 115 115
115 114 115 115 115 115 115 118 115 115 115 115
115 114 115 115 115 115 115 115 115 115 115 115
113 113 115 115 115 115 115 114 115 115 115 115
115 118 115 115 115 115 115 114 2687 7136 6902 131
15075 10584 1179 118 4275 117 18311 16274 118 11783 7918 1113
2525 15766 119 115 115 115 115 115 118 115 115 115
115 115 115 115 115 115 115 131 4114 7209 15961 145
1775 131 115 115 115 115 115 119 5539 1580 115 115
115 115 115 115 115 115 115 115 7277 1643 17489 27259
1118 143 6258 3048 10722 23904 9664 18082 2069 3663 153 3048
3663 13882 19747 14962 15969 12880 20336 16941 9159 21669 11414 131
151 12606 5838 131 17037 26271 3652 4000 1185 0 0 0
0 0 0 0 0 0 0 0]

Tag values (label IDs) for the same chunk:
[ 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
9 9 9 9 9 9 9 9 9 9 9 9 6 6 6 6 4 4 4 4 4 27 27 27
27 27 27 27 27 27 27 27]

Attention mask:
[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]