Hi everyone,
I would like to fine-tune RoBERTa for token-level classification. To understand the model's behavior, I am using the popular WNUT dataset. The code snippet I use is as follows:
from datasets import load_dataset
from transformers import AutoTokenizer

# add_prefix_space=True is required when feeding pre-split words to RoBERTa's byte-level BPE tokenizer
tokenizer = AutoTokenizer.from_pretrained("roberta-base", add_prefix_space=True)

wnut = load_dataset("wnut_17")
example = wnut["train"][0]

# the dataset is already split into words, hence is_split_into_words=True
tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
When the example is tokenized with the model's tokenizer, it adds a strange character (Ġ) in front of most tokens, as follows:
['<s>', 'Ġ@', 'p', 'aul', 'walk', 'ĠIt', "Ġ'", 's', 'Ġthe', 'Ġview', 'Ġfrom', 'Ġwhere', 'ĠI', "Ġ'", 'm', 'Ġliving', 'Ġfor', 'Ġtwo', 'Ġweeks', 'Ġ.', 'ĠEmpire', 'ĠState', 'ĠBuilding', 'Ġ=', 'ĠES', 'B', 'Ġ.', 'ĠPretty', 'Ġbad', 'Ġstorm', 'Ġhere', 'Ġlast', 'Ġevening', 'Ġ.', '</s>']
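For what it's worth, here is a quick sanity check I ran with the same tokenizer (a minimal sketch; the expected outputs in the comments are my assumptions based on the tokens above). Decoding the ids seems to give the original sentence back, and word_ids() still maps each sub-token to its source word, which is what I intend to use for aligning the NER labels:

# decoding should print the original sentence, with Ġ rendered back as spaces
print(tokenizer.decode(tokenized_input["input_ids"], skip_special_tokens=True))

# word_ids() maps each sub-token back to the word it came from;
# e.g. [None, 0, 0, 0, 0, 1, 2, 2, 3, ...] with None for <s> and </s>
print(tokenized_input.word_ids())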
Is this expected behavior, or am I initializing the tokenizer incorrectly?
Thanks a lot.