Using RoBERTa for token classification: strange characters (Ġ)

Hi everyone,

I would like to fine-tune RoBERTa for token-level classification. To understand the model's behavior, I am using the popular WNUT 17 dataset. The code snippet I use is as follows:

    from datasets import load_dataset
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("roberta-base", add_prefix_space=True)

    wnut = load_dataset("wnut_17")
    example = wnut["train"][0]

    # The dataset is already split into words, hence is_split_into_words=True
    tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
    tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])

When the example is tokenized with the model's tokenizer, a strange character (Ġ) is prepended to many of the tokens:

['<s>', 'Ġ@', 'p', 'aul', 'walk', 'ĠIt', "Ġ'", 's', 'Ġthe', 'Ġview', 'Ġfrom', 'Ġwhere', 'ĠI', "Ġ'", 'm', 'Ġliving', 'Ġfor', 'Ġtwo', 'Ġweeks', 'Ġ.', 'ĠEmpire', 'ĠState', 'ĠBuilding', 'Ġ=', 'ĠES', 'B', 'Ġ.', 'ĠPretty', 'Ġbad', 'Ġstorm', 'Ġhere', 'Ġlast', 'Ġevening', 'Ġ.', '</s>']
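
For context, a quick sanity check one could run is decoding the IDs back to a string (a minimal sketch; skip_special_tokens just drops the <s>/</s> markers):

    # Decode the input IDs back to text; if Ġ merely encodes a leading
    # space, the decoded string should match the original sentence.
    decoded = tokenizer.decode(tokenized_input["input_ids"], skip_special_tokens=True)
    print(decoded)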

Is this expected, or am I initializing the tokenizer incorrectly?

Thanks a lot.