Using RoBERTa for token classification: strange characters (Ġ)

Hi everyone,

I would like to fine-tune RoBERTa for token-level classification. To understand the model's behavior, I am using the popular WNUT 17 dataset. The code snippet I use is as follows:

    from datasets import load_dataset
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("roberta-base", add_prefix_space=True)

    wnut = load_dataset("wnut_17")
    example = wnut["train"][0]

    # The dataset is already split into words, hence is_split_into_words=True
    tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
    tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])

When the example is tokenized with the model's tokenizer, a strange character (Ġ) is prepended to many of the tokens:

['<s>', 'Ġ@', 'p', 'aul', 'walk', 'ĠIt', "Ġ'", 's', 'Ġthe', 'Ġview', 'Ġfrom', 'Ġwhere', 'ĠI', "Ġ'", 'm', 'Ġliving', 'Ġfor', 'Ġtwo', 'Ġweeks', 'Ġ.', 'ĠEmpire', 'ĠState', 'ĠBuilding', 'Ġ=', 'ĠES', 'B', 'Ġ.', 'ĠPretty', 'Ġbad', 'Ġstorm', 'Ġhere', 'Ġlast', 'Ġevening', 'Ġ.', '</s>']
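
For context, a quick sanity check one could run is decoding the IDs back to a string (a minimal sketch; skip_special_tokens just drops the <s>/</s> markers):

    # Decode the input IDs back to text; if Ġ merely encodes a leading
    # space, the decoded string should match the original sentence.
    decoded = tokenizer.decode(tokenized_input["input_ids"], skip_special_tokens=True)
    print(decoded)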

Is this expected, or am I initializing the tokenizer incorrectly?

Thanks a lot.