No PreTrainedTokenizerFast for Deberta-V3, no doc_stride

numomcmc · July 13, 2022, 4:44pm

Hi All

I would like to fine-tune a DeBERTa-v3 PyTorch model on SQuAD v2 and then on other downstream Q&A tasks.

The problem is that the doc_stride option in my arguments is causing the following error:

NotImplementedError: return_offset_mapping is not available when using Python tokenizers.To use this feature, change your tokenizer to one deriving from transformers.PreTrainedTokenizerFast.More information on available tokenizers at https://github.com/huggingface/transformers/pull/2674

because there is no PreTrainedTokenizerFast for deberta-v3 yet. So…

Can I use the deberta-v2 PreTrainedTokenizerFast instead? I would like to think that v3's switch to ELECTRA-style pre-training did not affect the tokenizer, so maybe I can get away with the v2 tokenizer? Is this just wishful thinking?

Also, it just so happens that v3 only comes in “base”, “large” and “xsmall”, while v2 only comes in the other sizes… I suppose that, because of the vocab size difference, token indices and embeddings will be different across different model sizes. Mixing and matching them sounds like a recipe for disaster…

Any suggestions on how to proceed are much appreciated!
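
For context, here is a rough sketch of what I understand doc_stride and the offset mapping to be doing, and why the error shows up. It is plain Python with a whitespace tokenizer, not the transformers implementation: `tokenize_with_offsets` and `split_with_stride` are hypothetical helper names, and I am assuming the stride is the number of tokens shared by consecutive windows.

```python
import re


def tokenize_with_offsets(text):
    """Toy whitespace tokenizer: returns (token, (char_start, char_end)) pairs."""
    return [(m.group(), (m.start(), m.end())) for m in re.finditer(r"\S+", text)]


def split_with_stride(tokens, max_len, overlap):
    """Split tokens into windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens (assumed meaning of the stride).
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    step = max_len - overlap
    windows, start = [], 0
    while True:
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        start += step
    return windows


context = ("DeBERTa V3 replaces masked language modeling with "
           "ELECTRA style replaced token detection")
tokens = tokenize_with_offsets(context)
windows = split_with_stride(tokens, max_len=6, overlap=2)

# The character offsets are what let a predicted token span be mapped back
# to the original text, which is why an offset mapping is required.
answer_start = context.index("ELECTRA")
windows_with_answer = [
    i for i, window in enumerate(windows)
    if any(start == answer_start for _, (start, _) in window)
]
```

Because of the overlap, the answer token lands in more than one window, and the offsets are the only way to tell that both hits point at the same characters in the context. A slow (pure Python) tokenizer does not return these offsets, hence the NotImplementedError.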

