How to fine-tune the DeBERTa model on the SQuAD dataset?

Hi All,

I am trying to fine-tune the DeBERTa model on SQuAD. If I try existing notebooks like this one, they use the Trainer and a fast tokenizer.

DeBERTa doesn’t have fast tokenizer support yet. How can I fine-tune it on SQuAD?

I am also willing to implement a fast tokenizer for the DeBERTa model. Can anyone point me to resources so I can get started on that?

Here is my notebook for training DeBERTa (I am facing issues).

Hi @bhadresh-savani, as far as I can tell the problem lies with your find_sublist_indices function, not with the availability of a fast tokenizer.

One simple thing to try: can you pass a slice of examples to your convert_to_features function?


I’m not sure whether this will solve the problem, but perhaps your find_sublist_indices expects a list of lists, which is what you’ll get from the slice.
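To illustrate the point about slices: in 🤗 Datasets, `dataset[:2]` returns a dict of lists rather than a single example, so a processing function has to loop over those lists. A minimal sketch (the field names and the body of `convert_to_features` here are placeholders, not the poster's actual code):

```python
# A slice like dataset[:2] in 🤗 Datasets comes back as a dict of lists,
# so a batched processing function must iterate over the lists.
examples = {
    "question": ["Who wrote Hamlet?", "What is SQuAD?"],
    "context": [
        "Hamlet was written by Shakespeare.",
        "SQuAD is a question answering dataset.",
    ],
}

def convert_to_features(batch):
    # illustrative body only; real preprocessing would tokenize here
    return {"context_length": [len(c) for c in batch["context"]]}

features = convert_to_features(examples)
```

If the function was written to handle one example at a time, calling it on a slice like this will surface the mismatch immediately.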

I also noticed that your convert_to_features function is quite different from the prepare_train_features in the tutorial - what happens if you try the latter with your tokenizer?

If that doesn’t work, then you might be able to use the old script that doesn’t rely on fast tokenizers: transformers/examples/legacy/question-answering at master · huggingface/transformers · GitHub


Hi @lewtun

Thanks for your answer

I tried running the old version (v3.5.1) while keeping the latest version of the file with a few changes (I needed the QuestionAnsweringModelOutput class for SQuAD-style training).

I was getting the error below:

    Traceback (most recent call last):
      File "", line 820, in <module>
      File "", line 734, in main
        model = AutoModelForQuestionAnswering.from_pretrained(
      File "/media/data2/anaconda/envs/transformers-hugginface/lib/python3.8/site-packages/transformers/", line 1330, in from_pretrained
        raise ValueError(
    ValueError: Unrecognized configuration class <class 'transformers.configuration_deberta.DebertaConfig'> for this kind of AutoModel: AutoModelForQuestionAnswering.
    Model type should be one of DistilBertConfig, AlbertConfig, CamembertConfig, BartConfig, LongformerConfig, XLMRobertaConfig, RobertaConfig, SqueezeBertConfig, BertConfig, XLNetConfig, FlaubertConfig, MobileBertConfig, XLMConfig, ElectraConfig, ReformerConfig, FunnelConfig, LxmertConfig.
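The error means that in v3.5.1 there is no DeBERTa entry in the AutoModelForQuestionAnswering mapping. One workaround (a sketch, not the library's API - the class name here is illustrative) is to put a span-prediction head on top of DebertaModel yourself; the head is just a linear layer producing start/end logits, following the same pattern as BertForQuestionAnswering:

```python
import torch
from torch import nn

class SpanPredictionHead(nn.Module):
    """Minimal QA head: maps hidden states to start/end logits.

    One could wrap DebertaModel's sequence output with a head like this
    to work around the missing AutoModel mapping (illustrative sketch).
    """
    def __init__(self, hidden_size: int):
        super().__init__()
        self.qa_outputs = nn.Linear(hidden_size, 2)

    def forward(self, sequence_output: torch.Tensor):
        logits = self.qa_outputs(sequence_output)       # (batch, seq_len, 2)
        start_logits, end_logits = logits.split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)

head = SpanPredictionHead(hidden_size=768)
hidden = torch.randn(2, 16, 768)  # stand-in for DebertaModel's last hidden state
start_logits, end_logits = head(hidden)
```

The start/end logits can then be fed into the usual cross-entropy loss against the answer span positions during training.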

I created find_sublist_indices by taking reference from this notebook, which uses a fast tokenizer; I am trying to do the same without a fast tokenizer.
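For reference, a function with that name typically locates the tokenized answer inside the tokenized context; a plain-Python sketch (the body here is my own reimplementation, not the notebook's code) could look like:

```python
def find_sublist_indices(sublist, full_list):
    """Return (start, end) indices (inclusive) of the first occurrence
    of `sublist` inside `full_list`, or None if it is not present."""
    n = len(sublist)
    if n == 0:
        return None
    for i in range(len(full_list) - n + 1):
        if full_list[i:i + n] == sublist:
            return (i, i + n - 1)
    return None

# e.g. locating answer token ids inside context token ids
context_ids = [101, 7592, 2088, 2003, 2307, 102]
answer_ids = [2088, 2003]
```

One caveat with this approach: the answer tokenized on its own may differ from the same characters tokenized inside the context (subword merges at the boundaries), which is exactly what the fast tokenizer's offset mapping avoids.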

The fast tokenizer has a method called char_to_token; I am trying to implement the same thing on top of the Python-based (slow) tokenizer.
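A slow-tokenizer stand-in for char_to_token can be approximated by walking the tokens and matching each one's surface form back into the original text. This is only a sketch: it assumes WordPiece-style "##" continuation markers and no normalization between text and tokens, so DeBERTa's BPE would need a different cleanup step:

```python
def char_to_token(tokens, text, char_index):
    """Map a character position in `text` to the index of the token
    covering it, or None. `tokens` is the output of a slow
    tokenizer's .tokenize(text)."""
    cursor = 0
    for i, tok in enumerate(tokens):
        # WordPiece cleanup (an assumption; adjust for BPE/SentencePiece)
        surface = tok[2:] if tok.startswith("##") else tok
        start = text.find(surface, cursor)
        if start == -1:
            continue  # special/unknown token with no surface form
        end = start + len(surface)
        if start <= char_index < end:
            return i
        cursor = end
    return None

tokens = ["hello", "wo", "##rld"]  # what a WordPiece-style .tokenize() might return
text = "hello world"
```

Positions that fall on whitespace between tokens map to None, mirroring the fast tokenizer's behavior for characters not covered by any token.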

Hi @valhalla,

Can you tell me how I can use the same notebook without a fast tokenizer, since it was created by you?