I’m trying to instantiate a tokenizer from a vocab file after it’s been read into Python. This is because I want to decouple reading objects from disk from model loading, so I want to load the files into Python in a different way and then use those Python objects to instantiate the Hugging Face objects. I can do this with the model itself like this:
```python
import io
import json

import torch
from transformers import DistilBertConfig, DistilBertModel

with open('pytorch_model.bin', 'rb') as f:
    buffer = io.BytesIO(f.read())

with open('config.json', 'r') as f:
    config = DistilBertConfig.from_dict(json.load(f))

torch_model = torch.load(buffer, map_location=torch.device('cpu'))
model_test = DistilBertModel.from_pretrained(
    pretrained_model_name_or_path=None,
    state_dict=torch_model,
    config=config,
)
```
But I can’t find a way to do it with the tokenizer. Does anyone have an idea how to do this?
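One workaround I can think of (a sketch, not a full in-memory solution): since `DistilBertTokenizer` expects a `vocab_file` path, the vocab text that has already been read into Python could be written to a temporary file and the tokenizer constructed from that path. The `vocab_text` string and the tiny vocab below are hypothetical placeholders, assuming a BERT-style one-token-per-line `vocab.txt`:

```python
import tempfile

from transformers import DistilBertTokenizer

# Hypothetical: the vocab.txt contents, already loaded into Python as a string.
vocab_text = "[PAD]\n[UNK]\n[CLS]\n[SEP]\nhello\nworld\n"

# Round-trip through a temporary file, since the tokenizer wants a path.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write(vocab_text)
    vocab_path = f.name

tokenizer = DistilBertTokenizer(vocab_file=vocab_path)
print(tokenizer.tokenize("hello world"))
```

This still touches disk briefly, so it only partially decouples the two steps; it just moves the file handling under your control rather than the library’s.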