Hello,
I am trying to train the GPT-2 model on SageMaker and have uploaded my training dataset to a private repo on Hugging Face.
Here is my code:
# Authenticate to the Hugging Face Hub
from huggingface_hub import notebook_login
notebook_login()
hyperparameters = {
    'model_name_or_path': 'gpt2',
    'output_dir': '/opt/ml/model',
    'dataset_name': 'E1l1dh/frl_training_dataset',
    'do_train': True,
    'do_eval': True,
    'per_device_eval_batch_size': 2,
    'per_device_train_batch_size': 2,
    'gradient_accumulation_steps': 8,
}
# configuration for running training on smdistributed Model Parallel
mpi_options = {
    "enabled": True,
    "processes_per_host": 4,
}
smp_options = {
    "enabled": True,
    "parameters": {
        "microbatches": 2,
        "placement_strategy": "spread",
        "pipeline": "interleaved",
        "optimize": "speed",
        "partitions": 2,
        "ddp": True,
        # "block_size" = 256,
    }
}
distribution = {
    "smdistributed": {"modelparallel": smp_options},
    "mpi": mpi_options
}
# git configuration to download our fine-tuning script
git_config = {'repo': 'https://github.com/huggingface/transformers.git', 'branch': 'v4.17.0'}
# creates Hugging Face estimator
# (role and output_bucket are defined earlier in the notebook)
from sagemaker.huggingface import HuggingFace
huggingface_estimator = HuggingFace(
    entry_point='run_clm.py',
    source_dir='./examples/pytorch/language-modeling',
    instance_type='ml.p3.8xlarge',
    git_config=git_config,
    instance_count=1,
    role=role,
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',
    hyperparameters=hyperparameters,
    # metric_definitions=metric_definitions,
    output_path=output_bucket,
    base_job_name='GPT2-v1',
    distribution=distribution
)
# starting the train job
huggingface_estimator.fit(inputs={'training': training_input_path,
                                  'test': test_input_path})
And here is the error message I receive:
[1,mpirank:3,algo-1]<stderr>: raise RepositoryNotFoundError(message, response) from e
[1,mpirank:3,algo-1]<stderr>:huggingface_hub.utils._errors.RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-63eb9c53-27586b8b2270acf375b087b8)
[1,mpirank:3,algo-1]<stderr>:
[1,mpirank:3,algo-1]<stderr>:Repository Not Found for url: https://huggingface.co/api/datasets/E1l1dh/frl_training_dataset.
[1,mpirank:3,algo-1]<stderr>:Please make sure you specified the correct `repo_id` and `repo_type`.
[1,mpirank:3,algo-1]<stderr>:If the repo is private, make sure you are authenticated.
[1,mpirank:3,algo-1]<stderr>:Invalid username or password.
------------------------------------------------------------
I am a little confused about where I should specify repo_id and repo_type when running a training job on SageMaker. Do I need to pass them through as hyperparameters, or do I need to change the training script slightly? Any help would be appreciated, thanks!