Repository Not Found Error when using custom dataset to train model on SageMaker


I am trying to train the GPT-2 model on SageMaker and have uploaded my training dataset to a private repo on the Hugging Face Hub.

Here is my code:

# Authentication to the Hugging Face Hub
from huggingface_hub import notebook_login


hyperparameters = {
    'dataset_name': 'E1l1dh/frl_training_dataset',
    'do_train': True,
    'do_eval': True,
}

# configuration for running training on smdistributed Model Parallel
mpi_options = {
    "enabled": True,
    "processes_per_host": 4,
}

smp_options = {
    "parameters": {
        "microbatches": 2,
        "placement_strategy": "spread",
        "pipeline": "interleaved",
        "optimize": "speed",
        "partitions": 2,
        "ddp": True,
        # "block_size": 256,
    },
}

distribution = {
    "smdistributed": {"modelparallel": smp_options},
    "mpi": mpi_options,
}

# git configuration to download our fine-tuning script
git_config = {'repo': '','branch': 'v4.17.0'}

# creates the Hugging Face estimator
huggingface_estimator = HuggingFace(
    git_config = git_config,
    hyperparameters = hyperparameters,
#     metric_definitions = metric_definitions,
    output_path = output_bucket,
    base_job_name = 'GPT2-v1',
    distribution = distribution,
)

# starting the training job
huggingface_estimator.fit({'training': training_input_path})

And here is the error message I receive:

[1,mpirank:3,algo-1]<stderr>:    raise RepositoryNotFoundError(message, response) from e
[1,mpirank:3,algo-1]<stderr>:huggingface_hub.utils._errors.RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-63eb9c53-27586b8b2270acf375b087b8)
[1,mpirank:3,algo-1]<stderr>:Repository Not Found for url:
[1,mpirank:3,algo-1]<stderr>:Please make sure you specified the correct `repo_id` and `repo_type`.
[1,mpirank:3,algo-1]<stderr>:If the repo is private, make sure you are authenticated.
[1,mpirank:3,algo-1]<stderr>:Invalid username or password.

I am a little confused about where I should be specifying repo_id and repo_type when running a training job on SageMaker. Do I need to pass them through as hyperparameters, or do I need to change the training script slightly? Any help would be appreciated, thanks!

cc @philschmid

If it is a private repository, you have to provide your token, which I cannot see you doing here.
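For example, one way to get the token to the training containers is to export it as the HUGGING_FACE_HUB_TOKEN environment variable, which huggingface_hub reads automatically. This is a sketch, not the original poster's code: the access_token value is a placeholder, and the estimator's environment keyword is shown commented out as an assumption about how it would be wired in.

```python
import os

# Placeholder token for illustration only, e.g. copied from your
# Hugging Face account settings page.
access_token = "hf_xxx"

# huggingface_hub picks this variable up automatically, so load_dataset()
# inside the training script can authenticate against the private repo.
environment = {"HUGGING_FACE_HUB_TOKEN": access_token}

# The SageMaker estimator can forward it to the training containers, e.g.:
# huggingface_estimator = HuggingFace(..., environment=environment)
os.environ.update(environment)
```

The key point is that notebook_login() only stores a token on the machine where it runs; the training job executes on separate instances, so the token has to be passed to them explicitly.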

I am authenticating here?

from huggingface_hub import notebook_login


I am also passing through the token when I load the dataset:

data_files = {"train": 'frl_training_input_dataset/fine_tuning_dataset_train.csv', 
              "test": 'frl_training_input_dataset/fine_tuning_dataset_test.csv'}

dataset = load_dataset("E1l1dh/frl_training_dataset", data_files=data_files, use_auth_token=access_token)

@philschmid Where else should I be providing the token? I can see push_to_hub, but I understood this to be for pushing the model after training.
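On the training-script side, a minimal sketch of reading such a token looks like the following. The get_hub_token helper and the HUGGING_FACE_HUB_TOKEN variable name are assumptions for illustration; the commented load_dataset call mirrors the one from the question.

```python
import os

def get_hub_token(default=None):
    # Read the token that the estimator's environment (if configured)
    # would have forwarded into the training container.
    return os.environ.get("HUGGING_FACE_HUB_TOKEN", default)

# Inside the training script, the dataset load from the question would then
# authenticate with that token:
# dataset = load_dataset("E1l1dh/frl_training_dataset",
#                        data_files=data_files,
#                        use_auth_token=get_hub_token())
```

If the variable is not set in the container, get_hub_token() returns None (or the supplied default), which reproduces the unauthenticated 401 behavior seen in the error log.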