Repository Not Found Error when using custom dataset to train model on SageMaker

Hello,

I am trying to train the GPT-2 model on SageMaker and have uploaded my training dataset to a private repo on Hugging Face.

Here is my code:

# Authenticate to the Hugging Face Hub
from huggingface_hub import notebook_login

notebook_login()

hyperparameters = {
    'model_name_or_path': 'gpt2',
    'output_dir': '/opt/ml/model',
    'dataset_name': 'E1l1dh/frl_training_dataset',
    'do_train': True,
    'do_eval': True,
    'per_device_eval_batch_size': 2,
    'per_device_train_batch_size': 2,
    'gradient_accumulation_steps': 8,
}

# configuration for running training on smdistributed Model Parallel
mpi_options = {
    "enabled" : True,
    "processes_per_host" : 4,
}
smp_options = {
    "enabled":True,
    "parameters": {
        "microbatches": 2,
        "placement_strategy": "spread",
        "pipeline": "interleaved",
        "optimize": "speed",
        "partitions": 2,
        "ddp": True,
#         "block_size" = 256,
    }
}

distribution={
    "smdistributed": {"modelparallel": smp_options},
    "mpi": mpi_options
}

# git configuration to download our fine-tuning script
git_config = {'repo': 'https://github.com/huggingface/transformers.git','branch': 'v4.17.0'}

# creates Hugging Face estimator
from sagemaker.huggingface import HuggingFace

huggingface_estimator = HuggingFace(
    entry_point='run_clm.py',
    source_dir='./examples/pytorch/language-modeling',
    instance_type='ml.p3.8xlarge',
    git_config=git_config,
    instance_count=1,
    role=role,
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',
    hyperparameters=hyperparameters,
#     metric_definitions = metric_definitions,
    output_path=output_bucket,
    base_job_name='GPT2-v1',
    distribution=distribution
)

# starting the train job
huggingface_estimator.fit(inputs={'training':training_input_path,
                                 'test':test_input_path})

And here is the error message I receive:

[1,mpirank:3,algo-1]<stderr>:    raise RepositoryNotFoundError(message, response) from e
[1,mpirank:3,algo-1]<stderr>:huggingface_hub.utils._errors.RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-63eb9c53-27586b8b2270acf375b087b8)
[1,mpirank:3,algo-1]<stderr>:
[1,mpirank:3,algo-1]<stderr>:Repository Not Found for url: https://huggingface.co/api/datasets/E1l1dh/frl_training_dataset.
[1,mpirank:3,algo-1]<stderr>:Please make sure you specified the correct `repo_id` and `repo_type`.
[1,mpirank:3,algo-1]<stderr>:If the repo is private, make sure you are authenticated.
[1,mpirank:3,algo-1]<stderr>:Invalid username or password.
------------------------------------------------------------

I am a little confused about where I should specify repo_id and repo_type when running a training job on SageMaker. Do I need to pass them as hyperparameters, or do I need to change the training script slightly? Any help would be appreciated, thanks!

cc @philschmid

If it is a private repository, you have to provide your token, which I cannot see you doing here.

Am I not authenticating here?

from huggingface_hub import notebook_login

notebook_login()

I am also passing through the token when I load the dataset:

data_files = {"train": 'frl_training_input_dataset/fine_tuning_dataset_train.csv', 
              "test": 'frl_training_input_dataset/fine_tuning_dataset_test.csv'}

dataset = load_dataset("E1l1dh/frl_training_dataset", data_files=data_files, use_auth_token=access_token)

@philschmid Where else should I be providing the token? I can see push_to_hub, but I understood that to be for pushing the model after training.
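
A note on why the login above may not be enough: notebook_login() stores the token on the machine running the notebook, while run_clm.py executes inside a separate SageMaker training container that does not see that stored token. One possible way to forward it is the estimator's `environment` argument. The sketch below is only an illustration under assumptions: `hub_token_environment` is a hypothetical helper (not part of sagemaker or huggingface_hub), and it assumes the huggingface_hub release inside the container reads the token from `HUGGING_FACE_HUB_TOKEN` or `HF_TOKEN`, which has not been verified for the 4.17.0 image.

```python
import os


def hub_token_environment(token):
    """Build environment variables that forward a Hub token to a training job.

    Hypothetical helper for illustration; not part of sagemaker or huggingface_hub.
    """
    if not token or not token.strip():
        raise ValueError("A non-empty Hugging Face access token is required.")
    token = token.strip()
    # Assumption: the huggingface_hub version in the container reads one of
    # these variable names. Both are set to cover older and newer releases.
    return {"HUGGING_FACE_HUB_TOKEN": token, "HF_TOKEN": token}


if __name__ == "__main__":
    # Read the token from the notebook's environment instead of hard-coding it.
    token = os.environ.get("HF_TOKEN", "")
    if token:
        environment = hub_token_environment(token)
        # Then pass it when creating the estimator, for example:
        # huggingface_estimator = HuggingFace(..., environment=environment)
        print(sorted(environment))
```

run_clm.py also exposes a use_auth_token argument, but whether the v4.17.0 script forwards it to load_dataset would need to be checked in that tag's source before relying on it as a hyperparameter.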