Train end-to-end text classification on SageMaker


I’m following the guidance on training text classification with my own dataset; see
notebooks/sagemaker-notebook.ipynb at master · huggingface/notebooks · GitHub

I have two questions:

  • Does the label column of the dataset only support ints? In other words, do I need to preprocess my data and convert categories to 1, 2, 3…?
  • Do I need to specify the number of classes? If so, where?



Hey @jackieliu930,

  1. Yes, the labels need to be int values.
  2. Yes, you need to pass the number of classes to the .from_pretrained method here: notebooks/ at 3fdb8bd61ed2f2b499dcd55034b1ee58be5cfabb · huggingface/notebooks · GitHub
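To illustrate both points, here is a minimal sketch of encoding string categories as int labels and deriving the class count to pass as `num_labels` to `from_pretrained`. The column name and category values are hypothetical; adapt them to your dataset.

```python
# Hypothetical string categories from your dataset's label column
categories = ["sports", "politics", "sports", "tech"]

# Build a stable mapping from category name to integer id (0, 1, 2, ...)
label2id = {name: i for i, name in enumerate(sorted(set(categories)))}
id2label = {i: name for name, i in label2id.items()}

# Encode the label column as ints, as the Trainer expects
encoded_labels = [label2id[c] for c in categories]

# This is the value to pass to
# AutoModelForSequenceClassification.from_pretrained(..., num_labels=num_labels)
num_labels = len(label2id)
```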

You could also run one of the example scripts via git_config; then you don’t need to provide your own training script.
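As a sketch of the git_config approach: the estimator below pulls a training script straight from the transformers examples, so no custom script is needed. This is a config fragment, not runnable as-is; the versions, instance type, role, and hyperparameters are illustrative assumptions you should replace with your own.

```python
from sagemaker.huggingface import HuggingFace

# Pull the training script directly from the transformers repo
# (assumption: branch pinned to an example version; match it to transformers_version)
git_config = {
    "repo": "https://github.com/huggingface/transformers.git",
    "branch": "v4.6.1",
}

huggingface_estimator = HuggingFace(
    entry_point="run_glue.py",                                 # example text-classification script
    source_dir="./examples/pytorch/text-classification",      # path inside the repo above
    git_config=git_config,
    instance_type="ml.p3.2xlarge",                             # assumption: pick your instance type
    instance_count=1,
    role="<your-sagemaker-execution-role-arn>",                # assumption: replace with your IAM role
    transformers_version="4.6.1",
    pytorch_version="1.7.1",
    py_version="py36",
    hyperparameters={
        "model_name_or_path": "distilbert-base-uncased",       # assumption: any HF model id works
        "output_dir": "/opt/ml/model",
    },
)
```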

Got it, thanks! By the way, I’m wondering: with the same parameter settings (epochs/batch size) on the same dataset, why does an OOM happen when I use the Hugging Face × SageMaker SDK, while it works fine with the plain PyTorch SDK?
Any clue on this one?

Did you use the same model, same dataset, same epoch & batch_size for train and eval, same instance type?

Yes, totally the same.

Also the same PyTorch and Transformers versions?