Hi,
I’m following the guidance on training text classification with my own dataset,
referring to notebooks/sagemaker-notebook.ipynb at master · huggingface/notebooks · GitHub
I have two questions:
- does the label column only support int values? In other words, do I need to preprocess my data and convert the categories to 1, 2, 3, …?
- do I need to specify the number of classes? If so, where?
thanks!
jackie
Hey @jackieliu930,
- Yes, the labels need to be int values.
- Yes, you need to modify the `.from_pretrained` call here: notebooks/train.py at 3fdb8bd61ed2f2b499dcd55034b1ee58be5cfabb · huggingface/notebooks · GitHub
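To illustrate both points, here is a minimal sketch. The category names are hypothetical, and the `from_pretrained` line in the comment is an assumption about how train.py would be adapted, not the script’s actual code:

```python
# Hypothetical category names -- replace with the classes in your dataset.
categories = ["billing", "shipping", "returns"]
label2id = {name: i for i, name in enumerate(categories)}
id2label = {i: name for name, i in label2id.items()}

# Convert a column of string categories into integer labels.
raw_labels = ["shipping", "billing", "returns", "shipping"]
labels = [label2id[c] for c in raw_labels]
print(labels)  # [1, 0, 2, 1]

# In train.py you would then pass the class count to from_pretrained,
# e.g. (assumption, adapt to the actual script):
# model = AutoModelForSequenceClassification.from_pretrained(
#     args.model_name, num_labels=len(label2id)
# )
```

Keeping `label2id`/`id2label` around also lets you map predictions back to readable category names later.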
You could also use the run_glue.py script from the examples via `git_config`; then you don’t need to provide your own training script.
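A rough sketch of that setup with the SageMaker Python SDK’s `HuggingFace` estimator. The branch, versions, instance type, role, and hyperparameters below are placeholders you’d adapt to your account and dataset:

```python
from sagemaker.huggingface import HuggingFace

# Pin the examples to a specific Transformers release (placeholder branch).
git_config = {
    "repo": "https://github.com/huggingface/transformers.git",
    "branch": "v4.6.1",
}

huggingface_estimator = HuggingFace(
    entry_point="run_glue.py",
    source_dir="./examples/pytorch/text-classification",
    git_config=git_config,
    instance_type="ml.p3.2xlarge",  # placeholder instance type
    instance_count=1,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    transformers_version="4.6.1",
    pytorch_version="1.7.1",
    py_version="py36",
    hyperparameters={
        "model_name_or_path": "distilbert-base-uncased",
        "do_train": True,
        "num_train_epochs": 3,
        "per_device_train_batch_size": 32,
        "output_dir": "/opt/ml/model",
    },
)

# huggingface_estimator.fit()  # launches the training job
```

With `git_config`, SageMaker clones the repo at job start and runs the named `entry_point`, so no custom train.py is needed.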
Got it, super thanks! By the way, I am wondering: with the same parameter settings (epochs/batch size) on the same dataset, why does an OOM occur when I use the Hugging Face × SageMaker SDK, while training works fine with the plain PyTorch SDK?
Any clue on this one?
Nice!
Did you use the same model, the same dataset, the same epochs and batch_size for train and eval, and the same instance type?
Also the same PyTorch and Transformers versions?