Question 1)

For a binary classification problem I could set `num_labels` to 1 (positive or not) or 2 (positive and negative). Is there any guideline regarding which setting is better? It seems that with 1 the probability would be calculated using the `sigmoid` function, and with 2 the probabilities would be calculated using the `softmax` function.
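To make the sigmoid/softmax comparison concrete, here is a minimal, library-agnostic sketch in plain Python. The logit values and the names `z_neg` and `z_pos` are made up for illustration. It only checks the mathematical relationship between the two functions, not how any particular library wires up `num_labels`.

```python
import math

def sigmoid(z):
    # Logistic function: maps one logit to a probability.
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the negative and positive class (two-output case).
z_neg, z_pos = -0.3, 1.2

# Probability of the positive class from a two-way softmax.
p_softmax = softmax([z_neg, z_pos])[1]

# The same probability from a sigmoid applied to the logit difference.
p_sigmoid = sigmoid(z_pos - z_neg)

print(p_softmax, p_sigmoid)
```

Under these assumptions the two numbers agree up to floating-point error, since a two-way softmax reduces to a sigmoid of the difference between the two logits. The two settings can still differ in practice in parameter count and in which loss a given library applies.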

Question 2)

In both cases, are my y labels going to be the same? Will each data point have a label of 0 or 1 rather than a one-hot encoding? For example, if I have 2 data points, would y be `0,1` and not `[0,0],[0,1]`?
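As an illustration of the label question, the sketch below feeds the same class-index labels `y = [0, 1]` to both loss formulations, written by hand in plain Python. The logits and the helper names `bce_with_logit` and `cross_entropy` are hypothetical, and this is not any specific library's API. Whether a given library wants integer or float labels for the single-output case, or treats a single output as regression, is something to check in its documentation.

```python
import math

# Class-index labels for two data points (not one-hot).
y = [0, 1]

def bce_with_logit(logit, label):
    # Binary cross-entropy on a single logit (the one-output case).
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

def cross_entropy(logits, label):
    # Softmax cross-entropy; `label` is used as an index into the logits.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[label]

# Made-up model outputs. The two-output logits are chosen so that
# (positive logit - negative logit) equals the single logit.
one_logit = [-2.0, 1.5]
two_logits = [[2.0, 0.0], [0.0, 1.5]]

loss_one = [bce_with_logit(z, c) for z, c in zip(one_logit, y)]
loss_two = [cross_entropy(z, c) for z, c in zip(two_logits, y)]

# If a one-hot form is ever needed, it can be derived from the indices.
one_hot = [[1 - c, c] for c in y]

print(loss_one, loss_two, one_hot)
```

In this sketch the class indices are enough for both formulations, and the per-example losses match because the logits were constructed to be equivalent.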

I have a very unbalanced classification problem where class 1 is present only 2% of the time. In my training data I am oversampling