Num_samples = 0, dataset not being read

I’m not really sure what the issue is with my dataset, or if its the argument im passing to autotrain. I’ve looked at other datasets(alpaca gpt4) and it looks correctly organized.

Hi there @smahorker,
I think the problem is your dataset formatting. Looking into the .csv file indicates to me that not all of your strings were parsed correctly, ie missing the “ “

But I can be wrong as I haven’t worked with that interface before. Hope it helps
Best,
Mike

Hi,

Thank you for the response.

I’ve updated my dataset so that human input was also in " ", which seems to have uniformed my csv. However I’m still facing the same issue. I’m not sure why my datasets formatting is off and the autotrainer runs fine on other data.

I would recommend to first read-in your dataset, ie in a Jupyter notebook and try to get some data from your dataset.
Looking at your csv file again, it still doesnt look right to me.

I would imagine that:

ds=load_dataset(“smahorker/discllm”)

Will throw an error due to the formatting issues.

Maybe it would help to understand the exact format that is required by the Llama2 model here.

Then I would take a step back and check

  1. data is loaded in correctly (printing, visualize)
  2. search for the Llama2 format, update your dataset (if necessary). Then send an example of your data to the model via transformers
  3. check that the models output tensor shape aligns with what is expected
    Good luck!