AutoTrain csv data format

Yoshkeen · November 23, 2023, 3:27pm

I have a question about AutoTrain csv data format.

Documentation says:

LLM finetuning accepts data in CSV format.

Data Format For SFT / Generic Trainer
For SFT / Generic Trainer, the data should be in the following format:

| text |
| --- |
| This is the first sentence. |
| This is the second sentence. |

I have looked at the example dataset and tried to structure my CSV the same way:

text
“### Human: question text ### Assistant: answer text”
“### Human: question text ### Assistant: answer text”

but this gives me the error: ValueError: 3 columns passed, passed data had 0 columns

Any advice?

PS: I was trying to train meta-llama/Llama-2-13b-chat-hf

abhishek · November 23, 2023, 3:52pm

The CSV should have one column named text, with all the data under it. The doc had a formatting issue; I've fixed it and the changes will be reflected in 10-15 minutes.
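
As an illustration of the one-column layout described above, here is a minimal sketch that writes such a file with Python's standard csv module. The function name write_text_csv, the file name train.csv, and the sample strings are made up for this example; quoting every field is a defensive choice so that commas, quotes, or line breaks inside a sample do not split it into extra columns.

```python
import csv


def write_text_csv(path, samples):
    """Write samples to a CSV file with a single column named "text"."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        # QUOTE_ALL wraps every field in double quotes, so commas and
        # line breaks inside a sample stay within one field.
        writer = csv.writer(f, quoting=csv.QUOTE_ALL)
        writer.writerow(["text"])  # header row: exactly one column
        for sample in samples:
            writer.writerow([sample])


if __name__ == "__main__":
    # Hypothetical samples in the "### Human / ### Assistant" style.
    samples = [
        "### Human: What is a CSV file? ### Assistant: A plain-text table, one row per line.",
        "### Human: Can a sample contain commas? ### Assistant: Yes, as long as the field is quoted.",
    ]
    write_text_csv("train.csv", samples)
```

Opening the resulting file in a text editor should show a first line containing only "text", followed by one quoted sample per record.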

Anne314159 · March 1, 2024, 2:36pm

Did you solve it?

GlennLR · March 8, 2024, 6:17am

I just can't get AutoTrain Advanced to work, even with the example data. It's always error 500: ERROR: Exception in ASGI application

Looks like AutoTrain Advanced as a no-code solution is a dead project for Hugging Face. No one seems to care, from what I can find.

abhishek · March 8, 2024, 6:37am

lol. Use GitHub issues and explain with errors, then someone will help. Help us to help you. Just saying 500 doesn't help me. AutoTrain is active and will always be, and is training thousands of models every day.

GlennLR · March 8, 2024, 6:57am

@abhishek If I understand correctly, you are very important here, so I will start with an apology: I did not want to insult anyone, in case that happened; I am just a little frustrated. I will gladly explain what I tried to do. From my understanding, the setup on Hugging Face is meant to be a no-code solution for beginners (which I certainly am). I created a Space on Hugging Face and got this UI here:

From the documentation here (LLM Finetuning) I downloaded the linked example dataset (timdettmers/openassistant-guanaco · Datasets at Hugging Face) and pressed start. That did not work, with the above-mentioned error. I tried to create a CSV file with a text column and added some texts from the above dataset. That did not work either. I made a CSV file with the little example from the documentation itself. Same issue.

The error message I provided above is the one from the log tab.

If the whole stack trace helps, here it is:

ERROR: Exception in ASGI application
Traceback (most recent call last):
  File "/app/env/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 428, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/app/env/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
    return await self.app(scope, receive, send)
  File "/app/env/lib/python3.10/site-packages/fastapi/applications.py", line 1106, in __call__
    await super().__call__(scope, receive, send)
  File "/app/env/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/app/env/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/app/env/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/app/env/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/app/env/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/app/env/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File "/app/env/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "/app/env/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/app/env/lib/python3.10/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/app/env/lib/python3.10/site-packages/starlette/routing.py", line 66, in app
    response = await func(request)
  File "/app/env/lib/python3.10/site-packages/fastapi/routing.py", line 274, in app
    raw_response = await run_endpoint_function(
  File "/app/env/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/app/env/lib/python3.10/site-packages/autotrain/app.py", line 459, in handle_form
    dset = AutoTrainDataset(**dset_args)
  File "<string>", line 13, in __init__
  File "/app/env/lib/python3.10/site-packages/autotrain/dataset.py", line 204, in __post_init__
    self.train_df, self.valid_df = self._preprocess_data()
  File "/app/env/lib/python3.10/site-packages/autotrain/dataset.py", line 213, in _preprocess_data
    train_df.append(pd.read_csv(file))
  File "/app/env/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1026, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/app/env/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 626, in _read
    return parser.read(nrows)
  File "/app/env/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1923, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
  File "/app/env/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 234, in read
    chunks = self._reader.read_low_memory(nrows)
  File "parsers.pyx", line 838, in pandas._libs.parsers.TextReader.read_low_memory
  File "parsers.pyx", line 905, in pandas._libs.parsers.TextReader._read_rows
  File "parsers.pyx", line 874, in pandas._libs.parsers.TextReader._tokenize_rows
  File "parsers.pyx", line 891, in pandas._libs.parsers.TextReader._check_tokenize_status
  File "parsers.pyx", line 2061, in pandas._libs.parsers.raise_parser_error
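
The trace above ends inside pandas' CSV tokenizer, and the final error message is not shown. One common cause of a failure at that point is a row with more fields than the header, for example unquoted text containing commas. The sketch below is a local pre-upload check built on that assumption. It uses only the standard csv module, and the function name check_text_csv is hypothetical; it is not part of AutoTrain or pandas.

```python
import csv


def check_text_csv(path, expected_header=("text",)):
    """Return a list of problems found in the CSV at path (empty list = none found)."""
    problems = []
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if header is None:
            return ["file is empty"]
        if tuple(header) != tuple(expected_header):
            problems.append(f"header is {header!r}, expected {list(expected_header)!r}")
        for row in reader:
            if not row:
                continue  # ignore completely blank lines
            if len(row) != len(header):
                # reader.line_num is the physical line where this row ended.
                problems.append(
                    f"line {reader.line_num}: {len(row)} fields, header has {len(header)}"
                )
    return problems
```

Calling check_text_csv("train.csv") on a file whose second line is an unquoted `hello, world` would report that line as having 2 fields against a 1-field header. Wrapping such text in double quotes is the usual fix.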

I also tried factory rebuilds and set up the Space three times. Always the same issue. I tried different models and both LLM Generic and SFT. I tried to play around with the parameters, which I honestly do not understand at all, and the link in the label (find params to copy paste here) does not work.

When I tried to get help outside the documentation and official material, I found out that there seems to have been a better product a few months ago, which worked the way I had expected. The whole UI seemed easier and more intuitive, and people in YouTube videos just made it work. I have found no material online with the UI I encountered. The few things I find are basically frustrated people like me.

The out-of-the-box experience for beginners like me is unfortunately very unpleasant, since I had hoped to get a first start here in training models. For me nothing seems to work, and while I am absolutely sure that it probably works if you know a little more about how it is supposed to work, or at least know Python, for a product advertised as easy to use it is pretty frustrating.

abhishek · March 8, 2024, 7:17am

It seems like a data formatting issue. The problem is that the example dataset needs to be converted to CSV. Here, I'm providing a correctly formatted CSV for the LLM SFT/Generic task that you can use. Please try it and let me know if it works for you. From the screenshot, it seems like you are not adding any GPU to the Space; this might cause issues and failures too, as LLM tasks need a GPU to run properly.

example csv: autotrain-example-datasets/alpaca1k.csv at main · huggingface/autotrain-example-datasets · GitHub
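
For converting a dataset yourself rather than using the ready-made file, a rough sketch follows. It assumes the example dataset has been downloaded as a JSON Lines file in which every record has a text field; that download format is an assumption, not something stated in this thread. The function name jsonl_to_text_csv and the file names are hypothetical.

```python
import csv
import json


def jsonl_to_text_csv(jsonl_path, csv_path, field="text"):
    """Copy the `field` value of each JSON Lines record into a one-column CSV.

    Returns the number of data rows written.
    """
    count = 0
    with open(jsonl_path, encoding="utf-8") as src, \
            open(csv_path, "w", newline="", encoding="utf-8") as dst:
        # Quote every field so commas and line breaks in a sample are preserved.
        writer = csv.writer(dst, quoting=csv.QUOTE_ALL)
        writer.writerow(["text"])  # single header column
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines between records
            record = json.loads(line)
            writer.writerow([record[field]])
            count += 1
    return count


if __name__ == "__main__":
    # Hypothetical local file names; adjust to wherever the data was saved.
    n = jsonl_to_text_csv("train.jsonl", "train.csv")
    print(f"wrote {n} rows")
```

The output has the same shape as the linked example file: a text header followed by one quoted sample per record.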

PS: We are always willing to help, but we also need details in order to figure out what might have gone wrong. From your response, it seems the documentation is also not good enough, so I'll work on improving it too this weekend.

GlennLR · March 8, 2024, 7:26am

Thanks! It works. Now I can look into the file, compare it with my own, see what I did wrong, and work with that. Really appreciate it. Have a great day!

If I may suggest something for the documentation: a simple step-by-step explanation with a working example configuration and already formatted example data that runs out of the box. This file is a great beginning.

abhishek · March 8, 2024, 7:53am

GlennLR wrote: "If I may suggest something for the documentation: a simple step-by-step explanation with a working example configuration and already formatted example data that runs out of the box. This file is a great beginning."

Will be done by this weekend.

marmotzero · March 21, 2024, 7:28pm

Is this the file that you made?

It took me a very long time to find (after I had similar issues to the OP). It would be helpful if it were also referenced in this tutorial page, since this tutorial does not cover data formatting at all:
