GPTQ quantization on Custom dataset

Following the tutorial from optimum, it says a list of strings can be passed as a dataset, but my question is: how do I pass a dataset that has multiple columns?

dataset = ["auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."]
quantization = GPTQConfig(bits=4, dataset=dataset, tokenizer=tokenizer)

Any further progress on this? I am facing the same issue.

I haven’t looked into this much; @VikrantKadam will look into it later.

As far as I can tell from the docs, the dataset argument of GPTQConfig expects either the name of a built-in calibration dataset (e.g. "c4") or a list of plain strings, where each string is one calibration sample — it does not accept dictionaries. So for a dataset with multiple columns, the usual approach is to flatten each row into a single string first, typically by keeping just the text column (or concatenating the columns you care about).

Here’s an example (the dataset name and column names are placeholders for your own data):

from datasets import load_dataset

# Load a multi-column dataset (e.g. columns: "text", "label", ...)
raw = load_dataset("your_dataset_name", split="train")

# GPTQ calibration only needs raw text, so extract the text column
# as a list of strings and drop the other columns.
dataset = [row["text"] for row in raw.select(range(128))]

quantization = GPTQConfig(bits=4, dataset=dataset, tokenizer=tokenizer)

If the relevant text is spread across several columns, join them into one string per row before building the list. Extra columns such as labels can simply be dropped — they play no role in quantization.
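A self-contained sketch of that flattening step, without any download (the rows and column names here are made up for illustration):

```python
# Hypothetical multi-column rows; "title", "body", and "label" are
# example column names, not a schema required by GPTQConfig.
rows = [
    {"title": "GPTQ", "body": "4-bit quantization of LLMs", "label": 1},
    {"title": "AWQ", "body": "activation-aware weight quantization", "label": 0},
]

# Join the text-bearing columns into one string per row and drop
# "label" — GPTQConfig only needs a list of strings for calibration.
calibration_texts = [f"{row['title']}: {row['body']}" for row in rows]

print(calibration_texts[0])  # GPTQ: 4-bit quantization of LLMs
```

The resulting calibration_texts list is exactly the shape GPTQConfig's dataset argument expects.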