GPTQ quantization on Custom dataset

Following the tutorial from optimum, it says a list of strings can be passed as a dataset, but my question is: how do I pass a dataset that has multiple columns?

dataset = ["auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."]
quantization = GPTQConfig(bits=4, dataset=dataset, tokenizer=tokenizer)

Any further progress on this? I am facing the same issue.

I haven’t looked into this much; @VikrantKadam will look into it later.

As far as I can tell from the docs, the dataset argument of GPTQConfig expects either the name of a built-in calibration dataset (e.g. "c4") or a list of plain strings, where each string is one calibration sample — it does not accept dictionaries. So for a dataset with multiple columns, the usual approach is to flatten each row into a single string first, typically by keeping just the text column (or concatenating the columns you care about).

Here’s an example (the dataset name and column names are placeholders for your own data):

from datasets import load_dataset

# Load a multi-column dataset (e.g. columns: "text", "label", ...)
raw = load_dataset("your_dataset_name", split="train")

# GPTQ calibration only needs raw text, so extract the text column
# as a list of strings and drop the other columns.
dataset = [row["text"] for row in raw.select(range(128))]

quantization = GPTQConfig(bits=4, dataset=dataset, tokenizer=tokenizer)

If the relevant text is spread across several columns, join them into one string per row before building the list. Extra columns such as labels can simply be dropped — they play no role in quantization.
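A self-contained sketch of that flattening step, without any download (the rows and column names here are made up for illustration):

```python
# Hypothetical multi-column rows; "title", "body", and "label" are
# example column names, not a schema required by GPTQConfig.
rows = [
    {"title": "GPTQ", "body": "4-bit quantization of LLMs", "label": 1},
    {"title": "AWQ", "body": "activation-aware weight quantization", "label": 0},
]

# Join the text-bearing columns into one string per row and drop
# "label" — GPTQConfig only needs a list of strings for calibration.
calibration_texts = [f"{row['title']}: {row['body']}" for row in rows]

print(calibration_texts[0])  # GPTQ: 4-bit quantization of LLMs
```

The resulting calibration_texts list is exactly the shape GPTQConfig's dataset argument expects.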