Why am I not getting the exact output of 4-bit quantization using NF4?

I was working through 4-bit quantization using this article.

To make the explanation clear, the author provides two implementations:

  1. the first one, written from scratch,
  2. and the second one, using the bitsandbytes library.

I got the exact same output as the article using the first one (implemented from scratch), but I did not get the same output using the bitsandbytes library.

My code link: Google Colab

Can anyone tell me the reason behind this?

!pip install -U bitsandbytes

There may be some mathematically significant reason for the difference, but if the output is simply different, it could just be a difference in library versions.

Since neither the article's author nor the Colab code pins a version, the most recent stable release gets installed. That is the version least likely to cause trouble in practical use, but it guarantees nothing about producing identical output.
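
As a first step, it is worth logging the versions that actually resolved in the Colab session, so runs can be compared later. A minimal sketch:

from importlib.metadata import version

# Record the exact library versions this session installed
print("bitsandbytes:", version("bitsandbytes"))
print("torch:", version("torch"))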
There have been real cases in the past where even a format strictly defined in a specification was not followed by the vendor's own implementation.
Isn't it most likely that the article's author, the old bitsandbytes, and the current bitsandbytes are all broadly correct but slightly different from one another? Otherwise one or more of them is buggy, but a bug causing major practical problems would have been noticed by users by now.
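
For reference, the NF4 format itself is small enough to check by hand. Here is a minimal from-scratch sketch of one NF4 block round-trip, assuming the 16 codebook values published with the QLoRA paper; nf4_roundtrip is just an illustrative name, not the article's or the library's function:

import torch

# The 16 NF4 codebook values (as published with the QLoRA paper)
NF4_CODES = torch.tensor([
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
])

def nf4_roundtrip(block):
    # Scale the block into [-1, 1] by its absolute maximum
    absmax = block.abs().max()
    normalized = block / absmax
    # Snap every element to the index of the nearest codebook entry
    idx = (normalized.unsqueeze(-1) - NF4_CODES).abs().argmin(dim=-1)
    # Dequantize by looking up the code and scaling back with absmax
    return NF4_CODES[idx] * absmax

If two implementations disagree on this, the first things to compare are the codebook values themselves and the block size used for absmax scaling.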

The core code seems to be unchanged from 7 months ago, which makes a version difference seem unlikely, but library behavior is an unpredictable thing.

There is a massive difference in the output. Still, I am unable to figure out why.

If so, it is possible that either the article author's implementation or the bitsandbytes implementation does not follow the theory, so that the conversion and inverse conversion round-trip, but would not work in an actual model…?
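
One way to narrow this down is to round-trip a single block through bitsandbytes and through the from-scratch code on the same input, then compare element-wise. A sketch, assuming a CUDA runtime and that quantize_4bit / dequantize_4bit from bitsandbytes.functional are available in the installed version (nf4_roundtrip is the from-scratch helper sketched above):

import torch
import bitsandbytes.functional as F

torch.manual_seed(0)
x = torch.randn(64, device="cuda")  # one 64-element block

# Round-trip through bitsandbytes NF4
q, state = F.quantize_4bit(x, blocksize=64, quant_type="nf4")
x_bnb = F.dequantize_4bit(q, state, blocksize=64, quant_type="nf4")

# Round-trip through the from-scratch sketch on the same data
x_scratch = nf4_roundtrip(x.cpu())

# If the two formats agree, this should be ~0 (up to float rounding)
print("max |bnb - scratch|:", (x_bnb.cpu().flatten() - x_scratch).abs().max().item())

If the difference is large here, the formats themselves disagree; if it is near zero, the discrepancy is more likely in how the article prints or reshapes the results.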

NF4 is a format that is getting a lot of attention, so there may be others besides that author who have tried to analyze it independently. Such a sample might provide a clue to the cause of the problem.
Alternatively, you could try running it on an actual model, but since official torch support still seems to have some issues, that approach may be harder to interpret, because other problems could get mixed in.