Codellama will not stop generating at EOS

System Info

  • transformers version: 4.36.0.dev0
  • Platform: Linux-5.4.0-132-generic-x86_64-with-glibc2.31
  • Python version: 3.11.3
  • Huggingface_hub version: 0.19.4
  • Safetensors version: 0.3.3
  • Accelerate version: 0.25.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.0.1+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: A100
  • Using distributed or parallel set-up in script?: DeepSpeed ZeRO Stage 3; 7 GPUs data parallelism training.

Reproduction

Hey! Could anyone help me figure out the reason for this very weird issue? Thanks a lot!

I am using some GPT-4-generated answers to finetune the codellama-13b model.
One data example from my dataset looks like this (the others have a similar format):
The original fortran code: program DRB093_doall2_collapse_orig_no\n use omp_lib\n use DRB093\n implicit none\n\n integer :: len, i, j\n len = 100\n\n allocate (a(len,len))\n\n !$omp parallel do collapse(2)\n do i = 1, len\n do j = 1, len\n a(i,j) = a(i,j)+1\n end do\n end do\n !$omp end parallel do\nend program.

The translated C++ code: #include <stdio.h>\nint a[100][100];\nint main()\n{\n int i,j;\n#pragma omp parallel for collapse(2)\n for (i=0;i<100;i++)\n for (j=0;j<100;j++)\n a[i][j]=a[i][j]+1;\n return 0;\n}\n\n

I used the supervised finetuning scripts from DeepSpeed (DeepSpeedExamples/applications/DeepSpeed-Chat/training at master · microsoft/DeepSpeedExamples · GitHub) to finetune codellama-13b.

And my inference script looks like this:

from transformers import AutoModelForCausalLM, AutoConfig, CodeLlamaTokenizer

dump_device = f'cuda:{device_num}'
model_config = AutoConfig.from_pretrained(model_name_or_path)
model_class = AutoModelForCausalLM.from_config(model_config)
model = model_class.from_pretrained(model_name_or_path,
                                    from_tf=bool(".ckpt" in model_name_or_path),
                                    config=model_config).to(dump_device)
tokenizer = CodeLlamaTokenizer.from_pretrained(model_name_or_path, fast_tokenizer=True)
model.config.end_token_id = tokenizer.eos_token_id
model.config.pad_token_id = model.config.eos_token_id
model.resize_token_embeddings(len(tokenizer))
tokenizer.pad_token = tokenizer.eos_token

if len(input_prompt) < max_prompt_length:
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    # attention_mask = input_ids.ne(tokenizer.pad_token_id).float().to(dump_device)
    outputs = model.generate(input_ids, max_length=max_tokens,
                             num_return_sequences=n,
                             eos_token_id=tokenizer.eos_token_id)
    generated_ans = []
    for i in range(n):
        predicted_answer = tokenizer.decode(outputs[i], skip_special_tokens=False)
        generated_ans.append(predicted_answer[len(input_prompt):])
else:
    print("This task is out of my max prompt length limitation!")

It looks like codellama-13b does pick up some new knowledge about Fortran-to-C++ translation from this dataset. It starts to give the correct translation instead of producing random text. But it keeps repeating the same answer over and over. One example looks like this:

My Question:

Please help me to translate the following fortran code to C++:
program DRB096_doall2_taskloop_collapse_orig_no\n use omp_lib\n use DRB096\n implicit none\n\n integer :: len, i, j\n len = 100\n\n allocate (a(len,len))\n\n !$omp parallel\n !$omp single\n !$omp taskloop collapse(2)\n do i = 1, len\n do j = 1, len\n a(i,j) = a(i,j)+1\n end do\n end do\n !$omp end taskloop\n !$omp end single\n !$omp end parallel\n\n print 100, a(50,50)\n 100 format ('a(50,50) =',i3)\n\nend program

Its answer:

Translated C++ code:
#include <stdio.h>\n#if (_OPENMP<201511)\n#error "An OpenMP 4.5 compiler is needed to compile this test."\n#endif\n\n#include <stdio.h>\nint a[100][100];\nint main()\n{\n int i, j;\n#pragma omp parallel\n {\n#pragma omp single\n {\n#pragma omp taskloop collapse(2)\n for (i = 0; i < 100; i++)\n for (j = 0; j < 100; j++)\n a[i][j]+=1; \n }\n }\n printf ("a[50][50]=%d\n", a[50][50]);\n return 0;\n}\n
<|enoftext|>
Translated C++ code:
#include <stdio.h>\n#if (_OPENMP<201511)\n#error "An OpenMP 4.5 compiler is needed to compile this test."\n#endif\n\n#include <stdio.h>\nint a[100][100];\nint main()\n{\n int i, j;\n#pragma omp parallel\n {\n#pragma omp single\n {\n#pragma omp taskloop collapse(2)\n for (i = 0; i < 100; i++)\n for (j = 0; j < 100; j++)\n a[i][j]+=1; \n }\n }\n printf ("a[50][50]=%d\n", a[50][50]);\n return 0;\n}\n
<|enoftext|>
Translated C++ code:
#include <stdio.h>\n#if (_OPENMP<201511)\n#error "An OpenMP 4.5 compiler is needed to compile this test."\n#endif\n\n#include <stdio.h>\nin

It includes a <|enoftext|> at the end of the correct generated answer and then keeps repeating the answer again and again until it reaches the max_length limit.

This is very weird, because <|enoftext|> is not part of the llama tokenizer at all; it is the EOS token used by GPT-4. For the llama tokenizer the EOS token is </s>. At first I thought my dataset might contain a lot of <|enoftext|> tokens, but I checked the whole dataset and there is actually no <|enoftext|> in it… And even if there were some <|enoftext|> in the dataset, I would still expect codellama to generate </s> at the appropriate place instead of repeating the same answer again and again. Does this mean I have to append a </s> at the end of each example in my dataset while finetuning the model? Or is there something wrong in my inference script? Could you also help to explain where this <|enoftext|> comes from? My dataset does not contain this token, and it is not in the llama tokenizer either… I am very confused about it…
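To make the question concrete, this is what I mean by appending </s> to the dataset during preprocessing (a sketch only; the fortran/cpp field names are made up, and I have not verified that this is the right fix):

from transformers import CodeLlamaTokenizer

tokenizer = CodeLlamaTokenizer.from_pretrained(model_name_or_path)

def build_training_text(example):
    # Append the llama EOS token (</s>) to every answer so the model
    # can learn to emit it once the translation is finished.
    prompt = f"The original fortran code: {example['fortran']}\n\n"
    answer = f"The translated C++ code: {example['cpp']}"
    return prompt + answer + tokenizer.eos_token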

Thanks a lot for all the help!

Expected behavior

I expect the codellama model to stop at the correct place instead of repeating the same answer and appending a <|enoftext|>.

Expected answer:

Translated C++ code:
#include <stdio.h>\n#if (_OPENMP<201511)\n#error "An OpenMP 4.5 compiler is needed to compile this test."\n#endif\n\n#include <stdio.h>\nint a[100][100];\nint main()\n{\n int i, j;\n#pragma omp parallel\n {\n#pragma omp single\n {\n#pragma omp taskloop collapse(2)\n for (i = 0; i < 100; i++)\n for (j = 0; j < 100; j++)\n a[i][j]+=1; \n }\n }\n printf ("a[50][50]=%d\n", a[50][50]);\n return 0;\n}\n
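In the meantime, the workaround I am considering is to stop generation whenever the literal <|enoftext|> string shows up, using a custom stopping criterion (a sketch; it assumes batch size 1 and just checks the decoded tail of the sequence):

from transformers import StoppingCriteria, StoppingCriteriaList

class StopOnString(StoppingCriteria):
    """Stop as soon as the decoded output ends with a given string."""
    def __init__(self, tokenizer, stop_string):
        self.tokenizer = tokenizer
        self.stop_string = stop_string

    def __call__(self, input_ids, scores, **kwargs):
        # Only decode the last few tokens to keep the check cheap.
        tail = self.tokenizer.decode(input_ids[0, -10:], skip_special_tokens=False)
        return self.stop_string in tail

stopping = StoppingCriteriaList([StopOnString(tokenizer, "<|enoftext|>")])
outputs = model.generate(input_ids, max_length=max_tokens,
                         eos_token_id=tokenizer.eos_token_id,
                         stopping_criteria=stopping)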

Codellama 70b (instruct) won’t stop generating when hitting stop on OpenWebUI

I am also having trouble with stuck generation in Ollama.
I am using codellama:70b with OpenWebUI. Specifically, it doesn’t stop generating text even after hitting the stop command.

The model starts and runs fine, but when I issue a stop command, it only stops once the maximum number of tokens has been returned. Even with the stop sequence set, it keeps generating text beyond the expected stopping point.

I am using the Ollama version of the model (ollama run codellama:70b-instruct-q3_K_M), but it seems like Ollama keeps generating text in the background when I hit stop. I tried various stop parameters, such as:

{"stop":["Source:", "Destination:", "\u003cstep\u003e", "|EOT|", "", "", "", "", ""]}

but to no avail.
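For reference, here is a sketch of how I would try the same stop sequences against Ollama’s HTTP API directly, to rule out OpenWebUI (the endpoint and option names are my understanding of the Ollama REST API; the prompt is just a placeholder):

import requests

# Sketch: ask the local Ollama server directly, with explicit stop strings.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "codellama:70b-instruct-q3_K_M",
        "prompt": "Translate this Fortran loop to C++: ...",
        "stream": False,
        "options": {
            "stop": ["Source:", "Destination:", "<step>", "|EOT|"],
            "num_predict": 512,  # hard cap on generated tokens
        },
    },
)
print(resp.json()["response"])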