LLAMA-2 Named Entity Recognition

Is LLAMA-2 a good choice for named entity recognition?
Is there an example I can follow for using PEFT on LLAMA-2 for NER?
Thanks!

I am also interested in this question. Did you find an answer?

I have tried it, but inference on the test set after fine-tuning requires a huge GPU. Before fine-tuning, the model returns entities that are not from the custom document, and many times it gives the pretrained model's generic output when we pass it the prompt.

@rajat-saxena Llama 2 and other open source language models are great for NER. I’ve seen personal success using both Llama 2 and even better results with Mistral. I’m currently experimenting with Yi, as it is the SOTA weights-public foundation model for reading comprehension.

The authors of https://arxiv.org/pdf/2310.01208.pdf found that a fine-tuned Llama 2 7B beat the previous SOTA (RoBERTa-Large) on all tasks. They utilize LoRA; the official code is at GitHub - 4AI/LS-LLaMA: Official Code For Label Supervised LLaMA Finetuning. Worth mentioning that they achieve their best results by removing the causal mask (unLLaMA), but that may be out of scope for your project, as supporting inference would require additional work.

If you have any further general questions, feel free to respond. If you want in-depth support, you may contact me.

@mayur456 it sounds like you aren’t actually merging your adapters and instead you’re using the base model, or your model wasn’t actually trained for another reason. How does your model score against the baseline? What are your training parameters?


@lapp0 I tried the default training parameters for fine-tuning both the 7B and 7B-chat base models. Due to GPU unavailability, I moved to a vector database (Pinecone) for a single document; it gave good results with less GPU, but crucially, not all named entities were captured from the complex document. I then experimented with the OpenAI API and got good results with it. But the question remains about fine-tuning with minimal GPU.
Thank you so much for these resources: https://arxiv.org/pdf/2310.01208.pdf and GitHub - 4AI/LS-LLaMA: Official Code For Label Supervised LLaMA Finetuning
I will go through them and update you on the results I get with these new resources for my business use case.

@lapp0 Can you let me know what your hardware configuration was for this Llama-2 experiment extracting named entities from complex documents?

It depends on your sequence length. I trained on up to 8192 tokens (effectively 4096 w/ Mistral) which resulted in about 70GB VRAM during training. Therefore I needed A100 80GB, H100, or multiple GPUs. I went with a cloud H100.

If you limit to 512 tokens you can probably get away with a single consumer grade GPU with 24GB VRAM.
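As a back-of-envelope for why sequence length dominates here, a very rough LoRA-style estimate (frozen bf16 weights, negligible optimizer state). The constants are my own crude assumptions, not measured values, and real usage can differ by a wide margin:

```python
def estimate_lora_vram_gb(n_params_b=7.0, seq_len=4096, batch_size=1,
                          hidden=4096, n_layers=32):
    """Very rough VRAM estimate for LoRA fine-tuning with frozen bf16 weights.

    Ignores gradient checkpointing, CUDA overhead, fragmentation, etc. --
    treat this as intuition only, not a capacity plan.
    """
    weights = n_params_b * 1e9 * 2                          # frozen bf16 base
    # crude activation term: ~16 bytes per token per hidden unit per layer
    activations = batch_size * seq_len * hidden * n_layers * 16
    return (weights + activations) / 1e9

vram_512 = estimate_lora_vram_gb(seq_len=512)    # comfortably under 24 GB
vram_8192 = estimate_lora_vram_gb(seq_len=8192)  # roughly double the short case
```

The fixed weight cost stays constant while the activation term scales linearly with sequence length, which is why a 512-token limit makes consumer cards viable.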

Actually, I once got access to a Google Colab A100 40GB and tried token sizes of 512 and 2000. I got some results, but they were heavily hallucinated. After I disconnected, it was difficult to get access to an A100 again. For legal documents we can't keep the token size as small as 512; if we do, the model loses the document's context, because it is a complex document, and Llama-2 hallucinates after fine-tuning. I don't know what an A100 80GB costs per hour; if it is cheap, that's fine. Since getting access was difficult, I went with the OpenAI API and the LangChain framework, and the cost was lower than the GPUs offered by Google Colab. Can you please let me know the hourly GPU cost for fine-tuning at whatever token size you chose?

The H100 rented for $4.70/h, though I'm experimenting with a smaller context window and a larger dataset on a cheaper GPU.

Are you using a proprietary dataset? How many samples does it have? Is it high quality?

If you're constrained by memory, you might consider either fine-tuning a BERT model designed for NER (e.g. GitHub - Michael-Beukman/NERTransfer: Investigating transfer learning in low-resourced languages, specifically in a named entity recognition (NER) task (IJCNLP-AACL 2023). http://arxiv.org/abs/2309.05311 and GitHub - Spico197/Mirror: 🪞A powerful toolkit for almost all the Information Extraction tasks.), or splitting your documents into overlapping 512-token chunks.

It cost me roughly 100 H100-hours (~$500) to finetune on ~250,000 samples ranging from 50 to 8192 tokens for 2 epochs.
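A sketch of the overlapping-chunk idea, operating on an already-tokenized list (the 64-token overlap is an assumed value you'd tune so entities spanning a boundary appear whole in at least one chunk):

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into overlapping fixed-size chunks."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

tokens = [f"t{i}" for i in range(1200)]
chunks = chunk_tokens(tokens, chunk_size=512, overlap=64)
# 3 chunks; the second starts at token 448 (512 - 64), so the last 64 tokens
# of chunk 0 are repeated at the head of chunk 1.
```

At inference time you'd run NER per chunk and deduplicate entities found in the overlap region.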

@lapp0 Thank you so much for your kind insights about the cost required.
I am using a legal dataset of court judgements, which is complex in nature and has high context. I kept the token size at only 256 per sample and used only 50 training samples; that is, the dataset has 50 rows, each with around 256 tokens. I passed this dataset to the Llama-2 model for training. My major concern is the GPU; I think I need to upgrade to an A100.

Does a complex dataset consume more memory? Or do we need to clean it in a particular way?

I'm not sure 50 data points is enough. I'd recommend augmenting your dataset. Maybe 50 works, but I've never trained on so few.

For QLoRA with a 7B or 13B model, a 4090 may be fine. If you're OOMing, you'll naturally need to upgrade.

I’d focus on your data first and foremost.

Additionally, considering the size of your dataset, you might look into existing NER implementations using smaller BERT variants. Perhaps there is already a model that does NER which you can fine-tune.

Or you might take an existing dataset on Hugging Face, e.g. https://huggingface.co/datasets/jfrenz/legalglue, and prune the entities you're not interested in. I'm not sure what the nature of your task is exactly, but there are a few ways to supplement your data.
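A sketch of the pruning idea in pure Python, assuming BIO-style string tags (adapt the prefix handling to whatever label scheme the dataset actually uses):

```python
def prune_tags(tags, keep_types):
    """Rewrite BIO tags whose entity type is not in keep_types to 'O'."""
    pruned = []
    for tag in tags:
        if tag == "O":
            pruned.append(tag)
            continue
        # e.g. "B-ORG" -> prefix "B", entity type "ORG"
        _prefix, _, ent_type = tag.partition("-")
        pruned.append(tag if ent_type in keep_types else "O")
    return pruned

tags = ["B-ORG", "I-ORG", "O", "B-PER", "B-LOC"]
kept = prune_tags(tags, {"PER"})  # -> ['O', 'O', 'O', 'B-PER', 'O']
```

You could apply this per-row with `datasets`' `.map()` to derive a smaller, task-specific training set.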