How to fine-tune a pre-trained model and then get the embeddings?

I would like to fine-tune a pre-trained model. This is the model:

from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

This is the data (I know it is not clinical but let’s roll with it for now):

import pandas as pd
from fastai.datasets import untar_data, URLs
path = untar_data(URLs.IMDB_SAMPLE)
df = pd.read_csv(path/'texts.csv')

How can I fine-tune the above model with this data? I know the answer is here but I cannot figure it out.

I would then like to extract the embeddings. I tried model.last_hidden_state (having seen outputs.last_hidden_state in examples), but that does not work either.

Please, before asking questions look on the internet for a minute or two. This is a VERY common use case, as you may have expected. It takes us too much time to keep repeating all the same questions. Thanks.

The first hit that I got on Google already gives you a tutorial on fine-tuning: Fine-tuning a pretrained model — transformers 4.10.1 documentation

Second: Fine-tuning with custom datasets — transformers 4.10.1 documentation

Notebooks: 🤗 Transformers Notebooks — transformers 4.10.1 documentation

Of course, you cannot get the last hidden states as an attribute of the model. You first need to do a forward pass with some given data. From the output of that forward pass you can then extract the last hidden state.
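As a minimal sketch of such a forward pass: the example below uses a tiny, randomly initialised BERT (built from a made-up BertConfig) and hand-written token ids so that it runs without downloading anything. Both are assumptions for illustration; in practice you would load the model with AutoModel.from_pretrained(...) and build the inputs with the tokenizer. The mean pooling at the end is just one common way to turn token vectors into a sentence vector.

```python
import torch
from transformers import BertConfig, BertModel

# Tiny random BERT so the sketch runs offline (hypothetical sizes).
# Real use: model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=64)
model = BertModel(config)
model.eval()

# Stand-in for tokenizer(texts, padding=True, return_tensors="pt"); 0 is the pad id.
input_ids = torch.tensor([[2, 15, 27, 3, 0, 0],
                          [2, 8, 9, 10, 11, 3]])
attention_mask = (input_ids != 0).long()

# The hidden states only exist on the output of a forward pass, not on the model.
with torch.no_grad():
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)

last_hidden_state = outputs.last_hidden_state  # (batch, seq_len, hidden_size)
pooler_output = outputs.pooler_output          # (batch, hidden_size)

# Mean-pool the token vectors, ignoring padding positions.
mask = attention_mask.unsqueeze(-1).float()
sentence_embeddings = (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(last_hidden_state.shape, pooler_output.shape, sentence_embeddings.shape)
```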


I want something similar: I need sentence embeddings, and before using a BERT model I want to fine-tune it on my specific domain and then get the embeddings (pooler_output from AutoModel).

I’ve checked the links you posted, but there the fine-tuning is done in a supervised fashion (sequence classification). I want my BERT model to simply adapt to my data, so I want to fine-tune it on plain text with no labels. I followed the approach below, but I don’t think it’s correct:

This was my plan

  1. Fine-tune a masked language model on the specific domain
  2. Load the fine-tuned model into AutoModel and get the embeddings
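The two steps above could be sketched roughly as follows. This is an illustration under loud assumptions, not the code from this post: the model is a tiny randomly initialised BERT, the token ids and the MASK_ID value are made up, masking is done by hand for two positions instead of with DataCollatorForLanguageModeling, and a single manual optimiser step stands in for a full Trainer run.

```python
import tempfile
import torch
from transformers import BertConfig, BertForMaskedLM, BertModel

torch.manual_seed(0)

# Tiny random BERT (hypothetical sizes); a real run would start from
# AutoModelForMaskedLM.from_pretrained(<your checkpoint>).
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=64)
mlm_model = BertForMaskedLM(config)

# Made-up token ids standing in for tokenized domain text (0 = padding).
MASK_ID = 4  # hypothetical [MASK] id for this toy vocabulary
input_ids = torch.tensor([[2, 15, 27, 33, 3, 0],
                          [2, 8, 9, 10, 11, 3]])
attention_mask = (input_ids != 0).long()

# Step 1: one masked-language-model update.
# Labels are -100 (ignored by the loss) except at the masked positions.
labels = torch.full_like(input_ids, -100)
masked_inputs = input_ids.clone()
for row, col in [(0, 2), (1, 3)]:
    labels[row, col] = input_ids[row, col]
    masked_inputs[row, col] = MASK_ID

optimizer = torch.optim.AdamW(mlm_model.parameters(), lr=5e-5)
mlm_model.train()
loss = mlm_model(input_ids=masked_inputs,
                 attention_mask=attention_mask,
                 labels=labels).loss
loss.backward()
optimizer.step()

# Step 2: save the fine-tuned weights and reload only the encoder.
with tempfile.TemporaryDirectory() as tmp_dir:
    mlm_model.save_pretrained(tmp_dir)
    encoder = BertModel.from_pretrained(tmp_dir)

encoder.eval()
with torch.no_grad():
    hidden = encoder(input_ids=input_ids,
                     attention_mask=attention_mask).last_hidden_state

# Mean-pool the token vectors (ignoring padding) into one vector per sentence.
mask = attention_mask.unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)
```

The sketch pools last_hidden_state rather than reading pooler_output, because a masked-LM checkpoint does not contain the pooler weights.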

However, the embeddings I got from the second step don’t seem to be correct.
(I did a simple check: for a particular sentence, I computed the cosine similarity between the embedding from step 2 alone and the embedding from step 2 preceded by step 1. The similarity was low, which I don’t think it should be, since I only fine-tuned on a small dataset of 500 data points.)

Code -

from transformers import AutoModelForMaskedLM
from transformers import AutoTokenizer

import datasets
from datasets import load_dataset

# Note: the lines that create bert_tokenizer and the masked-LM model are missing from the post.
dataset = load_dataset("csv", data_files='.fintech_inputs_n500.csv', split=datasets.Split.TRAIN)
# Tokenize every row, padding each example to the maximum length
tok_oup = dataset.map(lambda x: bert_tokenizer(x['text'], padding='max_length'), batched=True)
tok_oup = tok_oup.remove_columns('Unnamed: 0')
tok_oup.set_format("torch", columns=["input_ids", 'token_type_ids', 'attention_mask'])

from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=bert_tokenizer, mlm=True, mlm_probability=0.15
)

from transformers import Trainer, TrainingArguments
training_args = TrainingArguments("test-trainer3")

trainer = Trainer(
    # the rest of this call is missing from the post
)

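# Step 2: load the fine-tuned masked-LM checkpoint into AutoModel
# (the loading code is missing from the post)
from transformers import AutoModel

# Step 2 without step 1: load the original pre-trained checkpoint into AutoModel
# (the loading code is missing from the post)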

inputs=[ "no you say that if i make a late payment there is no late fee"]
inputs=bert_tokenizer(inputs, padding='max_length', return_tensors='pt')

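from torch.nn import CosineSimilarity
cos = CosineSimilarity(dim=0, eps=1e-6)
# auto_pooler and Masked_auto_pooler are presumably the pooler_output tensors of the
# two models; the forward passes that produce them are missing from the post
cos(auto_pooler[0], Masked_auto_pooler[0])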
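One possible explanation for the low similarity, offered as an assumption since the thread does not confirm it: BertForMaskedLM is built without the pooling layer, so a checkpoint saved from it carries no pooler weights, and loading it into the base model leaves the pooler freshly (randomly) initialised. pooler_output would then be unrelated to the original model's, even if the encoder barely changed. The sketch below checks this on a tiny randomly initialised BERT (hypothetical sizes and token ids): the same masked-LM checkpoint is loaded twice under different random seeds, and the two copies are compared.

```python
import tempfile
import torch
from transformers import BertConfig, BertForMaskedLM, BertModel

# Tiny random BERT so the check runs offline (hypothetical sizes).
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=64)
torch.manual_seed(0)
mlm_model = BertForMaskedLM(config)  # has no pooler layer

# Made-up token ids standing in for a tokenized sentence (0 = padding).
input_ids = torch.tensor([[2, 15, 27, 33, 3, 0]])
attention_mask = (input_ids != 0).long()

# Load the same masked-LM checkpoint into the base model twice.
with tempfile.TemporaryDirectory() as tmp_dir:
    mlm_model.save_pretrained(tmp_dir)
    torch.manual_seed(1)
    encoder_a = BertModel.from_pretrained(tmp_dir)
    torch.manual_seed(2)
    encoder_b = BertModel.from_pretrained(tmp_dir)

def embed(model):
    """Return (last_hidden_state, pooler_output) for the fixed input."""
    model.eval()
    with torch.no_grad():
        out = model(input_ids=input_ids, attention_mask=attention_mask)
    return out.last_hidden_state, out.pooler_output

hidden_a, pooled_a = embed(encoder_a)
hidden_b, pooled_b = embed(encoder_b)

# Encoder weights come from the checkpoint; pooler weights do not.
same_hidden = torch.allclose(hidden_a, hidden_b, atol=1e-6)
same_pooled = torch.allclose(pooled_a, pooled_b, atol=1e-6)
print("last_hidden_state identical:", same_hidden)
print("pooler_output identical:", same_pooled)
```

If this is what happened, comparing embeddings pooled from last_hidden_state (for example the [CLS] vector or a mean over tokens) would be a fairer check than comparing pooler_output.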