Using a fine-tuned model for inference

Hello guys,

I am very new to this, so I apologise if this is a dumb question, but I am finding it surprisingly hard to simply use an already trained transformer model for inference.

Basically, I am trying to apply ClimateBERT to a dataset that I got from another author (ECOLEX_Legislation.csv).

I have run into quite a few issues along the way and managed to (mostly) solve them, I think. However, the code below has now been running for a few hours and I am not sure why, so I would appreciate any help with this.

from transformers import pipeline
from datasets import load_dataset
from transformers.pipelines.pt_utils import KeyDataset

# Load the CSV as a Hugging Face dataset
df = load_dataset("csv", data_files="ECOLEX_Legislation.csv", delimiter=",", split="train")
# Build the classification pipeline
pipe = pipeline("text-classification", model="climatebert/distilroberta-base-climate-sentiment")
# Stream the "Policy_Content" column through the pipeline
for out in pipe(KeyDataset(df, "Policy_Content"), truncation=True, max_length=512):
    print(out)

Note that the model seems to work fine if I pipe a single sentence through it; the problem only appears when I pipe it over the entire dataset. Secondly, I was wondering if there are any complete tutorials I could refer to (from loading the dataset to analysing and visualising the results).
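For reference, this is the kind of single-input call that returns a result for me almost immediately (the example sentence here is made up, just to illustrate):

from transformers import pipeline

pipe = pipeline("text-classification", model="climatebert/distilroberta-base-climate-sentiment")
# A made-up example sentence, only to show the single-input call that works
print(pipe("This law introduces incentives for renewable energy investment.", truncation=True, max_length=512))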

Thank you so much for reading this, and if you need more information, I am happy to elaborate.

Warm regards,
Yanith

Update: I am now trying to run the code below (just for the sake of trying something new). It has been running for 30 minutes already and I am not sure what is wrong. Also, even though I used tqdm, no progress bar is shown.

from tqdm.auto import tqdm
from transformers import pipeline
from datasets import load_dataset
from transformers.pipelines.pt_utils import KeyDataset

df = load_dataset("csv", data_files="ECOLEX_Legislation.csv", delimiter=",", split="train")

# Count rows where the column of interest is missing
missing_values = df.filter(lambda example: example["Policy_Content"] is None)
print("Number of missing values:", len(missing_values))

# Define a function that checks whether the column of interest is not missing
def is_not_missing(example):
    return example["Policy_Content"] is not None

# Apply this function to filter out rows with missing values
filtered_dataset = df.filter(is_not_missing)

pipe = pipeline("text-classification", model="climatebert/distilroberta-base-climate-sentiment")

# No separate tokenizer call is needed: the pipeline tokenizes internally,
# using the truncation/max_length arguments passed in the loop below
for out in tqdm(pipe(KeyDataset(filtered_dataset, "Policy_Content"), truncation=True, max_length=512), total=len(filtered_dataset)):
    print(out)
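In case it is relevant, here is a batched variant of the same loop that I might try next, collecting the outputs into a list instead of printing them. This is only a rough sketch: the batch_size of 8 is an arbitrary guess, and the device argument assumes a GPU that I may not have.

from tqdm.auto import tqdm
from transformers import pipeline
from datasets import load_dataset
from transformers.pipelines.pt_utils import KeyDataset

df = load_dataset("csv", data_files="ECOLEX_Legislation.csv", delimiter=",", split="train")
filtered_dataset = df.filter(lambda example: example["Policy_Content"] is not None)

pipe = pipeline(
    "text-classification",
    model="climatebert/distilroberta-base-climate-sentiment",
    # device=0,  # assumption: uncomment only if a GPU is available
)

# batch_size=8 is a guess; larger batches are faster but use more memory
results = []
for out in tqdm(pipe(KeyDataset(filtered_dataset, "Policy_Content"), batch_size=8, truncation=True, max_length=512), total=len(filtered_dataset)):
    results.append(out)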