Progress bar for HF pipelines

Hello everyone,

Is there a way to attach progress bars to HF pipelines? For example, in the summarization pipeline I often pass a dozen texts and would love to show the user how many texts have been summarized so far.

TIA,
Vladimir

Hello Vladimir :wave:

I saw this feature request where @Narsil says if you make your examples into a Hugging Face Dataset you can see the progress, like below:

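import tqdm
# Assumption: Dataset is torch.utils.data.Dataset; the snippet did not show its import.
from torch.utils.data import Dataset

class ListDataset(Dataset):
    def __init__(self, original_list):
        self.original_list = original_list

    def __len__(self):
        return len(self.original_list)

    def __getitem__(self, i):
        return self.original_list[i]

# `pipe` is an existing pipeline and `examples` a list of input texts (both defined elsewhere).
dataset = ListDataset(examples)

for out in tqdm.tqdm(pipe(dataset)):
    print(out)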

I don’t know of a way to do this without something like tqdm (note that it adds some extra complexity on top of your inference). Below is my code.

from tqdm import tqdm
from transformers import pipeline

generator = pipeline(task="text-generation")
examples = [
    "Three Rings for the Elven-kings under the sky, Seven for the Dwarf-lords in their halls of stone",
    "Nine for Mortal Men, doomed to die, One for the Dark Lord on his dark throne",
]

# Run one example per iteration so the bar advances after each text.
for example in tqdm(examples):
    generator(example)

Maybe @osanseviero knows a better way of doing this.

Hey @merve, thanks a bunch. Here is a small example Colab notebook

I couldn’t get this to work … could it be because my pipeline has a tokenizer built-in?

tokenizer = partial(AutoTokenizer.from_pretrained("results_820/checkpoint-10000/"), truncation=True)

def preprocess_data(data):
    encoding = tokenizer(data['text'], truncation=True)
    return encoding

model = AutoModelForSequenceClassification.from_pretrained("results_820/checkpoint-10000/", num_labels=2)

pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer)

validation_df = pd.read_csv("validation_set.csv")
validation_dataset = Dataset.from_pandas(validation_df)

for out in tqdm.tqdm(pipe(validation_dataset["text"])):
    print(out)

@afriedman412, you need to wrap your data in a torch Dataset, not a Hugging Face dataset.

@vblagoje @afriedman412 I’m stuck on the same problem. I have a Hugging Face dataset, built from a pandas dataframe, where each text example that I want to predict on has an id (i.e. dataset['test'][index]['text']). Could you guide me with an example of how to get a torch dataset? Thank you!
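
A minimal sketch of one way to do the wrapping asked about here, assuming the data is an indexable table whose rows behave like dicts (as in dataset['test'][index]['text']). The names TextColumnDataset and predict_with_progress are hypothetical helpers, not part of any library, and `pipe` stands for whatever pipeline object you already have. transformers also ships a similar helper, KeyDataset, in transformers.pipelines.pt_utils, which may be all you need.

```python
from torch.utils.data import Dataset
from tqdm import tqdm


class TextColumnDataset(Dataset):
    """Hypothetical wrapper exposing one column of an indexable table as a torch Dataset."""

    def __init__(self, rows, column="text"):
        self.rows = rows        # e.g. a Hugging Face dataset split such as dataset["test"]
        self.column = column    # name of the column holding the input text

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, i):
        # Return only the text; ids and other columns stay in `rows`.
        return self.rows[i][self.column]


def predict_with_progress(pipe, rows, column="text"):
    """Run `pipe` over one column of `rows`, advancing a tqdm bar per result."""
    dataset = TextColumnDataset(rows, column)
    results = []
    # Assumption: `pipe` returns an iterator when given a torch Dataset,
    # so results arrive one at a time and the bar can advance.
    for out in tqdm(pipe(dataset), total=len(dataset)):
        results.append(out)
    return results
```

If the results come back in input order, which is what I would expect here, they can be paired with the ids afterwards by position, for example with zip over the id column and the results.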

One workaround which I used:

results = []
CHUNK_SIZE = 100
for chunk in tqdm(range(test_df.shape[0] // CHUNK_SIZE + 1)):
    descr = test_df[CHUNK_SIZE * chunk: (CHUNK_SIZE+1) * chunk]['description'].to_list()
    res = nlp(descr)
    results += res

Nice, this comment by @Maiia was very helpful. Thanks, it made a 140% difference in the execution time of my code.

Also, adding device_map="auto" to the pipeline object ensures that the code takes advantage of whatever hardware configuration you have. At least, that has been my experience so far.

This is very helpful and solved my problem getting a tqdm progress bar working with an existing pipeline as well. One note:

I think the calculation of the data range based on chunk and CHUNK_SIZE is off. It should look something more like:

descr = test_df[(CHUNK_SIZE * chunk) : (CHUNK_SIZE * chunk) + CHUNK_SIZE]['description'].to_list()

Either way, thanks again @Maiia for the excellent template.

It could simply be

descr = test_df[(CHUNK_SIZE * chunk) : CHUNK_SIZE * (chunk + 1)]['description'].to_list()

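The problem was that the original factored out chunk rather than CHUNK_SIZE, i.e. it used (CHUNK_SIZE+1) * chunk instead of CHUNK_SIZE * (chunk + 1). Btw, it still complains about not using a Dataset. If someone finds a way to get a progress bar that is less hacky than this, please post it :slight_smile: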
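
For reference, the corrected chunking can be sketched as a small helper. This is an illustration under assumptions, not the original poster's code: run_in_chunks is a hypothetical name, `predict` stands for any callable that takes a list and returns a list of results (such as the nlp pipeline above), and the tqdm fallback exists only so the sketch runs where tqdm is not installed.

```python
from math import ceil

try:
    from tqdm import tqdm
except ImportError:
    # Fallback so the sketch still runs without tqdm; it shows no bar.
    def tqdm(iterable, **kwargs):
        return iterable


def run_in_chunks(predict, items, chunk_size=100):
    """Call `predict` on consecutive slices of `items`, one progress tick per chunk."""
    results = []
    # ceil avoids an empty trailing chunk when len(items) is a multiple of chunk_size.
    n_chunks = ceil(len(items) / chunk_size)
    for chunk in tqdm(range(n_chunks)):
        start = chunk * chunk_size
        batch = items[start:start + chunk_size]  # same as chunk_size * (chunk + 1) for the end
        results += predict(batch)
    return results
```

With a dataframe this would be called on a plain list, e.g. the 'description' column converted with to_list(). Using ceil instead of `// CHUNK_SIZE + 1` is a small design choice: it skips the final empty slice that the original loop produces when the row count divides evenly.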