It appears that OWLv2 is much slower than OWL-ViT:
OWL-ViT:
pipeline = transformers.pipeline(model="google/owlvit-base-patch32", task="zero-shot-object-detection", device=device)
print("Total tiles:", total_tiles)
def yield_inputs():
for tile in tiles:
yield {
"image": Image.fromarray(tile.tile).convert("RGB"),
"candidate_labels": text_queries
}
outputs = pipeline(yield_inputs())
print(len(list(outputs)))
Results in:
Total tiles: 440
CPU times: user 1min 32s, sys: 560 ms, total: 1min 33s
Wall time: 25.6 s
While OWLv2:
pipeline = transformers.pipeline(model="google/owlv2-base-patch16-ensemble", task="zero-shot-object-detection", device=device)
print("Total tiles:", total_tiles)
def yield_inputs():
for tile in tiles:
yield {
"image": Image.fromarray(tile.tile).convert("RGB"),
"candidate_labels": text_queries
}
outputs = pipeline(yield_inputs())
print(len(list(outputs)))
Gives:
Total tiles: 440
CPU times: user 4min 11s, sys: 230 ms, total: 4min 11s
Wall time: 3min 3s
The biggest difference between these two pre-trained checkpoints seems to be the patch size. If I understand correctly, the larger patch size means fewer patches are processed per image, and therefore inference is faster. Is that correct? And is there a reason OWLv2 doesn't have a patch-32 model?
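For what it's worth, here is my rough patch-count math. The input resolutions below are my assumption about the two checkpoints' default processor configs (768x768 for owlvit-base-patch32, 960x960 for owlv2-base-patch16-ensemble), so worth double-checking:

# Back-of-the-envelope patch counts for the two checkpoints.
# The input resolutions are assumptions; confirm against the
# processor configs of each checkpoint.
checkpoints = {
    "google/owlvit-base-patch32": (768, 32),
    "google/owlv2-base-patch16-ensemble": (960, 16),
}

for name, (image_size, patch_size) in checkpoints.items():
    side = image_size // patch_size  # patches along one side
    print(f"{name}: {side}x{side} = {side * side} patches")

# google/owlvit-base-patch32: 24x24 = 576 patches
# google/owlv2-base-patch16-ensemble: 60x60 = 3600 patches

If those resolutions are right, OWLv2 pushes roughly 6x more tokens through the ViT per tile, and self-attention cost grows quadratically with token count, which would line up with the slowdown I measured above.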