OwlV2 significantly slower than OwlVit

It appears that OwlV2 is much slower than OwlVit:

OwlVit:

pipeline = transformers.pipeline(model="google/owlvit-base-patch32", task="zero-shot-object-detection", device=device)
print("Total tiles:", total_tiles)

def yield_inputs():
    for tile in tiles:
        yield {
            "image": Image.fromarray(tile.tile).convert("RGB"),
            "candidate_labels": text_queries
        }

outputs = pipeline(yield_inputs())
print(len(list(outputs)))

Results in:

Total tiles: 440
CPU times: user 1min 32s, sys: 560 ms, total: 1min 33s
Wall time: 25.6 s

While OwlV2:

pipeline = transformers.pipeline(model="google/owlv2-base-patch16-ensemble", task="zero-shot-object-detection", device=device)
print("Total tiles:", total_tiles)

def yield_inputs():
    for tile in tiles:
        yield {
            "image": Image.fromarray(tile.tile).convert("RGB"),
            "candidate_labels": text_queries
        }

outputs = pipeline(yield_inputs())
print(len(list(outputs)))

Gives:

Total tiles: 440
CPU times: user 4min 11s, sys: 230 ms, total: 4min 11s
Wall time: 3min 3s

It seems like the biggest difference between these two pre-trained checkpoints is the patch size. If I understand correctly, a larger patch size means the image is split into fewer patches, so there are fewer tokens to run through the model and inference is faster. Is that correct? And is there a reason OwlV2 doesn't have a 32x32 patch model?
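
For context, here's the back-of-the-envelope math behind my assumption. The 768 and 960 input sizes are what I believe the two image processors resize to by default, so treat those numbers as my assumption rather than something I verified in the configs:

# Rough patch-count comparison. The input sizes (768 and 960) are assumed
# processor defaults, not verified against the checkpoint configs.
checkpoints = {
    "owlvit-base-patch32": {"image_size": 768, "patch_size": 32},
    "owlv2-base-patch16-ensemble": {"image_size": 960, "patch_size": 16},
}

for name, cfg in checkpoints.items():
    per_side = cfg["image_size"] // cfg["patch_size"]
    print(f"{name}: {per_side}x{per_side} = {per_side ** 2} patches")

# owlvit-base-patch32: 24x24 = 576 patches
# owlv2-base-patch16-ensemble: 60x60 = 3600 patches

If that's right, OwlV2 pushes roughly 6x as many patches per tile through the ViT (and self-attention cost grows faster than linearly with patch count), which seems consistent with the roughly 7x wall-time gap I'm seeing above.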