I am running inference on multiple GPUs, but before that I have to encode the input video. Below is the code I use for that:
import ffmpeg
import numpy as np
import torch
from accelerate import Accelerator

# setup accelerator to use multiple GPUs
accelerator = Accelerator()
accelerator.state.num_processes = 2
accelerator.state.distributed_type = "MULTI_GPU"
device = accelerator.device

# get video stream
probe = ffmpeg.probe(args.video_example)
video_stream = next(
    (stream for stream in probe["streams"] if stream["codec_type"] == "video"), None
)

# resize to have smaller dimension = 224, but maintain aspect ratio
width = int(video_stream["width"])
height = int(video_stream["height"])
num, denum = video_stream["avg_frame_rate"].split("/")
frame_rate = int(num) / int(denum)
if height >= width:
    h, w = int(height * 224 / width), 224
else:
    h, w = 224, int(width * 224 / height)
assert frame_rate >= 1

# sample 1 frame per second, resize, then center-crop to 224x224
cmd = ffmpeg.input(args.video_example).filter("fps", fps=1).filter("scale", w, h)
x = int((w - 224) / 2.0)
y = int((h - 224) / 2.0)
cmd = cmd.crop(x, y, 224, 224)
out, _ = cmd.output("pipe:", format="rawvideo", pix_fmt="rgb24").run(
    capture_stdout=True, quiet=True
)
# preprocess video and shift preprocessed frames to GPU
# (preprocess and backbone are defined earlier in the script)
h, w = 224, 224
video = np.frombuffer(out, np.uint8).reshape([-1, h, w, 3])
video = torch.from_numpy(video.astype("float32"))
video = video.permute(0, 3, 1, 2)
video = video.squeeze()
video = preprocess(video)
with torch.no_grad():
    video = backbone.encode_image(video.to(device))
But I get a CUDA out-of-memory error, while one of the GPUs is completely unused:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.67 GiB (GPU 0; 14.76 GiB total capacity; 9.96 GiB already allocated; 1.90 GiB free; 11.87 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
According to the stack trace, the error originates from:
with torch.no_grad():
    video = backbone.encode_image(video.to(device))
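For scale, the input tensor by itself is only about 0.57 MiB per frame (3 * 224 * 224 float32 values), so I assume most of the 3.67 GiB allocation comes from activations inside encode_image when every frame is passed in one batch. A rough back-of-the-envelope check:

bytes_per_frame = 3 * 224 * 224 * 4     # float32 input, about 0.57 MiB per frame
print(3.67 * 2**30 / bytes_per_frame)   # about 6500 frames' worth of raw input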
Am I doing something wrong, or is it not possible to distribute the encoding across GPUs?
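In case it clarifies what I am after, here is a minimal, untested sketch of what I imagined the distributed version would look like. It assumes the script is started with accelerate launch --num_processes 2, and reuses the preprocessed video tensor (before the encode_image call) and the backbone model from above; backbone would have to be loaded or moved to accelerator.device in each process:

import torch
from accelerate import Accelerator

accelerator = Accelerator()                 # no manual state overrides
chunks = list(torch.split(video, 32))       # video: [N, 3, 224, 224] float32 on CPU

features = []
# split_between_processes hands each process its own subset of the chunks
with accelerator.split_between_processes(chunks) as my_chunks:
    with torch.no_grad():
        for chunk in my_chunks:
            feats = backbone.encode_image(chunk.to(accelerator.device))
            features.append(feats.cpu())
features = torch.cat(features)              # per-process result; still needs gathering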