
[Python] How to use parallel torch cuda streams without causing oom? (example included)

Discussion in 'Python' started by Stack, September 11, 2024.

  1. Stack

    Stack Participating Member

    I'm storing a large number of tensors in CPU memory, and the intended workflow is to process them on the GPU: while one chunk is being processed, the previous chunk's result is transferred back to the CPU and, at the same time, the next chunk is transferred to the GPU, so the GPU never has to wait on a synchronous transfer.
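    For reference, the pattern I'm describing looks roughly like this (a simplified sketch with illustrative names like up, down and device_buf, not my actual code), assuming pinned CPU buffers so the copies can actually run asynchronously:

    import torch

    gpu = torch.device('cuda')
    up = torch.cuda.Stream(device=gpu)     # host -> device copies
    down = torch.cuda.Stream(device=gpu)   # device -> host copies
    compute = torch.cuda.current_stream(gpu)

    n = 8
    size = 1_000_000
    chunks = [torch.rand(size).pin_memory() for _ in range(n)]    # pinned inputs
    results = [torch.empty(size).pin_memory() for _ in range(n)]  # pinned outputs
    device_buf = [None] * n

    for j in range(n):
        # enqueue the upload of chunk j on the upload stream
        with torch.cuda.stream(up):
            device_buf[j] = chunks[j].to(gpu, non_blocking=True)

        # compute on chunk j starts only after its upload has finished;
        # the host runs ahead and enqueues the next iteration's upload,
        # so upload j+1 can overlap with compute j on the GPU
        compute.wait_stream(up)
        device_buf[j].mul_(1.01).add_(1.01)

        # the download starts only after compute on chunk j has finished
        down.wait_stream(compute)
        with torch.cuda.stream(down):
            results[j].copy_(device_buf[j], non_blocking=True)
            # the buffer is used on `down`, so its memory must not be
            # handed out again before this copy completes
            device_buf[j].record_stream(down)

    torch.cuda.synchronize()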

    For this I used the default CUDA stream for processing and two additional streams for the parallel asynchronous transfers. In the actual application, whenever I used the s2 stream (instead of the default stream) to copy tensors onto the GPU, it did increase the speed, but it also caused a quick and steady rise in GPU memory until it overflowed.
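    One documented detail that might be related (I'm not sure it explains my OOM): Tensor.record_stream tells the caching allocator not to reuse the tensor's memory until all work queued on the recorded stream at that point has completed, so frees are deferred while the recorded stream is backed up. A toy illustration, not my actual code:

    import torch

    gpu = torch.device('cuda')
    s2 = torch.cuda.Stream(device=gpu)

    for step in range(100):
        x = torch.empty(16 * 2**20, device=gpu)  # 64 MiB scratch block (float32)
        x.record_stream(s2)  # mark x as "in use" on s2
        del x  # the block becomes reusable only after s2 drains past this
               # point; if s2 is saturated with pending copies, deferred
               # blocks pile up and memory climbs even though nothing leaks
        if step % 10 == 0:
            s2.synchronize()  # bounding the backlog caps the growth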

    I tried to reproduce this behavior but couldn't; the example below doesn't seem to cause memory issues. Still, I suspect I'm using the streams incorrectly somehow, so I'm hoping that someone who has worked with streams before can spot the error. In the current example s2 waits for s1, but every combination I tried fails in the actual application.

    import torch
    from time import perf_counter
    from threading import Thread

    cpu = torch.device('cpu')
    gpu = torch.device('cuda')

    _range = range(10)
    tensors = [torch.rand(100000000, device=cpu) for i in _range]

    s1 = torch.cuda.Stream(device=gpu)
    s2 = torch.cuda.Stream(device=gpu)

    for i in range(10000000000):
        time_start = perf_counter()

        for j in range(-1, 11):
            def PROCESS():
                # process chunk j on the default stream
                if j in _range:
                    k = tensors[j]
                    for l in range(10):
                        k.mul_(1.01)
                        k.add_(1.01)
                        k.pow_(0.5)

            def CPU():
                # transfer the previous chunk's result back to the cpu
                if j - 1 in _range:
                    with torch.cuda.stream(s1):
                        p = tensors[j - 1]
                        p.record_stream(s1)
                        p.data = p.data.to(device=cpu, memory_format=torch.preserve_format, non_blocking=True)

            def GPU():
                # prefetch the next chunk onto the gpu
                if j + 1 in _range:
                    # using the second stream causes oom in the actual application
                    with torch.cuda.stream(s2):
                        g = tensors[j + 1]
                        g.data = g.data.to(device=gpu, memory_format=torch.preserve_format, non_blocking=True)
                        g.record_stream(s2)
                        # note: for s2, record_stream is called after the copy,
                        # because the tensor starts on the cpu and record_stream
                        # cannot be called on a cpu tensor

            t2 = Thread(target=CPU)
            t2.start()
            t3 = Thread(target=GPU)
            t3.start()
            t1 = Thread(target=PROCESS)
            t1.start()

            s1.wait_stream(torch.cuda.default_stream(gpu))
            s2.wait_stream(s1)
            s1.synchronize()

            t2.join()
            t3.join()
            t1.join()

        elapsed = perf_counter() - time_start
        time_duration = "%.5f sec/it" % elapsed

        print(f"\rspeed: {time_duration}", end="\r")


    You can run the example and see the result for yourself: it runs faster when stream s2 is used, and slower if that part is commented out. The actual app is much more complicated, it uses thousands of these tensors and the loop is not straightforward, but using the s2 stream there always causes OOM.
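    To watch the rise while the loop runs, the allocator counters can be polled between chunks (a generic debugging aid, nothing specific to this code):

    allocated = torch.cuda.memory_allocated(gpu) / 2**20
    reserved = torch.cuda.memory_reserved(gpu) / 2**20
    print(f"allocated: {allocated:.0f} MiB, reserved: {reserved:.0f} MiB")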

    Continue reading...
