
[Python] How to use parallel torch cuda streams without causing oom? (example included)

Discussion in 'Python' started by Stack, September 11, 2024.

  1. Stack

    Stack Participating Member

    I'm storing a large number of tensors in CPU memory, and the intended workflow is to process them on the GPU: while one chunk is being processed, the previous chunk's result is transferred back to the CPU and, at the same time, the next chunk is transferred to the GPU, so the GPU never has to wait on a synchronous transfer.
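    For reference, the pattern I'm describing looks roughly like this (a simplified sketch with illustrative names like up, down and device_buf, not my actual code), assuming pinned CPU buffers so the copies can actually run asynchronously:

    import torch

    gpu = torch.device('cuda')
    up = torch.cuda.Stream(device=gpu)     # host -> device copies
    down = torch.cuda.Stream(device=gpu)   # device -> host copies
    compute = torch.cuda.current_stream(gpu)

    n = 8
    size = 1_000_000
    chunks = [torch.rand(size).pin_memory() for _ in range(n)]    # pinned inputs
    results = [torch.empty(size).pin_memory() for _ in range(n)]  # pinned outputs
    device_buf = [None] * n

    for j in range(n):
        # enqueue the upload of chunk j on the upload stream
        with torch.cuda.stream(up):
            device_buf[j] = chunks[j].to(gpu, non_blocking=True)

        # compute on chunk j starts only after its upload has finished;
        # the host runs ahead and enqueues the next iteration's upload,
        # so upload j+1 can overlap with compute j on the GPU
        compute.wait_stream(up)
        device_buf[j].mul_(1.01).add_(1.01)

        # the download starts only after compute on chunk j has finished
        down.wait_stream(compute)
        with torch.cuda.stream(down):
            results[j].copy_(device_buf[j], non_blocking=True)
            # the buffer is used on `down`, so its memory must not be
            # handed out again before this copy completes
            device_buf[j].record_stream(down)

    torch.cuda.synchronize()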

    For this I used the default CUDA stream for processing and two additional streams for the parallel asynchronous transfers. In the actual application, whenever I used the s2 stream (instead of the default stream) to copy tensors onto the GPU, it did increase the speed, but it also caused a quick and steady rise in GPU memory until it overflowed.
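    One documented detail that might be related (I'm not sure it explains my OOM): Tensor.record_stream tells the caching allocator not to reuse the tensor's memory until all work queued on the recorded stream at that point has completed, so frees are deferred while the recorded stream is backed up. A toy illustration, not my actual code:

    import torch

    gpu = torch.device('cuda')
    s2 = torch.cuda.Stream(device=gpu)

    for step in range(100):
        x = torch.empty(16 * 2**20, device=gpu)  # 64 MiB scratch block (float32)
        x.record_stream(s2)  # mark x as "in use" on s2
        del x  # the block becomes reusable only after s2 drains past this
               # point; if s2 is saturated with pending copies, deferred
               # blocks pile up and memory climbs even though nothing leaks
        if step % 10 == 0:
            s2.synchronize()  # bounding the backlog caps the growth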

    I tried to reproduce this behavior but couldn't; the example below doesn't seem to cause memory issues. Still, I suspect I'm using the streams incorrectly somehow, so I'm hoping that someone who has worked with streams before can spot the error. In the current example s2 waits for s1, but every combination I tried fails in the actual application.

    import torch
    from time import perf_counter
    from threading import Thread

    cpu = torch.device('cpu')
    gpu = torch.device('cuda')

    _range = range(10)
    tensors = [torch.rand(100000000, device=cpu) for i in _range]

    s1 = torch.cuda.Stream(device=gpu)
    s2 = torch.cuda.Stream(device=gpu)

    for i in range(10000000000):
        time_start = perf_counter()

        for j in range(-1, 11):
            def PROCESS():
                # process chunk j on the default stream
                if j in _range:
                    k = tensors[j]
                    for l in range(10):
                        k.mul_(1.01)
                        k.add_(1.01)
                        k.pow_(0.5)

            def CPU():
                # transfer the previous chunk's result back to the cpu
                if j - 1 in _range:
                    with torch.cuda.stream(s1):
                        p = tensors[j - 1]
                        p.record_stream(s1)
                        p.data = p.data.to(device=cpu, memory_format=torch.preserve_format, non_blocking=True)

            def GPU():
                # prefetch the next chunk onto the gpu
                if j + 1 in _range:
                    # using the second stream causes oom in the actual application
                    with torch.cuda.stream(s2):
                        g = tensors[j + 1]
                        g.data = g.data.to(device=gpu, memory_format=torch.preserve_format, non_blocking=True)
                        g.record_stream(s2)
                        # note: for s2, record_stream is called after the copy,
                        # because the tensor starts on the cpu and record_stream
                        # cannot be called on a cpu tensor

            t2 = Thread(target=CPU)
            t2.start()
            t3 = Thread(target=GPU)
            t3.start()
            t1 = Thread(target=PROCESS)
            t1.start()

            s1.wait_stream(torch.cuda.default_stream(gpu))
            s2.wait_stream(s1)
            s1.synchronize()

            t2.join()
            t3.join()
            t1.join()

        elapsed = perf_counter() - time_start
        time_duration = "%.5f sec/it" % elapsed

        print(f"\rspeed: {time_duration}", end="\r")


    You can run the example and see the result for yourself: it runs faster when stream s2 is used, and slower if that part is commented out. The actual app is much more complicated, it uses thousands of these tensors and the loop is not straightforward, but using the s2 stream there always causes OOM.
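    To watch the rise while the loop runs, the allocator counters can be polled between chunks (a generic debugging aid, nothing specific to this code):

    allocated = torch.cuda.memory_allocated(gpu) / 2**20
    reserved = torch.cuda.memory_reserved(gpu) / 2**20
    print(f"allocated: {allocated:.0f} MiB, reserved: {reserved:.0f} MiB")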

    Continue reading...
