No memory reduction observed in a simple sparse-dense multiplication #15

Open
@x-zho14

Hi, I ran an experiment with the following code:

import torch
from pytorch_block_sparse import BlockSparseLinear
import time
import sys
iter = int(sys.argv[1])
dsty = float(sys.argv[2])

fc = BlockSparseLinear(1024, 256, density=dsty)
fc_dense = torch.nn.Linear(1024, 256).cuda()
input = torch.ones(3, 1024).cuda()

i = 0
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
t1 = time.time()

while(i < iter):
    output = fc(input)
    i += 1
end.record()
t2 = time.time()

torch.cuda.synchronize()
print("cpu time:", t2-t1)
print(start.elapsed_time(end))
print(torch.cuda.memory_summary())

i = 0
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
t1 = time.time()

while(i < iter):
    output = fc_dense(input)
    i += 1

end.record()
t2 = time.time()
torch.cuda.synchronize()
print("cpu time:", t2-t1)
print(start.elapsed_time(end))
print(torch.cuda.memory_summary())

I find that the running time decreases when the iteration count is small, but the memory consumption does not decrease.
sparse:

|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |    1248 KB |    1254 KB |    7280 KB |    6032 KB |
|       from large pool |       0 KB |       0 KB |       0 KB |       0 KB |
|       from small pool |    1248 KB |    1254 KB |    7280 KB |    6032 KB |
|---------------------------------------------------------------------------|
| Active memory         |    1248 KB |    1254 KB |    7280 KB |    6032 KB |
|       from large pool |       0 KB |       0 KB |       0 KB |       0 KB |
|       from small pool |    1248 KB |    1254 KB |    7280 KB |    6032 KB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |    2048 KB |    2048 KB |    2048 KB |       0 B  |
|       from large pool |       0 KB |       0 KB |       0 KB |       0 B  |
|       from small pool |    2048 KB |    2048 KB |    2048 KB |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |     800 KB |    2047 KB |    8080 KB |    7280 KB |
|       from large pool |       0 KB |       0 KB |       0 KB |       0 KB |
|       from small pool |     800 KB |    2047 KB |    8080 KB |    7280 KB |
|---------------------------------------------------------------------------|
| Allocations           |      12    |      15    |    2066    |    2054    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |      12    |      15    |    2066    |    2054    |
|---------------------------------------------------------------------------|
| Active allocs         |      12    |      15    |    2066    |    2054    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |      12    |      15    |    2066    |    2054    |
|---------------------------------------------------------------------------|
| GPU reserved segments |       1    |       1    |       1    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       1    |       1    |       1    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |       5    |       5    |    1033    |    1028    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       5    |       5    |    1033    |    1028    |
|===========================================================================|

dense:

|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |    1248 KB |    1251 KB |    4280 KB |    3032 KB |
|       from large pool |       0 KB |       0 KB |       0 KB |       0 KB |
|       from small pool |    1248 KB |    1251 KB |    4280 KB |    3032 KB |
|---------------------------------------------------------------------------|
| Active memory         |    1248 KB |    1251 KB |    4280 KB |    3032 KB |
|       from large pool |       0 KB |       0 KB |       0 KB |       0 KB |
|       from small pool |    1248 KB |    1251 KB |    4280 KB |    3032 KB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |    2048 KB |    2048 KB |    2048 KB |       0 B  |
|       from large pool |       0 KB |       0 KB |       0 KB |       0 B  |
|       from small pool |    2048 KB |    2048 KB |    2048 KB |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |     800 KB |    2047 KB |    5080 KB |    4280 KB |
|       from large pool |       0 KB |       0 KB |       0 KB |       0 KB |
|       from small pool |     800 KB |    2047 KB |    5080 KB |    4280 KB |
|---------------------------------------------------------------------------|
| Allocations           |      12    |      15    |    1066    |    1054    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |      12    |      15    |    1066    |    1054    |
|---------------------------------------------------------------------------|
| Active allocs         |      12    |      15    |    1066    |    1054    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |      12    |      15    |    1066    |    1054    |
|---------------------------------------------------------------------------|
| GPU reserved segments |       1    |       1    |       1    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       1    |       1    |       1    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |       5    |       5    |     533    |     528    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       5    |       5    |     533    |     528    |
|===========================================================================|

Could you please help me find the problem? In fact, the total allocated memory (Tot Alloc) is even higher for the sparse layer: 7280 KB versus 4280 KB for the dense one. Thanks in advance.
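For reference, here is a rough back-of-the-envelope estimate of how much parameter memory each layer might be expected to use. It is only a sketch under stated assumptions: float32 parameters, a bias on both layers, and a block-sparse layer that stores exactly the `density` fraction of the weight entries. The helper `linear_param_kib` is hypothetical and not part of `pytorch_block_sparse`; block index metadata and allocator rounding are ignored. One thing worth checking against these numbers: in the script as posted, both `fc` and `fc_dense` are created before either summary is printed, and `torch.cuda.memory_summary()` reports the allocator state for the whole process, so each summary would include both layers.

```python
def linear_param_kib(in_features, out_features, density=1.0, bytes_per_element=4):
    """Estimate the parameter storage of a linear layer in KiB.

    Assumptions (not taken from the library): float32 elements, a bias
    vector, and a block-sparse weight that stores only `density` of the
    entries. Index metadata and allocator rounding are ignored.
    """
    weight_bytes = in_features * out_features * density * bytes_per_element
    bias_bytes = out_features * bytes_per_element
    return (weight_bytes + bias_bytes) / 1024


# Shapes from the script above: Linear(1024, 256) with a (3, 1024) input.
dense_kib = linear_param_kib(1024, 256)
input_kib = 3 * 1024 * 4 / 1024   # input activations
output_kib = 3 * 256 * 4 / 1024   # output activations of one forward pass

print(f"dense parameters : {dense_kib:.1f} KiB")
print(f"input / output   : {input_kib:.1f} KiB / {output_kib:.1f} KiB")

# Illustrative densities only; the density used in the reported run is not stated.
for density in (0.5, 0.25, 0.1):
    sparse_kib = linear_param_kib(1024, 256, density=density)
    print(f"density {density:<4} : {sparse_kib:.1f} KiB")
```

Under these assumptions the dense layer alone accounts for about 1025 KiB, so any saving from the sparse layer would only show up if each layer were measured separately, for example by running the sparse and dense cases in separate processes.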
