Very high memory usage #220
Hello @lmcinnes! Thank you for the amazing package, the speed and accuracy of the index are impressive! Unfortunately, however, I also have a memory issue. I am trying to build a 5-NN graph for 70,000,000 1024-dimensional vectors, but it does not fit into 1.75 TB of RAM even with the most memory-conservative settings.

pynn_knn = pynndescent.PyNNDescentTransformer(
    metric='cosine', n_neighbors=5, search_epsilon=0.25, n_jobs=1, low_memory=True, verbose=True
).fit_transform(my_vectors)

Could you advise us on how to make PyNNDescent more memory-efficient? For example, would it make sense to decrease the values of the pruning_degree_multiplier and max_candidates parameters? In my use case, an accuracy drop is acceptable. Thank you.
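For concreteness, a minimal sketch of what lowering those two parameters might look like, assuming PyNNDescentTransformer forwards them to the underlying NNDescent index; the specific values are illustrative guesses rather than recommendations, and my_vectors is the data array from above.

import pynndescent

# Sketch only: a smaller pruning_degree_multiplier keeps fewer edges per node
# in the search graph, and a smaller max_candidates shrinks the candidate
# lists used during the NN-descent iterations. Both trade accuracy for memory.
pynn_knn = pynndescent.PyNNDescentTransformer(
    metric='cosine',
    n_neighbors=5,
    search_epsilon=0.25,
    n_jobs=1,
    low_memory=True,
    pruning_degree_multiplier=1.0,  # illustrative value, below the library default
    max_candidates=20,              # illustrative value; smaller pools use less memory
    verbose=True,
).fit_transform(my_vectors)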
For a dataset that large I think you'll be much better off using an index that uses quantization, such as LanceDB or diskANN.
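To see why quantization matters at this scale, here is a rough back-of-envelope calculation; the 64-code product-quantization layout below is purely illustrative and not a property of any particular library.

n, d = 70_000_000, 1024

# Raw float32 vectors.
raw_bytes = n * d * 4
print(f"raw float32 vectors: {raw_bytes / 1e9:.0f} GB")  # ~287 GB

# Product quantization: split each vector into 64 sub-vectors and store each
# as a one-byte code into a 256-entry codebook of float32 centroids.
m = 64
pq_bytes = n * m + m * 256 * (d // m) * 4
print(f"PQ codes + codebooks: {pq_bytes / 1e9:.1f} GB")  # ~4.5 GB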
Thank you for your reply! In the end, I still used PyNNDescent by first creating two k-NN graphs for the two halves of my data, clustering the nodes with a BFS-based similarity cutoff, and then building the final k-NN graph over the cluster representatives. I found other packages less convenient for creating k-NN graphs, as they typically provide only indexing capabilities, so the graph has to be assembled by querying the whole index again. 🙂
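A rough sketch of that divide-and-conquer idea, for anyone landing here later. The similarity cutoff, the choice of representatives, the helper names, and the full data array X are all assumptions made for illustration, not the actual code used above.

import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components
from pynndescent import NNDescent

def knn_graph(X_chunk, k=5):
    # Build a k-NN graph for one chunk; returns (neighbour indices, cosine distances).
    index = NNDescent(X_chunk, n_neighbors=k + 1, metric="cosine", low_memory=True)
    ind, dist = index.neighbor_graph
    return ind[:, 1:], dist[:, 1:]  # drop the self-neighbour in column 0

def cluster_by_cutoff(ind, dist, n, cutoff=0.05):
    # Keep only edges closer than the cutoff, then take connected components
    # (equivalent to a BFS over the thresholded similarity graph).
    rows = np.repeat(np.arange(n), ind.shape[1])
    cols, vals = ind.ravel(), dist.ravel()
    mask = vals <= cutoff
    adj = coo_matrix((np.ones(mask.sum()), (rows[mask], cols[mask])), shape=(n, n))
    _, labels = connected_components(adj, directed=False)
    return labels

# Process each half separately and keep one representative per cluster.
reps = []
for X_half in (X[: len(X) // 2], X[len(X) // 2 :]):
    ind, dist = knn_graph(X_half)
    labels = cluster_by_cutoff(ind, dist, len(X_half))
    first_member = np.unique(labels, return_index=True)[1]
    reps.append(X_half[first_member])

# Final k-NN graph over the much smaller set of representatives.
X_reps = np.vstack(reps)
final_index = NNDescent(X_reps, n_neighbors=5, metric="cosine", low_memory=True)
final_ind, final_dist = final_index.neighbor_graph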
That makes sense if you just want a kNN graph. The process you used is similar to the dask-distributed pynndescent that exists here in the …
If NNDescent uses over 30 GB for an example 20M x 2 dataset (150 MB in single precision), that does seem a bit much indeed. That's 200x more memory than the dataset itself.
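The back-of-envelope check behind that ratio:

n, d = 20_000_000, 2
dataset_gib = n * d * 4 / 1024 ** 3   # float32 dataset, ~0.15 GiB (about 150 MB)
print(round(dataset_gib, 3))          # 0.149
print(round(30 / dataset_gib))        # ~200x the size of the data itself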
I agree that sounds excessive, but there's really nothing in the algorithm that would call for that unless something has gone astray, or memory hasn't been collected yet. Possibly there is a memory leak in the numba-compiled material? It's hard to say for sure, but I haven't personally encountered such a situation. A last note: if you have 2D data you should be using a KD-Tree for exact neighbors.
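A minimal sketch of exact nearest neighbors for the 2D case with scikit-learn's KDTree; the random array here is only a stand-in for the 20M x 2 dataset mentioned above.

import numpy as np
from sklearn.neighbors import KDTree

X = np.random.randn(20_000_000, 2).astype(np.float32)

# Exact k-NN for low-dimensional data: a KD-tree is both fast and far more
# memory-friendly than an approximate graph index here.
tree = KDTree(X)
dist, ind = tree.query(X, k=6)  # k=6 so that 5 neighbours remain after the self-match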
I ran a quick benchmark to profile NNDescent's memory usage:

%load_ext memory_profiler

import numpy as np
from pynndescent import NNDescent

X = np.random.randn(1_000_000, 768).astype(np.float32)
print(f"Array memory usage: {round(X.nbytes / 1024 ** 3, 3)} GB")

%memit NNDescent(X, metric="cosine", compressed=True)

Output:

Doesn't look unreasonable to me: NNDescent uses about 4x as much memory as the dataset itself. Profiling was done on an M1 Max with pynndescent v0.5.12 and numba v0.59.1.
Hi, I am experiencing excessive memory utilization with the library:
The following code exceeds 30 GB of memory on my machine. Surprisingly, other libraries like hnswlib are capable of building the same index in approximately 5 GB of memory.
Is this a possible bug/memory leak in the library, or is it the expected memory usage?
Testing with version 0.5.10.
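For reference, a minimal sketch of the kind of hnswlib comparison being described; since the original snippet is not quoted above, the array shape and the HNSW parameters here are assumptions, not the poster's actual settings.

import numpy as np
import hnswlib

# Assumed stand-in for the original data; the real shape is not shown above.
X = np.random.randn(1_000_000, 768).astype(np.float32)

# Build an HNSW index with cosine distance.
index = hnswlib.Index(space="cosine", dim=X.shape[1])
index.init_index(max_elements=X.shape[0], ef_construction=200, M=16)
index.add_items(X)

# Query a few points to get their 5 nearest neighbours.
labels, distances = index.knn_query(X[:10], k=5)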