
Very high memory usage #220

Open
fmoralesalcaide opened this issue Jun 19, 2023 · 7 comments


fmoralesalcaide commented Jun 19, 2023

Hi, I am experiencing excessive memory utilization with the library:

import numpy as np
import pynndescent

X = np.random.rand(20_000_000, 2)  # 20M 2-dimensional vectors

index = pynndescent.NNDescent(X)
index.prepare()

The code above exceeds 30 GB of memory on my machine. Surprisingly, other libraries such as hnswlib can build the same index in approximately 5 GB of memory.

Is this a possible bug/memory leak in the library, or is it the expected memory usage?
Tested with version 0.5.10.
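One quick way to narrow down where the memory goes is the standard-library tracemalloc module; a minimal sketch (not part of the original report) follows. Note that memory allocated natively inside numba-compiled code may not be visible to tracemalloc, so the OS-reported RSS remains the more reliable ceiling.

```python
import tracemalloc

import numpy as np

# tracemalloc sees allocations routed through Python's allocator
# (NumPy reports its buffers to it); native allocations made inside
# numba-compiled code may not show up, so treat this as a lower bound.
tracemalloc.start()

X = np.random.rand(1_000_000, 2)  # float64: ~16 MB

current, peak = tracemalloc.get_traced_memory()
print(f"peak traced memory: {peak / 1024**2:.1f} MiB")
tracemalloc.stop()
```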

@roman-bushuiev

Hello @lmcinnes! Thank you for the amazing package; the speed and accuracy of the index are impressive! Unfortunately, I also have a memory issue: I am trying to build a 5-NN graph for 70,000,000 1024-dimensional vectors, but it does not fit into 1.75 TB of RAM even with the most memory-conservative settings.

import pynndescent

pynn_knn = pynndescent.PyNNDescentTransformer(
    metric="cosine", n_neighbors=5, search_epsilon=0.25, n_jobs=1, low_memory=True, verbose=True
).fit_transform(my_vectors)

Could you advise us on how to make PyNNDescent more memory-efficient? For example, would it make sense to decrease the values of the pruning_degree_multiplier and max_candidates parameters? In my use case, a drop in accuracy is acceptable. Thank you.

lmcinnes (Owner) commented Mar 22, 2024 via email

@roman-bushuiev

For a dataset that large I think you'll be much better off using an index that uses quantization, such as LanceDB or DiskANN.

Thank you for your reply! In the end I still used PyNNDescent: I first built two k-NN graphs for the two halves of my data, clustered the nodes with a BFS-based similarity cutoff, and then built the final k-NN graph over the cluster representatives. I found other packages less convenient for creating k-NN graphs, as they typically provide only indexing capabilities, so the graph has to be assembled by re-querying the whole index. 🙂

@lmcinnes (Owner)

That makes sense if you just want a kNN graph. The process you used is similar to the dask-distributed pynndescent that exists here in the distributed branch. If there is significant call for such an option I might try to finish it up and get it merged into main.


lsorber commented May 20, 2024

If NNDescent uses over 30GB for an example 20M x 2 dataset (about 150MB in single precision), that does seem a bit much indeed: roughly 200x more memory than the dataset itself.

@lmcinnes (Owner)

I agree that sounds excessive, but there's really nothing in the algorithm that would call for that unless something has gone astray, or memory hasn't been collected yet. Possibly there is a memory leak in the numba-compiled material? It's hard to say for sure, but I haven't personally encountered such a situation.

One last note: if you have 2D data you should be using a KD-tree for exact neighbors.
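For the 2D case in the original report, an exact KD-tree query is cheap; a sketch using scipy's cKDTree (sklearn's KDTree would work equally well):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(42)
X = rng.random((20_000, 2))  # low-dimensional data: exact search is fast

tree = cKDTree(X)               # build once, O(n log n)
dist, idx = tree.query(X, k=6)  # 6 nearest; column 0 is the point itself

print(idx.shape)  # (20000, 6)
```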


lsorber commented May 21, 2024

I ran a quick benchmark to profile NNDescent's memory usage:

%load_ext memory_profiler

import numpy as np
from pynndescent import NNDescent

X = np.random.randn(1_000_000, 768).astype(dtype=np.float32)
print(f"Array memory usage: {round(X.nbytes / 1024 ** 3, 3)} GB")

%memit NNDescent(X, metric="cosine", compressed=True)

Output:

Array memory usage: 2.861 GB
peak memory: 11293.77 MiB, increment: 8151.94 MiB (in 1m59.9s)

That doesn't look unreasonable to me: NNDescent peaks at about 4x the dataset's size. Profiling was done on an M1 Max with pynndescent v0.5.12 and numba v0.59.1.
