Very high memory usage #220
Hello @lmcinnes! Thank you for the amazing package, the speed and accuracy of the index are impressive! Unfortunately, however, I also have a memory issue. I am trying to build a 5-NN graph for 70,000,000 1024-dimensional vectors, but it does not fit into 1.75 TB of RAM even with the most memory-conservative settings.

pynn_knn = pynndescent.PyNNDescentTransformer(
    metric='cosine', n_neighbors=5, search_epsilon=0.25, n_jobs=1, low_memory=True, verbose=True
).fit_transform(my_vectors)

Could you advise us on how to make PyNNDescent more memory-efficient? For example, would it make sense to decrease the values of the pruning_degree_multiplier and max_candidates parameters? In my use case, an accuracy drop is acceptable. Thank you.
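For concreteness, a minimal sketch of what lowering those two parameters might look like, assuming PyNNDescentTransformer forwards them to the underlying NNDescent index; the specific values are illustrative guesses rather than recommendations, and my_vectors is the data array from above.

import pynndescent

# Sketch only: a smaller pruning_degree_multiplier keeps fewer edges per node
# in the search graph, and a smaller max_candidates shrinks the candidate
# lists used during the NN-descent iterations. Both trade accuracy for memory.
pynn_knn = pynndescent.PyNNDescentTransformer(
    metric='cosine',
    n_neighbors=5,
    search_epsilon=0.25,
    n_jobs=1,
    low_memory=True,
    pruning_degree_multiplier=1.0,  # illustrative value, below the library default
    max_candidates=20,              # illustrative value; smaller pools use less memory
    verbose=True,
).fit_transform(my_vectors)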
For a dataset that large I think you'll be much better off using an index that uses quantization, such as LanceDB or diskANN.
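To see why quantization matters at this scale, here is a rough back-of-envelope calculation; the 64-code product-quantization layout below is purely illustrative and not a property of any particular library.

n, d = 70_000_000, 1024

# Raw float32 vectors.
raw_bytes = n * d * 4
print(f"raw float32 vectors: {raw_bytes / 1e9:.0f} GB")  # ~287 GB

# Product quantization: split each vector into 64 sub-vectors and store each
# as a one-byte code into a 256-entry codebook of float32 centroids.
m = 64
pq_bytes = n * m + m * 256 * (d // m) * 4
print(f"PQ codes + codebooks: {pq_bytes / 1e9:.1f} GB")  # ~4.5 GB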
Thank you for your reply! In the end, I still used PyNNDescent by first creating two k-NN graphs for the two halves of my data, clustering the nodes with a BFS-based similarity cutoff, and then building the final k-NN graph over the cluster representatives. I found other packages less convenient for creating k-NN graphs, as they typically provide only indexing capabilities, so the graph has to be assembled by querying the whole index again. 🙂
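A rough sketch of that divide-and-conquer idea, for anyone landing here later. The similarity cutoff, the choice of representatives, the helper names, and the full data array X are all assumptions made for illustration, not the actual code used above.

import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components
from pynndescent import NNDescent

def knn_graph(X_chunk, k=5):
    # Build a k-NN graph for one chunk; returns (neighbour indices, cosine distances).
    index = NNDescent(X_chunk, n_neighbors=k + 1, metric="cosine", low_memory=True)
    ind, dist = index.neighbor_graph
    return ind[:, 1:], dist[:, 1:]  # drop the self-neighbour in column 0

def cluster_by_cutoff(ind, dist, n, cutoff=0.05):
    # Keep only edges closer than the cutoff, then take connected components
    # (equivalent to a BFS over the thresholded similarity graph).
    rows = np.repeat(np.arange(n), ind.shape[1])
    cols, vals = ind.ravel(), dist.ravel()
    mask = vals <= cutoff
    adj = coo_matrix((np.ones(mask.sum()), (rows[mask], cols[mask])), shape=(n, n))
    _, labels = connected_components(adj, directed=False)
    return labels

# Process each half separately and keep one representative per cluster.
reps = []
for X_half in (X[: len(X) // 2], X[len(X) // 2 :]):
    ind, dist = knn_graph(X_half)
    labels = cluster_by_cutoff(ind, dist, len(X_half))
    first_member = np.unique(labels, return_index=True)[1]
    reps.append(X_half[first_member])

# Final k-NN graph over the much smaller set of representatives.
X_reps = np.vstack(reps)
final_index = NNDescent(X_reps, n_neighbors=5, metric="cosine", low_memory=True)
final_ind, final_dist = final_index.neighbor_graph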
That makes sense if you just want a kNN graph. The process you used is similar to the dask-distributed pynndescent that exists here in the …
If NNDescent uses over 30 GB for an example 20M x 2 dataset (150 MB in single precision), that does seem a bit much indeed. That's 200x more memory than the dataset itself.
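The back-of-envelope check behind that ratio:

n, d = 20_000_000, 2
dataset_gib = n * d * 4 / 1024 ** 3   # float32 dataset, ~0.15 GiB (about 150 MB)
print(round(dataset_gib, 3))          # 0.149
print(round(30 / dataset_gib))        # ~200x the size of the data itself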
I agree that sounds excessive, but there's really nothing in the algorithm that would call for that unless something has gone astray, or memory hasn't been collected yet. Possibly there is a memory leak in the numba-compiled material? It's hard to say for sure, but I haven't personally encountered such a situation. A last note: if you have 2D data you should be using a KD-Tree for exact neighbors.
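A minimal sketch of exact nearest neighbors for the 2D case with scikit-learn's KDTree; the random array here is only a stand-in for the 20M x 2 dataset mentioned above.

import numpy as np
from sklearn.neighbors import KDTree

X = np.random.randn(20_000_000, 2).astype(np.float32)

# Exact k-NN for low-dimensional data: a KD-tree is both fast and far more
# memory-friendly than an approximate graph index here.
tree = KDTree(X)
dist, ind = tree.query(X, k=6)  # k=6 so that 5 neighbours remain after the self-match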
I ran a quick benchmark to profile NNDescent's memory usage:

%load_ext memory_profiler

import numpy as np
from pynndescent import NNDescent

X = np.random.randn(1_000_000, 768).astype(np.float32)
print(f"Array memory usage: {round(X.nbytes / 1024 ** 3, 3)} GB")

%memit NNDescent(X, metric="cosine", compressed=True)

Output:

Doesn't look unreasonable to me: NNDescent uses about 4x as much memory as the dataset itself. Profiling was done on an M1 Max with pynndescent v0.5.12 and numba v0.59.1.
Hi, I am experiencing excessive memory utilization with the library:
The following code exceeds 30 GB of memory on my machine. Surprisingly, other libraries like hnswlib are capable of building the same index in approximately 5 GB of memory.
Is this a possible bug/memory leak in the library, or is it the expected memory usage?
Testing with version 0.5.10.
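For reference, a minimal sketch of the kind of hnswlib comparison being described; since the original snippet is not quoted above, the array shape and the HNSW parameters here are assumptions, not the poster's actual settings.

import numpy as np
import hnswlib

# Assumed stand-in for the original data; the real shape is not shown above.
X = np.random.randn(1_000_000, 768).astype(np.float32)

# Build an HNSW index with cosine distance.
index = hnswlib.Index(space="cosine", dim=X.shape[1])
index.init_index(max_elements=X.shape[0], ef_construction=200, M=16)
index.add_items(X)

# Query a few points to get their 5 nearest neighbours.
labels, distances = index.knn_query(X[:10], k=5)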