Generating a numpy array of pairwise similarity scores #25

Open
dhimmel opened this issue Apr 3, 2023 · 0 comments

dhimmel commented Apr 3, 2023

Here's some example code to generate pairwise Lin similarity scores for all node pairs in an NXOntology. For now I'm just posting it in case it's helpful, but we could also add a function that populates a matrix with a chosen similarity metric.

import logging
from itertools import combinations_with_replacement

import fsspec
import numpy as np
import numpy.typing as npt
from nxontology import NXOntology


def generate_similarity_matrix(nxo: NXOntology[str]) -> npt.NDArray[np.float32]:
    """Return a dense, symmetric matrix of Lin similarity for every node pair."""
    nxo.freeze()
    nodes = list(nxo.graph.nodes)
    # ensure nodes are sorted, since matrix does not store row/column names
    assert sorted(nodes) == nodes
    similarity_array = np.zeros(shape=(nxo.n_nodes, nxo.n_nodes), dtype=np.float32)
    logging.info(
        f"Initialized {similarity_array.shape} array:\n{similarity_array[:5, :5]}"
    )
    # lin is symmetric, so we use combinations_with_replacement rather than product
    for (row, row_efo), (col, col_efo) in combinations_with_replacement(
        list(enumerate(nodes)), r=2
    ):
        lin = nxo.similarity(row_efo, col_efo).lin
        similarity_array[row, col] = lin
        # mirror into the lower triangle; only valid for symmetric metrics
        similarity_array[col, row] = lin
    logging.info(f"Populated array with similarity:\n{similarity_array[:5, :5]}")
    return similarity_array


similarity_array = generate_similarity_matrix(nxo)
path = "similarity-lin.npy.xz"
with fsspec.open(path, "wb", compression="infer") as write_file:
    np.save(write_file, similarity_array)
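
The reusable function floated at the top of this issue could look roughly like the sketch below. It is not part of nxontology: `populate_symmetric_matrix` and its `score` callable are hypothetical names, and the toy scoring function only stands in for something like `lambda a, b: nxo.similarity(a, b).lin`. It assumes the metric is symmetric, as the code above does.

```python
from itertools import combinations_with_replacement
from typing import Callable, Sequence

import numpy as np
import numpy.typing as npt


def populate_symmetric_matrix(
    nodes: Sequence[str],
    score: Callable[[str, str], float],
) -> npt.NDArray[np.float32]:
    """Fill a dense matrix from a symmetric pairwise score (hypothetical helper)."""
    n = len(nodes)
    matrix = np.zeros(shape=(n, n), dtype=np.float32)
    # Visit each unordered pair once, including the diagonal: n * (n + 1) / 2 calls.
    for (row, node_a), (col, node_b) in combinations_with_replacement(
        list(enumerate(nodes)), 2
    ):
        value = score(node_a, node_b)
        matrix[row, col] = value
        matrix[col, row] = value  # mirror; only valid for symmetric metrics
    return matrix


def toy_score(node_a: str, node_b: str) -> float:
    """Stand-in metric for illustration: identical nodes score 1, others 0.5."""
    return 1.0 if node_a == node_b else 0.5


toy_matrix = populate_symmetric_matrix(["a", "b", "c"], toy_score)
```

Passing the metric in as a callable keeps the matrix-filling loop independent of which similarity attribute is used, at the cost of requiring the caller to know whether the metric is symmetric.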

On EFO, saving as an XZ-compressed npy file worked well. scipy.sparse matrices are another option, but they can be slower or faster to work with depending on the operation.
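
Because the saved array carries no row or column names, reading it back requires the same sorted node list that was used to build it. Below is a minimal sketch of a round trip and a lookup by node name. It uses the standard-library `lzma` module instead of fsspec to stay dependency-light, and the node identifiers, the array values, and the helper names (`save_similarity`, `load_similarity`, `lookup`) are all made up for illustration.

```python
import io
import lzma
import tempfile
from pathlib import Path

import numpy as np

# Hypothetical node identifiers; in practice this would be sorted(nxo.graph.nodes).
nodes = ["EFO:0000001", "EFO:0000002", "EFO:0000003"]
node_to_index = {node: index for index, node in enumerate(nodes)}

# Stand-in for the array returned by generate_similarity_matrix.
similarity_array = np.array(
    [[1.0, 0.5, 0.25], [0.5, 1.0, 0.75], [0.25, 0.75, 1.0]],
    dtype=np.float32,
)


def save_similarity(path, array):
    """Write the array as an XZ-compressed npy file."""
    with lzma.open(path, "wb") as write_file:
        np.save(write_file, array)


def load_similarity(path):
    """Decompress fully into memory, then parse the npy payload."""
    with lzma.open(path, "rb") as read_file:
        return np.load(io.BytesIO(read_file.read()))


def lookup(array, index_by_node, node_a, node_b):
    """Return the stored score for a pair of node names."""
    return float(array[index_by_node[node_a], index_by_node[node_b]])


with tempfile.TemporaryDirectory() as directory:
    demo_path = Path(directory) / "similarity-lin.npy.xz"
    save_similarity(demo_path, similarity_array)
    reloaded = load_similarity(demo_path)

pair_score = lookup(reloaded, node_to_index, "EFO:0000002", "EFO:0000003")
```

Storing the node list alongside the array (for example as a separate text file) would avoid having to rebuild the ontology just to recover the index.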

@dhimmel dhimmel transferred this issue from related-sciences/nxontology-data Apr 3, 2023