Merge branch 'main' into richard-nrl
ricsi98 authored Jan 12, 2024
2 parents d07a9ea + efb3495 commit d8e281d
Showing 9 changed files with 120 additions and 7 deletions.
1 change: 1 addition & 0 deletions .github/CODEOWNERS
@@ -0,0 +1 @@
docs/tools/vdb_table/data @superlinked/vdb-table-maintainers
File renamed without changes.
File renamed without changes.
17 changes: 17 additions & 0 deletions .github/ISSUE_TEMPLATE/vdb-table_issue.md
@@ -0,0 +1,17 @@
---
name: Vector DB Comparison
about: Help us keep VDB comparison up to date and bug-free
title: ''
labels: 'vdb comparison'
assignees: AruneshSingh
---


## Issue

**Please share the bug or issue you are facing. (A screenshot would be appreciated)**

**Please share your source if the content is out of date, or link the relevant [discussion](https://github.com/superlinked/VectorHub/discussions/categories/vdb-comparison)**

Before submitting the issue, please read the appropriate [discussion](https://github.com/superlinked/VectorHub/discussions/categories/vdb-comparison) in case a similar conversation has already happened, and the [contribution guidelines](https://github.com/superlinked/VectorHub/tree/main/docs/tools/vdb_table). Thanks!

10 changes: 10 additions & 0 deletions .github/PULL_REQUEST_TEMPLATE/vdb-table_pr.md
@@ -0,0 +1,10 @@
---
labels: 'vdb comparison'
---

## Describe your changes

## Link to relevant [discussion](https://github.com/superlinked/VectorHub/discussions/categories/vdb-comparison)

## Checklist before requesting a review
- [ ] I have followed the [contribution guidelines](https://github.com/superlinked/VectorHub/tree/main/docs/tools/vdb_table)
@@ -1,3 +1,5 @@


## Describe your changes

## Issue ticket number and link
Binary file added docs/assets/tools/vdb_table/cover.gif
84 changes: 83 additions & 1 deletion docs/tools/vdb_table/README.md
@@ -1 +1,83 @@
# VDB Comparison
# Vector DB Comparison

![](../../assets/tools/vdb_table/cover.gif)

[Vector DB Comparison](https://vdbs.superlinked.com/) is a free and open source tool from VectorHub to compare vector databases. It is created to outline the feature sets of different VDB solutions. Each of the features outlined has been verified to varying degrees.

This is a community initiative spearheaded by [Dhruv Anand](https://www.linkedin.com/in/dhruv-anand-ainorthstartech/), Founder of AI Northstar Tech, to give visibility to the different VDB offerings. Following the initial launch of the benchmarking table, a group of collaborators (listed below) was formed to verify claims before publishing on VectorHub.

[VectorHub](https://hub.superlinked.com/) is a community-driven learning platform for information retrieval hosted by [Superlinked](https://superlinked.com/). Superlinked is a vector compute solution in the ML stack alongside the different VDBs.

For this exercise, the collaborators have worked with points of contact from the VDBs to ensure neutrality and fairness, creating an accurate tool for practitioners.

**Table Interactions**
- Search: Use the search bar on top.
- Sort: Click on a column name to sort. Shift-click multiple headers to sort by multiple columns in the order clicked.
- Filter: Hover over a column name to reveal the filter menu icon, then click the relevant value to filter.
- Vendor links: Each vendor has links to their website, GitHub, documentation, discussion (on this GitHub repo), and point of contact. Click the link button next to the vendor name.
- Documentation links: Cells that have a supporting link to the vendor's documentation show an external-link button in the cell.
- Comments: Additional comments by maintainers are shown when hovering over a cell.


**Maintainers:**
- [Dhruv Anand](https://www.linkedin.com/in/dhruv-anand-ainorthstartech/)
- [Prashanth Rao](https://www.linkedin.com/in/prrao87/)
- [Ravi Harige](https://www.linkedin.com/in/ravindraharige/)
- [Daniel Svonava](https://www.linkedin.com/in/svonava/)

**Frontend:**
- [Arunesh Singh](https://www.linkedin.com/in/aruneshsingh99/)



## Contributing

Thanks for your interest in contributing to [vdbs.superlinked.com](https://vdbs.superlinked.com) and keeping the data up to date.

We use [discussions](https://github.com/superlinked/VectorHub/discussions/categories/vdb-comparison) as our way to have conversations about each vendor. Please find the relevant discussion and add to the conversation.

Kindly review the following sections before you submit your issue or initial pull request, and use the appropriate issue/PR template. In addition, check existing open issues and pull requests to ensure that someone else has not already corrected the information.

If you need any help, feel free to tag [@AruneshSingh](https://github.com/AruneshSingh) in your discussions/issues/PRs.

### About this repository

- **Frontend:** The frontend is created in React, using ag-grid for the tables and Material UI for the interface components. It's hosted and deployed using Vercel.
- **Backend:** This GitHub repo serves as the data source for the table. Any updates to the JSON files are validated using GitHub Actions and pushed to a Google storage bucket, from where the frontend fetches the data.
- **Discussions:** All the vendors have a dedicated discussion [here](https://github.com/superlinked/VectorHub/discussions/categories/vdb-comparison). Please go through the discussions before raising an issue/PR.

### Structure

This subdirectory is structured as follows:

```
tools/
└── vdb_table/
├── vendor.schema.json
├── ...
├── ...
└── data/
├── vendor1.json
├── vendor2.json
├── vendor3.json
└── ...
```

| File | Description |
| --------------------- | ---------------------------------------- |
| `vendor.schema.json`  | JSON Schema file that describes each attribute, its properties, and its description |
| `vendorX.json` | All the attribute data for vendor X. We have one file per vendor. |


Attributes inside `vendorX.json` have the following properties:
- `support`: One of `""`, `"none"`, `"partial"`, or `"full"`, indicating the confidence level for that attribute's support.
- `value`: `license` and `dev_languages` use this property to hold the license details and the development languages (as a list).
- `source_url`: Documentation links, or other evidence supporting the attribute values. Shown as the 'external link' button in the cell.
- `comment`: Any other useful information; shown on hover via the info icon.
- `unlimited`: `doc_size` and `vector_dims` can have this property set to `true` if they support unlimited values.

For more nuanced information about each property, have a look at `vendor.schema.json`.
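
As a minimal sketch of how these properties fit together, here is a hypothetical attribute entry and a check of its `support` value against the allowed set. The concrete values (the URL, the comment text) are made up for illustration and are not taken from any real vendor file.

```python
import json

# Allowed confidence levels for the `support` property, as described above.
ALLOWED_SUPPORT = {"", "none", "partial", "full"}

# A hypothetical attribute entry, illustrating the properties described above.
attribute = json.loads("""
{
    "support": "partial",
    "source_url": "https://example.com/docs/hybrid-search",
    "comment": "Hybrid search available behind a feature flag."
}
""")

def check_attribute(attr):
    """Check that an attribute entry uses an allowed `support` value."""
    return attr.get("support", "") in ALLOWED_SUPPORT

print(check_attribute(attribute))  # → True
```

The real validation in the repo is driven by `vendor.schema.json` via GitHub Actions; this snippet only illustrates the shape of the data.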




13 changes: 7 additions & 6 deletions docs/use_cases/node_representation_learning.md
@@ -4,7 +4,7 @@

## Introduction: representing things and relationships between them

Of the various types of information - words, pictures, and connections between things - **relationships** are especially interesting. Relationships show how things interact and create networks. But not all ways of representing relationships are the same. In machine learning, **how we do vector representation of things and the relationships between them affects performance** on a wide range of tasks.
Of the various types of information - words, pictures, and connections between things - **relationships** are especially interesting. Relationships show how things interact and create networks. But not all ways of representing relationships are the same. In machine learning, **how we do vector representation of things and their relationships affects performance** on a wide range of tasks.

Below, we evaluate several approaches to vector representation on a real-life use case: how well each approach classifies academic articles in a subset of the Cora citation network.

@@ -48,11 +48,11 @@ evaluate(ds.x, ds.y)
>>> F1 macro 0.701
```

BoW's accuracy and F1 macro scores are pretty good, but leave significant room for improvement. BoW falls short of correctly classify papers more than 25% of the time. And on average across classes BoW is inaccurate nearly 30% of the time.
BoW's accuracy and F1 macro scores are pretty good, but leave significant room for improvement. BoW falls short of correctly classifying papers more than 25% of the time. And on average across classes BoW is inaccurate nearly 30% of the time.

## Taking advantage of citation graph data

Can we improve on this? Our citation dataset contains not only text data but also relationship data - a citation graph. Any given article will tend to cite other articles that belong to the same topic that it belongs to. Therefore, representations that embed not just textual data but also citation data of articles contained in our network will probably classify articles more accurately.
Can we improve on this? Our citation dataset contains not only text data but also relationship data - a citation graph. Any given article will tend to cite other articles that belong to the same topic that it belongs to. Therefore, representations that embed not just textual data but also citation data will probably classify articles more accurately.

BoW features represent text data. But how well does BoW capture the relationships between articles?

@@ -263,7 +263,7 @@ print(next(iter(loader)))
>>> Data(x=[2646, 1433], edge_index=[2, 8642], edge_label_index=[2, 2048], edge_label=[2048], ...)
```

In the `Data` object `x` contains the BoW node features. The `edge_label_index` tensor contains the head and tail node indices for the positive and negative samples. `edge_label` is the binary target for these pairs (1 for positive 0 for negative samples). The `edge_index` tensor holds the adjacency list for the current batch of nodes.
In the `Data` object, `x` contains the BoW node features. The `edge_label_index` tensor contains the head and tail node indices for the positive and negative samples. `edge_label` is the binary target for these pairs (1 for positive 0 for negative samples). The `edge_index` tensor holds the adjacency list for the current batch of nodes.
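
The roles of `edge_label_index` and `edge_label` can be sketched with plain Python: a standard link-prediction step scores each (head, tail) pair by the dot product of the two node embeddings and squashes it with a sigmoid, to be compared against the binary `edge_label` target. This is only a schematic of that objective, not the article's actual training code, and the tiny embeddings below are made up for illustration.

```python
import math

# Toy node embeddings, made up for illustration.
embeddings = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.0, 1.0]}
edge_label_index = [(0, 1), (0, 2)]  # one positive pair, one negative pair
edge_label = [1, 0]                  # binary targets for those pairs

def pair_scores(emb, pairs):
    """Sigmoid of the dot product for each (head, tail) pair."""
    scores = []
    for head, tail in pairs:
        dot = sum(a * b for a, b in zip(emb[head], emb[tail]))
        scores.append(1.0 / (1.0 + math.exp(-dot)))
    return scores

scores = pair_scores(embeddings, edge_label_index)
# the connected (positive) pair should score higher than the negative pair
print(scores[0] > scores[1])
```

Training pushes the scores of positive pairs toward 1 and negative pairs toward 0, which is what makes the learned embeddings reflect the citation structure.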

Now we can **train** our model as follows:

@@ -329,11 +329,12 @@ The results obtained with LLM only, Node2Vec combined with LLM, and GraphSAGE tr
| F1 (macro) | 0.779 (+7.8%) | **0.840** (+0.9%) | 0.831 (+1.1%) |


Let's explore how good LLM vectors are at *representing citation data*.
Let's also see **how well LLM vectors represent citation data**, again plotting connected and not connected pairs in terms of cosine similarity. How well do citation pairs show up in LLM vectors compared with BoW and Node2Vec?

![LLM cosine similarity edge counts](../assets/use_cases/node_representation_learning/bins_llm.png)

With LLM embeddings, nodes that are connected have a stronger similarity between their representations, much stronger than using Bag of Words (BoW) features. However, for pairs of nodes that aren't connected, there's still a wide range of similarity values. This makes it challenging to easily tell them apart from connected pairs - meaning that they are somewhere in between BoW and Node2Vec features in capturing the graph structure.
In LLM embeddings, positive (connected) citation pairs have higher cosine similarity values relative to all pairs, thus representing an improvement over BoW features in identifying positive pairs. However, negative (unconnected) pairs in LLM embeddings have a wide range of cosine similarity scores, making it difficult to distinguish connected from unconnected pairs. While LLM embeddings do better overall than BoW features at capturing the citation graph structure, they do not do as well as Node2Vec.
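
As a reminder of the quantity being plotted, the cosine similarity between two embedding vectors is their dot product divided by the product of their norms. A minimal sketch, with toy vectors made up for illustration:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy embeddings: a "connected" pair pointing in similar directions,
# and an "unconnected" pair that is nearly orthogonal.
connected = cosine_similarity([1.0, 0.2, 0.0], [0.9, 0.3, 0.1])
unconnected = cosine_similarity([1.0, 0.0, 0.0], [0.05, 1.0, 0.0])
print(connected > unconnected)  # connected pairs score higher
```

The histograms above bin exactly this score for every connected and unconnected node pair; the more the two distributions separate, the better the embedding captures the citation graph.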


## Conclusion: LLM, Node2Vec, GraphSAGE better at learning node and node relationship data than BoW


0 comments on commit d8e281d