Club Vector Embeddings #180
Changes from 8 commits
2686c5a
4e51895
10c0149
ad82b6b
28831ef
a911a12
cac9d25
d710b0b
eabc876
4dbdb4c
550d221
765dc09
baf2d2b
50484ea
45aa86d
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
package errors | ||
|
||
import "github.com/gofiber/fiber/v2" | ||
|
||
var ( | ||
FailedToCreateEmbedding = Error{ | ||
StatusCode: fiber.StatusInternalServerError, | ||
Message: "failed to create embedding from string", | ||
} | ||
FailedToUpsertToPinecone = Error{ | ||
StatusCode: fiber.StatusInternalServerError, | ||
Message: "failed to upsert to pinecone", | ||
} | ||
FailedToDeleteToPinecone = Error{ | ||
StatusCode: fiber.StatusInternalServerError, | ||
Message: "failed to delete from pinecone", | ||
} | ||
FailedToSearchToPinecone = Error{ | ||
StatusCode: fiber.StatusInternalServerError, | ||
Message: "failed to search on pinecone", | ||
} | ||
) |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -88,3 +88,15 @@ func (c *Club) AfterDelete(tx *gorm.DB) (err error) { | |
tx.Model(&c).Update("num_members", c.NumMembers-1) | ||
return | ||
} | ||
|
||
func (c *Club) SearchId() string { | ||
return c.ID.String() | ||
} | ||
|
||
func (c *Club) Namespace() string { | ||
return "clubs" | ||
} | ||
|
||
func (c *Club) EmbeddingString() string { | ||
return c.Name + " " + c.Name + " " + c.Name + " " + c.Name + " " + c.Description | ||
Review comment: Should be a formatted string; fmt.Sprintf
||
} |
Review comment: see above
Review comment: nit, but the solution should be on a new line
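The `fmt.Sprintf` suggestion on `EmbeddingString` could be sketched roughly as follows. This is only an illustration: the `Club` struct here is a pared-down stand-in with just the two fields the method reads, not the real model, and indexed verbs (`%[1]s`) are one of several ways to repeat the name.

```go
package main

import "fmt"

// Club is a hypothetical, pared-down stand-in for the real model.
// Only the fields used by EmbeddingString are included.
type Club struct {
	Name        string
	Description string
}

// EmbeddingString repeats the name four times (as in the diff) and then
// appends the description, using indexed verbs instead of concatenation.
func (c *Club) EmbeddingString() string {
	return fmt.Sprintf("%[1]s %[1]s %[1]s %[1]s %[2]s", c.Name, c.Description)
}

func main() {
	c := &Club{Name: "Chess", Description: "Weekly casual games"}
	fmt.Println(c.EmbeddingString())
}
```

The output string is the same as the concatenated version in the diff; only the construction changes.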
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,76 @@ | ||
# Jargon | ||
|
||
### Embeddings | ||
**Problem**: We have arbitrary-dimension (variable-length) data, such as club descriptions or | ||
searches for events. Given one piece of this data (a search query, a club description), we want | ||
to find other pieces that are similar to it; think of 2 club descriptions where both clubs are | ||
a cappella groups, or 2 search queries that are both effectively looking for professional | ||
fraternities. **Solution**: Transform the arbitrary-dimension data into fixed-dimension data, | ||
say, a vector of *n* floating-point numbers. Make the transformation in such a way that similar | ||
pieces of arbitrary-dimension data map to similar fixed-dimension data, i.e., vectors that are | ||
close together (think Euclidean distance). **How do we do this transformation?** Train a | ||
machine learning model on large amounts of text, and then use the model to make vectors. **So | ||
what's an embedding?** Formally, the embedding of a particular object is the vector created by | ||
feeding that object through the machine learning model. | ||
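To make "vectors that are close together" concrete, here is a minimal sketch of Euclidean distance between two embeddings. The 3-element vectors are made up for illustration (real embeddings have many more dimensions), and `EuclideanDistance` is a hypothetical helper, not something from this PR.

```go
package main

import (
	"fmt"
	"math"
)

// EuclideanDistance returns the straight-line distance between two vectors.
// It returns an error if the vectors do not have the same dimension.
func EuclideanDistance(a, b []float32) (float64, error) {
	if len(a) != len(b) {
		return 0, fmt.Errorf("dimension mismatch: %d vs %d", len(a), len(b))
	}
	var sum float64
	for i := range a {
		d := float64(a[i]) - float64(b[i])
		sum += d * d
	}
	return math.Sqrt(sum), nil
}

func main() {
	// Made-up toy "embeddings"; the values carry no real meaning.
	acapellaA := []float32{0.9, 0.1, 0.0}
	acapellaB := []float32{0.8, 0.2, 0.0}
	fraternity := []float32{0.0, 0.1, 0.9}

	near, _ := EuclideanDistance(acapellaA, acapellaB)
	far, _ := EuclideanDistance(acapellaA, fraternity)
	fmt.Printf("similar clubs: %.3f, dissimilar clubs: %.3f\n", near, far)
}
```

A good embedding model is one where the first distance tends to be smaller than the second for descriptions a person would call similar.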
|
||
This is arguably the most complex/unintuitive part of understanding search, so here are some extra | ||
resources: | ||
- [What are embeddings?](https://www.cloudflare.com/learning/ai/what-are-embeddings/) | ||
- [fastai book - Chapters 10 and 12 are both about natural language processing](https://github.com/fastai/fastbook) | ||
- [Vector Embeddings for Developers: The Basics](https://www.pinecone.io/learn/vector-embeddings-for-developers/) | ||
|
||
### OpenAI API | ||
**Problem**: We need a machine learning model to create the embeddings. **Solution**: Use | ||
OpenAI's API to create the embeddings for us; we send text over a REST API and we get back a | ||
vector that represents that text's embedding. | ||
|
||
### PineconeDB | ||
**Problem**: We've created a bunch of embeddings for our club descriptions (or event | ||
descriptions, etc.); we now need a place to store them and a way to search through them (with an | ||
embedding for a search query). **Solution**: PineconeDB is a vector database that allows us to | ||
upload our embeddings and then query them by giving it a vector to find similar ones. | ||
|
||
# How to create searchable objects for fun and fame and profit | ||
|
||
```golang | ||
package search | ||
|
||
// in backend/search/searchable.go | ||
type Searchable interface { | ||
SearchId() string | ||
Namespace() string | ||
EmbeddingString() string | ||
} | ||
|
||
// in backend/search/pinecone.go | ||
type PineconeClientInterface interface { | ||
Upsert(item Searchable) *errors.Error | ||
Delete(item Searchable) *errors.Error | ||
Search(item Searchable, topK int) ([]string, *errors.Error) | ||
} | ||
``` | ||
|
||
1. Implement the `Searchable` interface on whatever model you want to make searchable. | ||
`Searchable` requires 3 methods: | ||
- `SearchId()`: This should return a unique id that can be used to store a model entry's | ||
embedding (if you want to store it at all) in PineconeDB. In practice, this should be the | ||
entry's UUID. | ||
- `Namespace()`: Namespaces are to PineconeDB what tables are to PostgreSQL. Searching in | ||
one namespace will only retrieve vectors in that namespace. In practice, this should be | ||
unique to the model type (e.g., `Club`, `Event`). | ||
- `EmbeddingString()`: This should return the string you want to feed into the OpenAI API | ||
and create an embedding for. In practice, create a string with the fields you think will | ||
affect the embedding all appended together, and/or try repeating a field multiple times in | ||
the string to see if that gives a better search experience. | ||
2. Use a `PineconeClientInterface` and call `Upsert` with your searchable object to send it to the | ||
database, and `Delete` with your searchable object to delete it from the database. Upserts | ||
should be done when a model entry is created or updated, and deletes should be done when a | ||
model entry is deleted. In practice, a `PineconeClientInterface` should be passed in to | ||
the various services in `backend/server.go`, similar to how `*gorm.DB` and | ||
`*validator.Validator` instances are passed in. | ||
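The two steps above can be sketched end to end as follows. Everything except the `Searchable` interface is an assumption made for illustration: `Event` is a hypothetical model, `fakeClient` is an in-memory stand-in for a real `PineconeClientInterface`, and plain `error` replaces `*errors.Error` so the sketch stays self-contained.

```go
package main

import "fmt"

// Searchable mirrors the interface in backend/search/searchable.go.
type Searchable interface {
	SearchId() string
	Namespace() string
	EmbeddingString() string
}

// Event is a hypothetical model used only for this example.
type Event struct {
	ID          string
	Name        string
	Description string
}

func (e *Event) SearchId() string  { return e.ID }
func (e *Event) Namespace() string { return "events" }
func (e *Event) EmbeddingString() string {
	return fmt.Sprintf("%s %s", e.Name, e.Description)
}

// Compile-time check that *Event satisfies Searchable.
var _ Searchable = (*Event)(nil)

// fakeClient is an in-memory stand-in for a Pinecone client. It stores the
// embedding string per namespace and id instead of a real vector.
type fakeClient struct {
	store map[string]map[string]string
}

func newFakeClient() *fakeClient {
	return &fakeClient{store: map[string]map[string]string{}}
}

// Upsert inserts the item, or overwrites it if the id already exists.
func (f *fakeClient) Upsert(item Searchable) error {
	ns := item.Namespace()
	if f.store[ns] == nil {
		f.store[ns] = map[string]string{}
	}
	f.store[ns][item.SearchId()] = item.EmbeddingString()
	return nil
}

// Delete removes the item; deleting a missing id is a no-op.
func (f *fakeClient) Delete(item Searchable) error {
	delete(f.store[item.Namespace()], item.SearchId())
	return nil
}

func main() {
	client := newFakeClient()
	e := &Event{ID: "evt-1", Name: "Spring Concert", Description: "A cappella showcase"}

	_ = client.Upsert(e) // on create
	e.Description = "Moved to Friday"
	_ = client.Upsert(e) // on update: same id, so the entry is overwritten
	fmt.Println(len(client.store["events"]))
	_ = client.Delete(e) // on delete
	fmt.Println(len(client.store["events"]))
}
```

Because both create and update go through `Upsert` with the same `SearchId()`, a service only needs to distinguish "entry still exists" from "entry was deleted".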
|
||
# How to search for fun and fame and profit | ||
|
||
TODO: (probably create a searchable object that just uses namespace and embeddingstring, pass to | ||
pineconeclient search) |
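One possible shape for the TODO above, offered only as a sketch: a small query type that fills in `Namespace()` and `EmbeddingString()` and leaves `SearchId()` empty. `Query`, `NewClubQuery`, `Searcher`, and `stubSearcher` are all hypothetical names, and plain `error` stands in for `*errors.Error`.

```go
package main

import "fmt"

// Searchable mirrors the interface in backend/search/searchable.go.
type Searchable interface {
	SearchId() string
	Namespace() string
	EmbeddingString() string
}

// Searcher is the Search half of PineconeClientInterface, simplified to
// return a plain error for this sketch.
type Searcher interface {
	Search(item Searchable, topK int) ([]string, error)
}

// Query is a hypothetical Searchable that only carries a namespace and the
// user's search text. It is never stored, so it has no id.
type Query struct {
	namespace string
	text      string
}

// NewClubQuery builds a query against the "clubs" namespace.
func NewClubQuery(text string) Query {
	return Query{namespace: "clubs", text: text}
}

func (q Query) SearchId() string        { return "" }
func (q Query) Namespace() string       { return q.namespace }
func (q Query) EmbeddingString() string { return q.text }

// stubSearcher returns canned ids per namespace so the example runs without
// a real vector database.
type stubSearcher struct {
	ids map[string][]string
}

func (s stubSearcher) Search(item Searchable, topK int) ([]string, error) {
	ids := s.ids[item.Namespace()]
	if topK < 0 {
		topK = 0
	}
	if topK < len(ids) {
		ids = ids[:topK]
	}
	return ids, nil
}

func main() {
	var client Searcher = stubSearcher{ids: map[string][]string{
		"clubs": {"club-a", "club-b", "club-c"},
	}}
	ids, err := client.Search(NewClubQuery("professional fraternities"), 2)
	fmt.Println(ids, err)
}
```

The returned ids would then be used to load the matching model entries from the relational database.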
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,73 @@ | ||
package search | ||
|
||
import ( | ||
"bytes" | ||
"encoding/json" | ||
"fmt" | ||
"github.com/GenerateNU/sac/backend/src/errors" | ||
"github.com/garrettladley/mattress" | ||
"net/http" | ||
"os" | ||
) | ||
|
||
type OpenAiClientInterface interface { | ||
CreateEmbedding(payload string) ([]float32, *errors.Error) | ||
} | ||
|
||
type OpenAiClient struct { | ||
apiKey *mattress.Secret[string] | ||
} | ||
|
||
func NewOpenAiClient() *OpenAiClient { | ||
apiKey, _ := mattress.NewSecret(os.Getenv("SAC_OPENAI_API_KEY")) | ||
Review comment: should be in config as well, I assume @garrettladley
||
|
||
return &OpenAiClient{apiKey: apiKey} | ||
} | ||
|
||
func (c *OpenAiClient) CreateEmbedding(payload string) ([]float32, *errors.Error) { | ||
apiKey := c.apiKey.Expose() | ||
|
||
embeddingBody, _ := json.Marshal(map[string]interface{}{ | ||
"input": payload, | ||
"model": "text-embedding-ada-002", | ||
Review comment: would prefer this to be a config option,
||
}) | ||
requestBody := bytes.NewBuffer(embeddingBody) | ||
|
||
req, err := http.NewRequest(http.MethodPost, "https://api.openai.com/v1/embeddings", requestBody) | ||
Review comment: the lines below are repeated throughout your code; I would prefer this to be generalized and exported into a utility function
||
if err != nil { | ||
return nil, &errors.FailedToCreateEmbedding | ||
} | ||
|
||
req.Header.Set("Authorization", fmt.Sprintf("Bearer %s", apiKey)) | ||
req.Header.Set("content-type", "application/json") | ||
|
||
resp, err := http.DefaultClient.Do(req) | ||
if err != nil { | ||
return nil, &errors.FailedToCreateEmbedding | ||
} | ||
|
||
defer resp.Body.Close() | ||
|
||
type ResponseBody struct { | ||
Data []struct { | ||
Embedding []float32 `json:"embedding"` | ||
} `json:"data"` | ||
} | ||
|
||
embeddingResultBody := ResponseBody{} | ||
err = json.NewDecoder(resp.Body).Decode(&embeddingResultBody) | ||
if err != nil { | ||
return nil, &errors.FailedToCreateEmbedding | ||
} | ||
|
||
if len(embeddingResultBody.Data) < 1 { | ||
return nil, &errors.FailedToCreateEmbedding | ||
} | ||
|
||
return embeddingResultBody.Data[0].Embedding, nil | ||
|
||
} |
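The reviewer's request for a shared utility function could look something like the sketch below. `PostJSON` is a hypothetical name and signature, not part of this PR; the canned `http.Client` exists only so the example runs without network access, and the non-2xx status check is an addition that the code in the diff does not perform.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"strings"
)

// PostJSON (hypothetical helper) marshals in as JSON, POSTs it to url with a
// bearer token, and decodes the JSON response body into out.
func PostJSON(client *http.Client, url, bearerToken string, in, out interface{}) error {
	body, err := json.Marshal(in)
	if err != nil {
		return fmt.Errorf("marshal request: %w", err)
	}
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return fmt.Errorf("build request: %w", err)
	}
	req.Header.Set("Authorization", "Bearer "+bearerToken)
	req.Header.Set("Content-Type", "application/json")

	resp, err := client.Do(req)
	if err != nil {
		return fmt.Errorf("send request: %w", err)
	}
	defer resp.Body.Close()

	// Assumption: any non-2xx status is treated as a failure.
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	return json.NewDecoder(resp.Body).Decode(out)
}

// embeddingResponse mirrors the ResponseBody struct in the diff.
type embeddingResponse struct {
	Data []struct {
		Embedding []float32 `json:"embedding"`
	} `json:"data"`
}

// roundTripFunc lets a plain function act as an http.RoundTripper.
type roundTripFunc func(*http.Request) (*http.Response, error)

func (f roundTripFunc) RoundTrip(r *http.Request) (*http.Response, error) { return f(r) }

// cannedClient returns a client that always answers with the given status
// and body, without touching the network.
func cannedClient(status int, body string) *http.Client {
	return &http.Client{Transport: roundTripFunc(func(r *http.Request) (*http.Response, error) {
		return &http.Response{
			StatusCode: status,
			Header:     make(http.Header),
			Body:       io.NopCloser(strings.NewReader(body)),
			Request:    r,
		}, nil
	})}
}

func main() {
	client := cannedClient(http.StatusOK, `{"data":[{"embedding":[0.5,0.25]}]}`)
	var out embeddingResponse
	err := PostJSON(client, "https://example.invalid/v1/embeddings", "test-key",
		map[string]interface{}{"input": "hello", "model": "text-embedding-ada-002"}, &out)
	fmt.Println(len(out.Data), err)
}
```

Taking the `*http.Client` as a parameter keeps the helper testable, and callers such as `CreateEmbedding` would map any returned error to their own `*errors.Error` value.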
Review comment: GetPermissions (and permissions in general) is not a model, and possibly not a type; it can be put in the auth folder instead.