Adding support for Threshold, Limit, and Order Arguments #12

Open
wants to merge 2 commits into
base: master

Conversation

sebscholl

This pull request adds three keyword arguments to the nearest_neighbors method. They are:

order

movie = Movie.find_by(name: "Star Wars (1977)")
# Order all results by the neighbor_distance column in descending order
movie.nearest_neighbors(:factors, distance: "inner_product", order: { neighbor_distance: :desc })

limit

movie = Movie.find_by(name: "Star Wars (1977)")
# Limit the results to 3 records
movie.nearest_neighbors(:factors, distance: "inner_product", limit: 3)

threshold

movie = Movie.find_by(name: "Star Wars (1977)")
# Only return records where the neighbor_distance is greater than or equal to 0.9
movie.nearest_neighbors(:factors, distance: "inner_product", threshold: { gte: 0.9 })

Multiple Options

All options can be used at the same time or separately.

movie = Movie.find_by(name: "Star Wars (1977)")

# Only return 5 records where the neighbor_distance is greater than or equal to 0.9 in descending order
movie.nearest_neighbors(
  :factors,
  distance: "inner_product", 
  limit: 5,
  threshold: { gte: 0.9 },
  order: { neighbor_distance: :desc }
)

These options modify the SQL statement generated by ActiveRecord. All original test suites are intact and passing, and new tests were written for the new options.

Sebastian Scholl added 2 commits September 2, 2023 11:44
…_neighbor method/scope. All test cases are passing and new ones added for new options.
@ankane
Owner

ankane commented Sep 24, 2023

Hi @sebscholl, thanks for the PR.

  1. nearest_neighbors currently returns a relation, so you can limit with limit(n) or first(n).
  2. Results are currently ordered by distance. However, if you have a default scope on the model, that'll take precedence.
  3. For thresholds, you can use where("(embedding <#> ?) * -1 > ?", vector, 0.9) or filter in memory with select { |v| v.neighbor_distance > 0.9 }. I may add an option for this at some point, but want to think more about the design.
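The in-memory variant from point 3 can be sketched in plain Ruby. This is only an illustration: a Struct stands in for the model records, the helper name filter_neighbors is hypothetical, and the distances are made-up values. It assumes each record responds to neighbor_distance, as in the select example above.

```ruby
# Hypothetical helper (not part of the neighbor gem): keep records whose
# neighbor_distance exceeds min_distance, then take at most `limit` of them.
# The input order is preserved; nothing is re-sorted here.
def filter_neighbors(records, min_distance:, limit:)
  records
    .select { |r| r.neighbor_distance > min_distance }
    .first(limit)
end

# Stand-in for ActiveRecord results; the values are invented for illustration.
movie = Struct.new(:name, :neighbor_distance)
results = [
  movie.new("A", 0.97),
  movie.new("B", 0.93),
  movie.new("C", 0.42)
]

top = filter_neighbors(results, min_distance: 0.9, limit: 1)
puts top.map(&:name).inspect
```

Filtering in memory loads every candidate row first, so the where(...) form is likely the better fit when the table is large.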

@sebscholl
Author

Makes sense. Do you think it would be helpful to add this info to the docs (e.g., where("(embedding <#> ?) * -1 > ?", vector, 0.9)), or would you prefer to sit tight until you have more clarity on the design? Lmk, and I can make an update if it would help.

@gvkhna

gvkhna commented Oct 22, 2023

@ankane I believe that when using the class method, like Movie.nearest_neighbors(embedding, my_gen_embedding, ...), the ordering is not set by distance. Instead I'm getting ORDER BY "text_nodes"."id" ASC LIMIT $1 on these queries. I'm encountering this exact problem, so I'll implement the query manually as a workaround for now.

P.S. I set Movie.unscoped {} but am still getting ORDER BY ID. AFAIK there is no way to order by distance with the gem.

P.P.S. I set .order(Arel.sql("neighbor_distance DESC")), but it wasn't applied to the query, which is still ordered by ID.

@vestedpr-dev

vestedpr-dev commented Apr 22, 2024

Regarding thresholds, if others are working on this, here is some relevant code I came up with:

def filter_by_within_distance(scope)
  return scope unless @params[:within_distance] && @params[:distance_type]

  distance_type = @params[:distance_type].to_sym

  # Determine the pgvector distance operator for the requested distance type
  operator = case distance_type
             when :euclidean
               "<->"
             when :cosine
               "<=>"
             when :inner_product
               "<#>"
             else
               raise ArgumentError, "Unsupported distance type: #{@params[:distance_type]}"
             end

  # NOTE: the column name is interpolated into the SQL string, so
  # @params[:search_vector_column] must come from a trusted source.
  condition_pattern = if distance_type == :inner_product
                        # <#> returns the negative inner product, so multiply by -1
                        "((#{@params[:search_vector_column]} #{operator} '[?]') * -1) < ?"
                      else
                        "(#{@params[:search_vector_column]} #{operator} '[?]') < ?"
                      end

  scope.where(condition_pattern, @query_vector, @params[:within_distance])
end

Some nuances to take note of:

  • When used within a Rails order clause, I needed to wrap the vector in single quotes and the square bracket to end up with valid SQL: '[?]'
  • This is elementary but not obvious at first to those new to pgvector: whichever distance type you use for the ordering, you need to use the same distance operator (<->, <=>, <#>) in your threshold condition.
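One way to avoid both the '[?]' quoting and interpolating a raw column name is to build the condition in a small helper. The sketch below is an assumption-laden illustration, not code from the gem: distance_condition, distance_operators, and allowed_columns are hypothetical names, and passing the result to where (with the vector sent as one text literal and cast via ?::vector) has not been verified against a database here.

```ruby
# Hypothetical helper: maps a distance type to its pgvector operator.
def distance_operators
  { euclidean: "<->", cosine: "<=>", inner_product: "<#>" }
end

# Hypothetical helper: returns [sql, vector_literal, max_distance], intended
# to be splatted into ActiveRecord's where(sql, *binds).
def distance_condition(column:, distance_type:, vector:, max_distance:, allowed_columns:)
  operator = distance_operators.fetch(distance_type.to_sym) do
    raise ArgumentError, "Unsupported distance type: #{distance_type}"
  end

  # Only interpolate column names from an explicit allow-list.
  unless allowed_columns.include?(column.to_s)
    raise ArgumentError, "Unknown vector column: #{column}"
  end

  # Serialize the vector as a single pgvector text literal, e.g. "[1.0,2.0,3.0]".
  literal = "[#{vector.map { |x| Float(x) }.join(',')}]"

  # For :inner_product the compared value is the negative inner product,
  # because that is what the <#> operator returns.
  ["(#{column} #{operator} ?::vector) < ?", literal, Float(max_distance)]
end

args = distance_condition(
  column: "embedding",
  distance_type: "cosine",
  vector: [1, 2, 3],
  max_distance: 0.5,
  allowed_columns: ["embedding"]
)
puts args.inspect
```

Under those assumptions, the call site would look like scope.where(*args), with the same operator reused in the ordering.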
