Testing current Score in our Schema #79

Closed
6 tasks
JolanThomassin opened this issue Mar 11, 2024 · 3 comments · May be fixed by #84

Comments

@JolanThomassin
Contributor

JolanThomassin commented Mar 11, 2024

Test the scores in our schema to verify that they are present and accurate.

Schema tested: Louis_v005

  • Pick a random list of chunks/crawls

Test our score weights

  • Similarity
  • Recency
  • Traffic
  • Current
  • Typicality

Evaluation Criteria

Similarity: ?
Recency: Compare dates and scores to ensure alignment.
Traffic: ?
Current: Confirm that archived documents receive a score of 0.
Typicality: Compare the average number of site references within the dataset and verify if documents with more references receive higher scores.
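
The Recency, Current, and Typicality checks above could be sketched roughly as follows. This is only an illustration over an in-memory sample: the field names (`last_updated`, `archived`, `references`, and the per-score keys) are hypothetical and are not taken from the Louis_v005 schema.

```python
from datetime import date
from itertools import combinations
from statistics import mean

def check_current(docs):
    """Return ids of archived documents whose 'current' score is not 0."""
    return [d["id"] for d in docs if d["archived"] and d["current"] != 0]

def check_recency(docs):
    """Return (older_id, newer_id) pairs where the older document has the higher recency score."""
    ordered = sorted(docs, key=lambda d: d["last_updated"])
    return [
        (older["id"], newer["id"])
        for older, newer in combinations(ordered, 2)
        if older["last_updated"] < newer["last_updated"]
        and older["recency"] > newer["recency"]
    ]

def check_typicality(docs):
    """True if documents with more references than average also score higher on average."""
    avg_refs = mean(d["references"] for d in docs)
    above = [d["typicality"] for d in docs if d["references"] > avg_refs]
    below = [d["typicality"] for d in docs if d["references"] <= avg_refs]
    if not above or not below:
        return True  # nothing to compare in this sample
    return mean(above) > mean(below)

# Hypothetical sample rows; real rows would come from the random chunk/crawl list.
sample = [
    {"id": "a", "last_updated": date(2020, 1, 1), "archived": True,
     "references": 1, "recency": 0.1, "current": 0.0, "typicality": 0.2},
    {"id": "b", "last_updated": date(2022, 1, 1), "archived": False,
     "references": 5, "recency": 0.5, "current": 1.0, "typicality": 0.7},
    {"id": "c", "last_updated": date(2024, 1, 1), "archived": False,
     "references": 9, "recency": 0.9, "current": 1.0, "typicality": 0.9},
]

current_failures = check_current(sample)
recency_failures = check_recency(sample)
typicality_ok = check_typicality(sample)
```

An empty list from `check_current` and `check_recency` means no violation was found in the sample; any returned ids point at rows worth inspecting by hand.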

@JolanThomassin JolanThomassin moved this to In Progress in Database Mar 11, 2024
@JolanThomassin JolanThomassin self-assigned this Mar 11, 2024
@JolanThomassin JolanThomassin added this to the louis_v005 milestone Mar 11, 2024
@JolanThomassin
Contributor Author

For now, when I check Similarity and Traffic, I'm having trouble understanding how they have been implemented in the schema and, therefore, how to evaluate them. Do you have a solution, @rngadam?

Similarity: Represents the primary signal in our scoring mechanism. It denotes the measurement of how closely a document aligns with the user's query, reflecting the relevance of the document to the search criteria. It is not a static precomputed score but a dynamic metric that is computed on the fly from the search results. This ensures that the relevance is tied to the specific query, going beyond simple keyword matching. Documents with higher similarity scores are considered more relevant. By prioritizing similarity as the first signal in our scoring process, we aim to deliver search results that are more accurate.

Traffic: The frequency with which users consult a document influences its score. Popular or frequently accessed documents are given higher scores based on web traffic logs, indicating their relevance and importance to users. Warning: The home page is rated very high, since it is where every user lands first.
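
To make the home page warning concrete, here is a minimal sketch of one way a per-page traffic score could be derived from request paths. It is an assumption for illustration only and does not reproduce the scoring actually used in the schema; the function name `traffic_scores` and the optional log scaling are hypothetical.

```python
import math
from collections import Counter

def traffic_scores(requested_paths, log_scale=False):
    """Map each path to a score in (0, 1], relative to the most visited path.

    With log_scale=True, hit counts are passed through log1p first, which
    dampens the lead of very popular pages such as the home page.
    """
    hits = Counter(requested_paths)
    if not hits:
        return {}
    weight = math.log1p if log_scale else float
    top = max(weight(n) for n in hits.values())
    return {path: weight(n) / top for path, n in hits.items()}

# Hypothetical log extract: the home page gets 8 hits, another page gets 2.
paths = ["/"] * 8 + ["/a"] * 2
linear = traffic_scores(paths)
damped = traffic_scores(paths, log_scale=True)
```

With linear normalization the home page sets the scale for every other page, so comparing both variants on a sample can show how much it skews the ranking.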

@JolanThomassin JolanThomassin linked a pull request Mar 14, 2024 that will close this issue
6 tasks
@JolanThomassin JolanThomassin moved this to In Progress in Finesse Mar 14, 2024
@rngadam
Contributor

rngadam commented Mar 21, 2024

Similarity is dynamic; by definition, it cannot be precomputed. We use 1 - cosine distance between the chunk vector and the query vector for this:

select s.crawl_id as id, s.chunk_id, 'similarity'::score_type as score_type, s.score as score
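
For reference while testing, the "1 - cosine distance" value can be recomputed outside the database. The sketch below is plain Python over lists of floats; the function names are hypothetical and it assumes the chunk and query vectors have the same length.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def similarity_score(chunk_vector, query_vector):
    """Similarity score defined as 1 - cosine distance."""
    cosine_distance = 1.0 - cosine_similarity(chunk_vector, query_vector)
    return 1.0 - cosine_distance

same_direction = similarity_score([1.0, 0.0], [2.0, 0.0])
orthogonal = similarity_score([1.0, 0.0], [0.0, 3.0])
opposite = similarity_score([1.0, 0.0], [-1.0, 0.0])
```

Vectors pointing the same way score close to 1, orthogonal vectors close to 0, and opposite vectors close to -1, which gives a quick sanity check for scores returned by a search.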

Traffic comes from the web server logs. I have already imported them and computed a score per page in the past:

https://github.com/ai-cfia/ailab-db/blob/main/sql/compute-traffic-score.sql

@JolanThomassin
Copy link
Contributor Author

After testing, the scores seem to be accurate.

@github-project-automation github-project-automation bot moved this from In Progress to Done in Finesse Apr 25, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in Database Apr 25, 2024