Merge pull request #64 from ai-cfia/61-explain-scores-and-weights
Adding a md file explaining scores and weights
melanie-fressard authored Dec 11, 2023
2 parents 8607c58 + 40f270e commit 7db06b4
Showing 2 changed files with 98 additions and 6 deletions.
96 changes: 96 additions & 0 deletions search-scores-and-weights.md
@@ -0,0 +1,96 @@
# Scores and weights for search function

At the re-ranking stage, our search system uses scoring to assign values to each
document based on different parameters. The interplay between clustering and
scoring helps optimize the search process, ensuring that the system considers
both the content and context to deliver more accurate and relevant results for
the user.

Delivering relevant results becomes challenging when dealing with a large number
of documents, creating the need for optimization strategies. One approach is an
indexing method called clustering, which categorizes documents based on their
topics, making the search process more streamlined and efficient. We also use
scoring, which assigns numerical values and weights to documents based on
various parameters, influencing their ranking in the search results. Here are
our different scores, each serving a unique purpose:

1. [**Similarity**](sql/2023-07-19-modify-score_type-add-similarity.sql):
Represents the primary signal in our scoring mechanism. It denotes the
measurement of how closely a document aligns with the user's query,
reflecting the relevance of the document to the search criteria. It is not a
static precomputed score but a dynamic metric that is computed on the fly
from the search results. This ensures that the relevance is tied to the
specific query, going beyond simple keyword matching. Documents with higher
similarity scores are considered more relevant. By prioritizing similarity as
the first signal in our scoring process, we aim to deliver search results
that are more accurate.
- Scaling: FROM 0.0 = least similar to user query TO 1.0 = most similar to
user query

1. [**Recency**](sql/schema2.sql.public_new.sql): This score considers the
temporal aspect of documents, prioritizing recently added or updated content.
A document's recency is crucial in reflecting the latest information
available to users.
- Scaling: FROM 0.0 = oldest document TO 1.0 = most recent document

1. [**Traffic**](sql/compute-traffic-score.sql): The frequency with which users
consult a document influences its score. Popular or frequently accessed
documents are given higher scores with the help of web traffic logs,
indicating their relevance and importance to users. Warning: the home page is
rated very high because it is where every user lands first.
- Scaling: FROM 0.0 = least consulted document TO 1.0 = most consulted
document

1. [**Current**](sql/2023-07-12-score-current.sql): This score determines
whether a document is currently accessible or if it has been archived. It
helps users distinguish between active and inactive content.
- Scaling: 0.0 = currently accessible document **OR** 1.0 = archived
document

1. [**Typicality**](sql/2023-07-12-calculate-incoming-outgoing-counts.sql): This
score evaluates how closely the number of site references for a document
aligns with the average. A document's typicality score reflects how well its
number of references corresponds to that average. This ensures that the
search results take into account how well documents conform to the
typical reference patterns within the targeted theme.
- Scaling: FROM 0.0 = least referenced document TO 1.0 = most referenced
document

1. **Didactic**: This score evaluates the informational value within a content
chunk. It is higher when the information provided is of good quality and easy
to read. Documents with high didactic scores often contain rich textual
information, explanations, and details. The implementation looks for signs of
the opposite to compute the score, for example the presence of a large
proportion of tabular data, which indicates data dumps from spreadsheets or
databases.
- Scaling: FROM 0.0 = mostly tabular data or information that is not
expected to be read sequentially by a user TO 1.0 = contains rich
textual information, explanations, and details

1. **Guidance**: This score pertains to content chunks extracted from
guidance-oriented pages, emphasizing their significance and relevance.
Guidance pages typically offer comprehensive direction, instruction, or
expert advice within a specific domain. As these pages tend to provide
crucial information or instructions sought by users, they are given priority
to ensure users can readily access the most helpful and directive content.
- Scaling: FROM 0.0 = doesn't include crucial information or instructions
TO 1.0 = includes crucial information or instructions
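
The relative scales listed above (for example recency and traffic) map raw
measurements onto the 0.0 to 1.0 range. The sketch below shows one way this
could be done, using min-max normalization in Python. It is an assumption made
for illustration, not the project's SQL implementation: `min_max_scale`, the
timestamps, and the pageview counts are all hypothetical.

```python
from datetime import datetime, timezone


def min_max_scale(values):
    """Scale raw values linearly into [0.0, 1.0] (hypothetical helper)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal, so there is no spread to scale.
        # Returning 0.0 for every document is an arbitrary choice.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical raw data for three documents.
last_updated = [
    datetime(2021, 1, 1, tzinfo=timezone.utc),
    datetime(2022, 1, 1, tzinfo=timezone.utc),
    datetime(2023, 1, 1, tzinfo=timezone.utc),
]
pageviews = [10, 110, 60]

# Oldest document gets 0.0, most recent gets 1.0.
recency = min_max_scale([d.timestamp() for d in last_updated])
# Least consulted document gets 0.0, most consulted gets 1.0.
traffic = min_max_scale(pageviews)

print(recency)  # [0.0, 0.5, 1.0]
print(traffic)  # [0.0, 1.0, 0.5]
```

Because the result depends on the minimum and maximum over all documents, a
score computed this way has to be refreshed when the document set changes.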

By incorporating these scoring parameters, we fine-tune the document retrieval
process to align with user needs. It allows us to prioritize documents that are
not only recent, popular, and representative but also closely related to the
user's specific search criteria. This multi-faceted approach enhances the
efficiency and effectiveness of our document retrieval system, ensuring a more
tailored and user-friendly experience.
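
This document does not state how the individual scores and their weights are
combined into a final ranking. As a minimal sketch, assuming a weighted average
of scores that each lie in the 0.0 to 1.0 range, re-ranking could look like the
following. The function `combined_score`, the weight values, and the document
identifiers are hypothetical.

```python
def combined_score(scores, weights):
    """Weighted average of per-signal scores, each expected in [0.0, 1.0]."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    # A signal missing from `scores` contributes 0.0 to the total.
    weighted = sum(w * scores.get(name, 0.0) for name, w in weights.items())
    return weighted / total_weight


# Hypothetical weights: similarity carries the most weight as the primary signal.
weights = {"similarity": 0.5, "recency": 0.25, "traffic": 0.25}

# Hypothetical per-document scores.
documents = {
    "doc-a": {"similarity": 0.5, "recency": 1.0, "traffic": 1.0},
    "doc-b": {"similarity": 1.0, "recency": 0.5, "traffic": 0.0},
}

# Re-rank: highest combined score first.
ranked = sorted(
    documents,
    key=lambda doc_id: combined_score(documents[doc_id], weights),
    reverse=True,
)

print(ranked)  # ['doc-a', 'doc-b']
```

With these weights, `doc-a` (0.75) outranks `doc-b` (0.625) even though `doc-b`
is more similar to the query, which shows how the choice of weights shapes the
final order.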

## Future

In addition to our current considerations, we can explore the integration of
thematic context into our scoring system. Thematic context involves a specific
focus on the subject or theme related to the user's query, ensuring that the
context is taken into account during the initial score calculation. To implement
this, we would need to add topic labels to documents, a feature our system does
not yet have. Planning for such additional scores allows us to
enhance the depth and relevance of our responses by considering the specific
themes associated with user queries.
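
As a purely illustrative sketch of what such a thematic score might look like,
assuming topic labels were available for both the query and the document, the
overlap between the two label sets could be measured with the Jaccard index.
This is a substitute technique chosen for the example, not a planned design:
`theme_score` and the labels are hypothetical.

```python
def theme_score(query_topics, document_topics):
    """Jaccard overlap between two sets of topic labels, in [0.0, 1.0]."""
    query_set, doc_set = set(query_topics), set(document_topics)
    union = query_set | doc_set
    if not union:
        # No labels on either side: nothing to compare.
        return 0.0
    return len(query_set & doc_set) / len(union)


# Hypothetical labels: 2 shared topics out of 4 distinct topics overall.
score = theme_score(
    {"labelling", "import"},
    {"labelling", "import", "export", "fees"},
)
print(score)  # 0.5
```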
8 changes: 2 additions & 6 deletions sql/2023-07-24-comment-score_type.sql
@@ -1,9 +1,5 @@
 comment on type score_type is
 $COMMENT$
 score_type defines type of measurements that are taken into account.
-
-current is relative to if the document is marked as archive (0) or not (1)
-traffic is number of pageviews from the web logs (relative to all other documents traffic)
-recency is how recent the document was created or updated (relative to all other documents)
-typicality is how close the number of site references for this document is close to the average
-$COMMENT$;
+For more details about each scores, please read https://github.com/ai-cfia/ailab-db/blob/main/search-scores-and-weights.md
+$COMMENT$;
