Merge pull request #64 from ai-cfia/61-explain-scores-and-weights
Adding a md file explaining scores and weights
Showing 2 changed files with 98 additions and 6 deletions.
@@ -0,0 +1,96 @@
# Scores and weights for search function

At the re-ranking stage, our search system uses scoring to assign values to each
document based on different parameters. The interplay between clustering and
scoring helps optimize the search process, ensuring that the system considers
both the content and context to deliver more accurate and relevant results for
the user.

This task becomes challenging when dealing with a large number of documents,
creating the need for optimization strategies. One approach is an indexing
method called clustering, which categorizes documents by topic and makes the
search process more streamlined and efficient. We also use scoring, which
assigns numerical values and weights to documents based on various parameters
and so influences their ranking in the search results. Here are our different
scores, each serving a unique purpose:

1. [**Similarity**](sql/2023-07-19-modify-score_type-add-similarity.sql):
   Represents the primary signal in our scoring mechanism. It measures how
   closely a document aligns with the user's query, reflecting the relevance of
   the document to the search criteria. It is not a static precomputed score
   but a dynamic metric that is computed on the fly from the search results.
   This ensures that the relevance is tied to the specific query, going beyond
   simple keyword matching. Documents with higher similarity scores are
   considered more relevant. By prioritizing similarity as the first signal in
   our scoring process, we aim to deliver search results that are more accurate.
   - Scaling: FROM 0.0 = least similar to user query TO 1.0 = most similar to
     user query

1. [**Recency**](sql/schema2.sql.public_new.sql): This score considers the
   temporal aspect of documents, prioritizing recently added or updated content.
   A document's recency is crucial in reflecting the latest information
   available to users.
   - Scaling: FROM 0.0 = oldest document TO 1.0 = most recent document

1. [**Traffic**](sql/compute-traffic-score.sql): The frequency with which users
   consult a document influences its score. Popular or frequently accessed
   documents are given higher scores with the help of web traffic logs,
   indicating their relevance and importance to users. Warning: The home page
   is rated very high since it is where every user lands first.
   - Scaling: FROM 0.0 = least consulted document TO 1.0 = most consulted
     document

1. [**Current**](sql/2023-07-12-score-current.sql): This score determines
   whether a document is currently accessible or if it has been archived. It
   helps users distinguish between active and inactive content.
   - Scaling: 0.0 = currently accessible document **OR** 1.0 = archived
     document

1. [**Typicality**](sql/2023-07-12-calculate-incoming-outgoing-counts.sql): This
   score evaluates how closely the number of site references for a document
   aligns with the average. A document's typicality score reflects its level
   of correspondence with the average number of references. This ensures that
   the search results take into account how well documents conform to the
   typical reference patterns within the targeted theme.
   - Scaling: FROM 0.0 = least referenced document TO 1.0 = most referenced
     document

1. **Didactic**: This score evaluates the informational value within a content
   chunk. It is higher when the information provided is of better quality and
   more readable. Documents with high didactic scores often contain rich
   textual information, explanations, and details. The implementation looks
   for signs of the opposite to compute the score: for example, the presence
   of a large proportion of tabular data, which indicates data dumps from
   spreadsheets or databases.
   - Scaling: FROM 0.0 = mostly tabular data or information that is not
     expected to be read sequentially by a user TO 1.0 = contains rich
     textual information, explanations, and details

1. **Guidance**: This score pertains to content chunks extracted from
   guidance-oriented pages, emphasizing their significance and relevance.
   Guidance pages typically offer comprehensive direction, instruction, or
   expert advice within a specific domain. As these pages tend to provide
   crucial information or instructions sought by users, they are given priority
   to ensure users can readily access the most helpful and directive content.
   - Scaling: FROM 0.0 = doesn't include crucial information or instructions
     TO 1.0 = includes crucial information or instructions

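Several of the scores above are scaled between 0.0 and 1.0 relative to all other documents. The following is a minimal sketch of one way to do that, assuming min-max scaling; the actual computation lives in the linked SQL files and may differ. The function name `min_max_scale` and the pageview counts are hypothetical and used only for illustration.

```python
from typing import Dict


def min_max_scale(raw: Dict[str, float]) -> Dict[str, float]:
    """Rescale raw per-document values to the range 0.0 to 1.0.

    Assumption: min-max scaling. The lowest raw value maps to 0.0 and the
    highest maps to 1.0, matching the "least ... TO most ..." scaling notes.
    """
    if not raw:
        return {}
    lowest = min(raw.values())
    highest = max(raw.values())
    span = highest - lowest
    if span == 0:
        # All documents are tied; returning 0.0 for each is an arbitrary choice.
        return {doc: 0.0 for doc in raw}
    return {doc: (value - lowest) / span for doc, value in raw.items()}


# Hypothetical pageview counts, not real traffic data.
pageviews = {"/home": 9000, "/guide": 900, "/faq": 450, "/archive": 0}
traffic_scores = min_max_scale(pageviews)
```

With this kind of scaling, a single very popular page such as the home page takes the value 1.0 and compresses every other document toward 0.0, which is one way the warning in the Traffic entry can show up in practice.
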
By incorporating these scoring parameters, we fine-tune the document retrieval
process to align with user needs. This allows us to prioritize documents that
are not only recent, popular, and representative but also closely related to
the user's specific search criteria. This multi-faceted approach enhances the
efficiency and effectiveness of our document retrieval system, ensuring a more
tailored and user-friendly experience.

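To make the role of weights concrete, here is a minimal sketch of how per-document scores could be combined at the re-ranking stage. It assumes a weighted average, which this document does not confirm; the weight values, document names, and the functions `combined_score` and `rerank` are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical weights, chosen only for illustration.
WEIGHTS: Dict[str, float] = {"similarity": 0.6, "recency": 0.2, "traffic": 0.2}


def combined_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of the named scores; a missing score counts as 0.0."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        raise ValueError("at least one weight must be non-zero")
    weighted = sum(w * scores.get(name, 0.0) for name, w in weights.items())
    return weighted / total_weight


def rerank(
    candidates: Dict[str, Dict[str, float]], weights: Dict[str, float]
) -> List[Tuple[str, float]]:
    """Return (document, combined score) pairs, highest score first."""
    ranked = [(doc, combined_score(s, weights)) for doc, s in candidates.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical candidates, each with scores already scaled to 0.0-1.0.
candidates = {
    "doc-a": {"similarity": 0.9, "recency": 0.2, "traffic": 0.1},
    "doc-b": {"similarity": 0.6, "recency": 0.9, "traffic": 0.8},
}
ranking = rerank(candidates, WEIGHTS)
```

In this example `doc-b` ranks first even though `doc-a` is more similar to the query, because its recency and traffic scores outweigh the similarity gap under the chosen weights. Raising the similarity weight would reverse that order.
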
## Future

In addition to our current considerations, we can explore the integration of
thematic context into our scoring system. Thematic context involves a specific
focus on the subject or theme related to the user's query, ensuring that the
context is taken into account during the initial score calculation. To implement
this, we would need to add topic labels to documents, a feature our system does
not yet have. Planning for such additional scores allows us to enhance the depth
and relevance of our responses by considering the specific themes associated
with user queries.
@@ -1,9 +1,5 @@
 comment on type score_type is
 $COMMENT$
 score_type defines type of measurements that are taken into account.
-
-current is relative to if the document is marked as archive (0) or not (1)
-traffic is number of pageviews from the web logs (relative to all other documents traffic)
-recency is how recent the document was created or updated (relative to all other documents)
-typicality is how close the number of site references for this document is close to the average
-$COMMENT$;
+For more details about each score, please read https://github.com/ai-cfia/ailab-db/blob/main/search-scores-and-weights.md
+$COMMENT$;