Merge pull request #64 from ai-cfia/61-explain-scores-and-weights

Adding a md file explaining scores and weights
ai-cfia · Dec 11, 2023 · 7db06b4 · 7db06b4
2 parents 8607c58 + 40f270e
commit 7db06b4
Show file tree

Hide file tree

Showing 2 changed files with 98 additions and 6 deletions.
diff --git a/search-scores-and-weights.md b/search-scores-and-weights.md
@@ -0,0 +1,96 @@
+# Scores and weights for search function
+
+At the re-ranking stage, our search system uses scoring to assign values to each
+document based on different parameters. The interplay between clustering and
+scoring helps optimize the search process, ensuring that the system considers
+both the content and context to deliver more accurate and relevant results for
+the user. 
+
+This task becomes challenging when dealing with a large number of documents,
+creating the need for optimization strategies. A good approach is the
+utilization of an indexing method called clustering, which categorize documents
+based on their topics, facilitating a more streamlined and efficient search
+process. We also use scoring, which involves the assignment of numerical values
+and weights to documents based on various parameters, influencing their ranking
+in the search results. Here is our different scores, each serving a unique
+purpose:
+
+1. [**Similarity**](sql/2023-07-19-modify-score_type-add-similarity.sql):
+   Represents the primary signal in our scoring mechanism. It denotes the
+   measurement of how closely a document aligns with the user's query,
+   reflecting the relevance of the document to the search criteria. It is not a
+   static precomputed score but a dynamic metric that is computed on the fly
+   from the search results. This ensures that the relevance is tied to the
+   specific query, going beyond simple keyword matching. Documents with higher
+   similarity scores are considered more relevant. By prioritizing similarity as
+   the first signal in our scoring process, we aim to deliver search results
+   that are more accurate.
+      - Scaling: FROM 0.0 = least similar to user query TO 1.0 = most similar to
+        user query
+
+1. [**Recency**](sql/schema2.sql.public_new.sql): This score considers the
+   temporal aspect of documents, prioritizing recently added or updated content.
+   A document's recency is crucial in reflecting the latest information
+   available to users. 
+      - Scaling: FROM 0.0 = oldest document TO 1.0 = most recent document
+
+1. [**Traffic**](sql/compute-traffic-score.sql): The frequency with which users
+   consult a document influences its score. Popular or frequently accessed
+   documents are given higher scores with the help of web traffic logs,
+   indicating their relevance and importance to users. Warning: The home page is
+   rated really high since it's where every user land at first.
+      - Scaling: FROM 0.0 = least consulted document TO 1.0 = most consulted
+        document
+
+1. [**Current**](sql/2023-07-12-score-current.sql): This score determines
+   whether a document is currently accessible or if it has been archived. It
+   helps users distinguish between active and inactive content.
+      - Scaling: 0.0 = currently accessible document **OR** 1.0 = archived
+        document
+
+1. [**Typicality**](sql/2023-07-12-calculate-incoming-outgoing-counts.sql): This
+   score evaluates how closely the number of site references for a document
+   aligns with the average. Documents with typicality scores reflect a level of
+   correspondence with the average number of references. This ensures that the
+   search results prioritize documents considering how well they conform to the
+   typical reference patterns within the targeted theme.
+      - Scaling: FROM 0.0 = least referenced document TO 1.0 = most referenced
+        document
+
+1. **Didactic**: This score evaluates the informational value within content
+   chunk. It scores higher based the quality and readability of information
+   provided. Documents with high didactic scores often contain rich textual
+   information, explanations, and details. Implementation of the score looks for
+   signs of the opposite to compute its score - for example, the presence of a
+   large proportion of tabular data which indicate data dumps from spreadsheets
+   or databases.
+      - Scaling: FROM 0.0 = mostly tabular data or information that is not
+        expected to be read sequentially by a user TO 1.0 = contains rich
+        textual information, explanations, and details
+
+1. **Guidance**: This score pertains to content chunks extracted from
+   guidance-oriented pages, emphasizing their significance and relevance.
+   Guidance pages typically offer comprehensive direction, instruction, or
+   expert advice within a specific domain. As these pages tend to provide
+   crucial information or instructions sought by users, they are given priority
+   to ensure users can readily access the most helpful and directive content.
+      - FROM Scaling: 0.0 = doesn't include crucial information or instructions
+        TO 1.0 = includes crucial information or instructions
+
+By incorporating these scoring parameters, we fine-tune the document retrieval
+process to align with user needs. It allows us to prioritize documents that are
+not only recent, popular, and representative but also closely related to the
+user's specific search criteria. This multi-faceted approach enhances the
+efficiency and effectiveness of our document retrieval system, ensuring a more
+tailored and user-friendly experience.
+
+## Future
+
+In addition to our current considerations, we can explore the integration of
+thematic context into our scoring system. Thematic context involves a specific
+focus on the subject or theme related to the user's query, ensuring that the
+context is taken into account during the initial score calculation. To implement
+this, we would need to incorporate topic labels for documents, a feature not yet
+incorporated in our system. Planning for such additional scores allows us to
+enhance the depth and relevance of our responses by considering the specific
+themes associated with user queries.
diff --git a/sql/2023-07-24-comment-score_type.sql b/sql/2023-07-24-comment-score_type.sql
@@ -1,9 +1,5 @@
 comment on type score_type is 
 $COMMENT$
 score_type defines type of measurements that are taken into account.
-
-current is relative to if the document is marked as archive (0) or not (1)
-traffic is number of pageviews from the web logs (relative to all other documents traffic)
-recency is how recent the document was created or updated (relative to all other documents)
-typicality is how close the number of site references for this document is close to the average
-$COMMENT$;
+For more details about each scores, please read https://github.com/ai-cfia/ailab-db/blob/main/search-scores-and-weights.md
+$COMMENT$;