From 62ac36dd7ff02eaf233989181c36d158341037e3 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?M=C3=A9lanie=20F?= <73828657+melanie-fressard@users.noreply.github.com> Date: Tue, 28 Nov 2023 20:29:29 +0000 Subject: [PATCH 01/10] issue #61 - add md file --- search-scores-and-weights.md | 40 ++++++++++++++++++++++++++++++++++++ 1 file changed, 40 insertions(+) create mode 100644 search-scores-and-weights.md diff --git a/search-scores-and-weights.md b/search-scores-and-weights.md new file mode 100644 index 0000000..9eb7447 --- /dev/null +++ b/search-scores-and-weights.md @@ -0,0 +1,40 @@ +Our search system uses scoring to assign values to each document based on +different parameters. The interplay between clustering and scoring helps +optimize the search process, ensuring that the system considers both the content +and context to deliver more accurate and relevant results for the user. + +This task becomes challenging when dealing with a large number of documents, +creating the need for optimization strategies. A good approach is the +utilization of clustering techniques, which categorize documents based on their +topics, facilitating a more streamlined and efficient search process. We also +use scoring, which involves the assignment of numerical values and weights to +documents based on various parameters, influencing their ranking in the search +results. Here is our different scores, each serving a unique purpose: + +1. **Recency**: This score considers the temporal aspect of documents, + prioritizing recently added or updated content. A document's recency is + crucial in reflecting the latest information available to users. + +1. **Traffic**: The frequency with which users consult a document influences its + score. Popular or frequently accessed documents are given higher scores, + indicating their relevance and importance to users. Warning: The home page is + rated really high since it's where every user land at first. + +1. 
**Current**: This score determines whether a document is currently accessible + (current=1) or if it has been archived (current=0). It helps users + distinguish between active and inactive content. + +1. **Typicality**: This score assesses whether a document is a typical + informative representation of the broader concept or topic being queried. + Typical documents are considered representative examples within a given + context and are assigned higher scores. This ensures that the search results + prioritize documents that serve as typical and informative instances within + the targeted topic. + + +By incorporating these scoring parameters, we fine-tune the document retrieval +process to align with user needs. It allows us to prioritize documents that are +not only recent, popular, and representative but also closely related to the +user's specific search criteria. This multi-faceted approach enhances the +efficiency and effectiveness of our document retrieval system, ensuring a more +tailored and user-friendly experience. From 6525f5942c4be3b038b37daeebda4a16098fe554 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?M=C3=A9lanie=20F?= <73828657+melanie-fressard@users.noreply.github.com> Date: Wed, 29 Nov 2023 19:46:43 +0000 Subject: [PATCH 02/10] issue #61 - adding jolan's scores --- search-scores-and-weights.md | 38 ++++++++++++++++++++++++++---------- 1 file changed, 28 insertions(+), 10 deletions(-) diff --git a/search-scores-and-weights.md b/search-scores-and-weights.md index 9eb7447..3bed563 100644 --- a/search-scores-and-weights.md +++ b/search-scores-and-weights.md @@ -1,7 +1,8 @@ -Our search system uses scoring to assign values to each document based on -different parameters. The interplay between clustering and scoring helps -optimize the search process, ensuring that the system considers both the content -and context to deliver more accurate and relevant results for the user. 
+At the re-ranking stage, our search system uses scoring to assign values to each +document based on different parameters. The interplay between clustering and +scoring helps optimize the search process, ensuring that the system considers +both the content and context to deliver more accurate and relevant results for +the user. This task becomes challenging when dealing with a large number of documents, creating the need for optimization strategies. A good approach is the @@ -24,12 +25,29 @@ results. Here is our different scores, each serving a unique purpose: (current=1) or if it has been archived (current=0). It helps users distinguish between active and inactive content. -1. **Typicality**: This score assesses whether a document is a typical - informative representation of the broader concept or topic being queried. - Typical documents are considered representative examples within a given - context and are assigned higher scores. This ensures that the search results - prioritize documents that serve as typical and informative instances within - the targeted topic. +1. **Typicality**: This score evaluates how closely the number of site + references for a document aligns with the average within a given thematic + context. Documents with typicality scores reflect a level of correspondence + with the average number of references, indicating their alignment with the + common standards within the specified topic. This ensures that the search + results prioritize documents not only based on their relevance to the user's + query but also considering how well they conform to the typical reference + patterns within the targeted theme. + +1. **Didactic**: This score evaluates the informational value within content + chunks that lack a defined structure such as tables. It focuses on the + quality and depth of information provided without relying on tabular formats + for organization or presentation. 
Documents with high didactic scores often + contain rich textual information, explanations, and details without reliance + on tabulated data. + +1. **Guidance**: This score pertains to content chunks extracted from + guidance-oriented pages, emphasizing their significance and relevance. + Guidance pages typically offer comprehensive direction, instruction, or + expert advice within a specific domain. As these pages tend to provide + crucial information or instructions sought by users, they are given priority + to ensure users can readily access the most helpful and directive content. + By incorporating these scoring parameters, we fine-tune the document retrieval From 9b216334aab59e6731eb9a6ea5d00aeac7ee526c Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?M=C3=A9lanie=20F?= <73828657+melanie-fressard@users.noreply.github.com> Date: Wed, 29 Nov 2023 19:50:28 +0000 Subject: [PATCH 03/10] issue #61 - adding a title and last updated date --- search-scores-and-weights.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/search-scores-and-weights.md b/search-scores-and-weights.md index 3bed563..a42b4d5 100644 --- a/search-scores-and-weights.md +++ b/search-scores-and-weights.md @@ -1,3 +1,6 @@ +# Scores and weights for search function +### Last updated: 2023-11-29 + At the re-ranking stage, our search system uses scoring to assign values to each document based on different parameters. 
The interplay between clustering and scoring helps optimize the search process, ensuring that the system considers From 27b5df41375a92e45650303477bfe9ab48f14780 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?M=C3=A9lanie=20F?= <73828657+melanie-fressard@users.noreply.github.com> Date: Wed, 29 Nov 2023 20:46:16 +0000 Subject: [PATCH 04/10] issue #61 - added similarity --- search-scores-and-weights.md | 30 +++++++++++++++++++++--------- 1 file changed, 21 insertions(+), 9 deletions(-) diff --git a/search-scores-and-weights.md b/search-scores-and-weights.md index a42b4d5..59f0ac1 100644 --- a/search-scores-and-weights.md +++ b/search-scores-and-weights.md @@ -5,24 +5,36 @@ At the re-ranking stage, our search system uses scoring to assign values to each document based on different parameters. The interplay between clustering and scoring helps optimize the search process, ensuring that the system considers both the content and context to deliver more accurate and relevant results for -the user. +the user. This task becomes challenging when dealing with a large number of documents, creating the need for optimization strategies. A good approach is the -utilization of clustering techniques, which categorize documents based on their -topics, facilitating a more streamlined and efficient search process. We also -use scoring, which involves the assignment of numerical values and weights to -documents based on various parameters, influencing their ranking in the search -results. Here is our different scores, each serving a unique purpose: +utilization of an indexing method called clustering, which categorizes documents +based on their topics, facilitating a more streamlined and efficient search +process. We also use scoring, which involves the assignment of numerical values +and weights to documents based on various parameters, influencing their ranking +in the search results. Here are our different scores, each serving a unique +purpose: + +1.
**Similarity**: Represents the primary signal in our scoring mechanism. It + denotes the measurement of how closely a document aligns with the user's + query, reflecting the relevance of the document to the search criteria. It is + not a static precomputed score but a dynamic metric that is computed on the + fly from the search results. This ensures that the relevance is tied to the + specific query, going beyond simple keyword matching. Documents with higher + similarity scores are considered more relevant. By prioritizing similarity as + the first signal in our scoring process, we aim to deliver search results + that are more accurate. 1. **Recency**: This score considers the temporal aspect of documents, prioritizing recently added or updated content. A document's recency is crucial in reflecting the latest information available to users. 1. **Traffic**: The frequency with which users consult a document influences its - score. Popular or frequently accessed documents are given higher scores, - indicating their relevance and importance to users. Warning: The home page is - rated really high since it's where every user land at first. + score. Popular or frequently accessed documents are given higher scores with + the help of web traffic logs, indicating their relevance and importance to + users. Warning: The home page is rated really high since it's where every + user land at first. 1. **Current**: This score determines whether a document is currently accessible (current=1) or if it has been archived (current=0). 
It helps users From 0e0365c24dcdc11fec5efd29312a3a87a4df1b19 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?M=C3=A9lanie=20F?= <73828657+melanie-fressard@users.noreply.github.com> Date: Fri, 1 Dec 2023 14:51:37 +0000 Subject: [PATCH 05/10] issue #61 - add file where each score is computed --- search-scores-and-weights.md | 53 ++++++++++++++------------- sql/2023-07-24-comment-score_type.sql | 8 +--- 2 files changed, 30 insertions(+), 31 deletions(-) diff --git a/search-scores-and-weights.md b/search-scores-and-weights.md index 59f0ac1..4a5d4b8 100644 --- a/search-scores-and-weights.md +++ b/search-scores-and-weights.md @@ -1,5 +1,4 @@ # Scores and weights for search function -### Last updated: 2023-11-29 At the re-ranking stage, our search system uses scoring to assign values to each document based on different parameters. The interplay between clustering and @@ -16,38 +15,42 @@ and weights to documents based on various parameters, influencing their ranking in the search results. Here is our different scores, each serving a unique purpose: -1. **Similarity**: Represents the primary signal in our scoring mechanism. It - denotes the measurement of how closely a document aligns with the user's - query, reflecting the relevance of the document to the search criteria. It is - not a static precomputed score but a dynamic metric that is computed on the - fly from the search results. This ensures that the relevance is tied to the +1. [**Similarity**](sql/2023-07-19-modify-score_type-add-similarity.sql): + Represents the primary signal in our scoring mechanism. It denotes the + measurement of how closely a document aligns with the user's query, + reflecting the relevance of the document to the search criteria. It is not a + static precomputed score but a dynamic metric that is computed on the fly + from the search results. This ensures that the relevance is tied to the specific query, going beyond simple keyword matching. 
Documents with higher similarity scores are considered more relevant. By prioritizing similarity as the first signal in our scoring process, we aim to deliver search results that are more accurate. -1. **Recency**: This score considers the temporal aspect of documents, - prioritizing recently added or updated content. A document's recency is - crucial in reflecting the latest information available to users. +1. [**Recency**](sql/schema2.sql.public_new.sql): This score considers the + temporal aspect of documents, prioritizing recently added or updated content. + A document's recency is crucial in reflecting the latest information + available to users. -1. **Traffic**: The frequency with which users consult a document influences its - score. Popular or frequently accessed documents are given higher scores with - the help of web traffic logs, indicating their relevance and importance to - users. Warning: The home page is rated really high since it's where every - user land at first. +1. [**Traffic**](sql/compute-traffic-score.sql): The frequency with which users + consult a document influences its score. Popular or frequently accessed + documents are given higher scores with the help of web traffic logs, + indicating their relevance and importance to users. Warning: The home page is + rated really high since it's where every user lands first. -1. **Current**: This score determines whether a document is currently accessible - (current=1) or if it has been archived (current=0). It helps users - distinguish between active and inactive content. +1. [**Current**](sql/2023-07-12-score-current.sql): This score determines + whether a document is currently accessible (current=1) or if it has been + archived (current=0). It helps users distinguish between active and inactive + content. -1. **Typicality**: This score evaluates how closely the number of site - references for a document aligns with the average within a given thematic - context.
Documents with typicality scores reflect a level of correspondence - with the average number of references, indicating their alignment with the - common standards within the specified topic. This ensures that the search - results prioritize documents not only based on their relevance to the user's - query but also considering how well they conform to the typical reference - patterns within the targeted theme. +1. [**Typicality**](sql/2023-07-12-calculate-incoming-outgoing-counts.sql): This + score evaluates how closely the number of site references for a document + aligns with the average within a given thematic context. Documents with + typicality scores reflect a level of correspondence with the average number + of references, indicating their alignment with the common standards within + the specified topic. This ensures that the search results prioritize + documents not only based on their relevance to the user's query but also + considering how well they conform to the typical reference patterns within + the targeted theme. 1. **Didactic**: This score evaluates the informational value within content chunks that lack a defined structure such as tables. It focuses on the diff --git a/sql/2023-07-24-comment-score_type.sql b/sql/2023-07-24-comment-score_type.sql index 30a5d59..5566144 100644 --- a/sql/2023-07-24-comment-score_type.sql +++ b/sql/2023-07-24-comment-score_type.sql @@ -1,9 +1,5 @@ comment on type score_type is $COMMENT$ score_type defines type of measurements that are taken into account. 
- -current is relative to if the document is marked as archive (0) or not (1) -traffic is number of pageviews from the web logs (relative to all other documents traffic) -recency is how recent the document was created or updated (relative to all other documents) -typicality is how close the number of site references for this document is close to the average -$COMMENT$; \ No newline at end of file +For more details about each scores, please read search-scores-and-weights.md +$COMMENT$; From f1a39a4259faa1754622b4e7c91284f194626648 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?M=C3=A9lanie=20F?= <73828657+melanie-fressard@users.noreply.github.com> Date: Fri, 1 Dec 2023 17:20:05 +0000 Subject: [PATCH 06/10] issue #61 - removing common standards --- search-scores-and-weights.md | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/search-scores-and-weights.md b/search-scores-and-weights.md index 4a5d4b8..2da5046 100644 --- a/search-scores-and-weights.md +++ b/search-scores-and-weights.md @@ -46,11 +46,10 @@ purpose: score evaluates how closely the number of site references for a document aligns with the average within a given thematic context. Documents with typicality scores reflect a level of correspondence with the average number - of references, indicating their alignment with the common standards within - the specified topic. This ensures that the search results prioritize - documents not only based on their relevance to the user's query but also - considering how well they conform to the typical reference patterns within - the targeted theme. + of references, indicating their alignment within the specified topic. This + ensures that the search results prioritize documents not only based on their + relevance to the user's query but also considering how well they conform to + the typical reference patterns within the targeted theme. 1. 
**Didactic**: This score evaluates the informational value within content chunks that lack a defined structure such as tables. It focuses on the From 1f06baa478b9554358b05e615ab094c5d2f05f2f Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?M=C3=A9lanie=20F?= <73828657+melanie-fressard@users.noreply.github.com> Date: Fri, 1 Dec 2023 21:25:02 +0000 Subject: [PATCH 07/10] issue #61 - adding scale for each score --- search-scores-and-weights.md | 21 +++++++++++++++++---- 1 file changed, 17 insertions(+), 4 deletions(-) diff --git a/search-scores-and-weights.md b/search-scores-and-weights.md index 2da5046..66e5fb4 100644 --- a/search-scores-and-weights.md +++ b/search-scores-and-weights.md @@ -25,22 +25,28 @@ purpose: similarity scores are considered more relevant. By prioritizing similarity as the first signal in our scoring process, we aim to deliver search results that are more accurate. + - Scaling: FROM 0.0 = least similar to user query TO 1.0 = most similar to + user query 1. [**Recency**](sql/schema2.sql.public_new.sql): This score considers the temporal aspect of documents, prioritizing recently added or updated content. A document's recency is crucial in reflecting the latest information - available to users. + available to users. + - Scaling: FROM 0.0 = oldest document TO 1.0 = most recent document 1. [**Traffic**](sql/compute-traffic-score.sql): The frequency with which users consult a document influences its score. Popular or frequently accessed documents are given higher scores with the help of web traffic logs, indicating their relevance and importance to users. Warning: The home page is rated really high since it's where every user lands first. + - Scaling: FROM 0.0 = least consulted document TO 1.0 = most consulted + document 1. [**Current**](sql/2023-07-12-score-current.sql): This score determines - whether a document is currently accessible (current=1) or if it has been - archived (current=0). It helps users distinguish between active and inactive - content.
+ whether a document is currently accessible or if it has been archived. It + helps users distinguish between active and inactive content. + - Scaling: 0.0 = archived document **OR** 1.0 = currently accessible + document 1. [**Typicality**](sql/2023-07-12-calculate-incoming-outgoing-counts.sql): This score evaluates how closely the number of site references for a document @@ -50,6 +56,8 @@ purpose: ensures that the search results prioritize documents not only based on their relevance to the user's query but also considering how well they conform to the typical reference patterns within the targeted theme. + - Scaling: FROM 0.0 = least referenced document TO 1.0 = most referenced + document 1. **Didactic**: This score evaluates the informational value within content chunks that lack a defined structure such as tables. It focuses on the @@ -57,6 +65,9 @@ purpose: for organization or presentation. Documents with high didactic scores often contain rich textual information, explanations, and details without reliance on tabulated data. + - Scaling: FROM 0.0 = doesn't contain really rich textual information, + explanations, and details TO 1.0 = contains rich textual information, + explanations, and details 1. **Guidance**: This score pertains to content chunks extracted from guidance-oriented pages, emphasizing their significance and relevance. @@ -64,6 +75,8 @@ purpose: expert advice within a specific domain. As these pages tend to provide crucial information or instructions sought by users, they are given priority to ensure users can readily access the most helpful and directive content.
+ - Scaling: FROM 0.0 = doesn't include crucial information or instructions + TO 1.0 = includes crucial information or instructions From 656f100c0d021d23f28eee7a03d30bd2a0975c77 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?M=C3=A9lanie=20F?= <73828657+melanie-fressard@users.noreply.github.com> Date: Tue, 5 Dec 2023 16:53:50 +0000 Subject: [PATCH 08/10] issue #61 - changing description of didactic --- search-scores-and-weights.md | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/search-scores-and-weights.md b/search-scores-and-weights.md index 66e5fb4..e7892ca 100644 --- a/search-scores-and-weights.md +++ b/search-scores-and-weights.md @@ -60,14 +60,15 @@ purpose: document 1. **Didactic**: This score evaluates the informational value within content - chunks that lack a defined structure such as tables. It focuses on the - quality and depth of information provided without relying on tabular formats - for organization or presentation. Documents with high didactic scores often - contain rich textual information, explanations, and details without reliance - on tabulated data. - - Scaling: FROM 0.0 = doesn't contain really rich textual information, - explanations, and details TO 1.0 = contains rich textual information, - explanations, and details + chunk. It scores higher based on the quality and readability of information + provided. Documents with high didactic scores often contain rich textual + information, explanations, and details. The implementation looks for + signs of the opposite to compute the score, for example the presence of a + large proportion of tabular data, which indicates data dumps from spreadsheets + or databases. + - Scaling: FROM 0.0 = mostly tabular data or information that is not + expected to be read sequentially by a user TO 1.0 = contains rich + textual information, explanations, and details 1.
**Guidance**: This score pertains to content chunks extracted from guidance-oriented pages, emphasizing their significance and relevance. From c8d5b5135336adf428e753b1ef19588576685259 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?M=C3=A9lanie=20F?= <73828657+melanie-fressard@users.noreply.github.com> Date: Tue, 5 Dec 2023 19:13:33 +0000 Subject: [PATCH 09/10] issue #61 - link to file --- sql/2023-07-24-comment-score_type.sql | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sql/2023-07-24-comment-score_type.sql b/sql/2023-07-24-comment-score_type.sql index 5566144..a6ab66c 100644 --- a/sql/2023-07-24-comment-score_type.sql +++ b/sql/2023-07-24-comment-score_type.sql @@ -1,5 +1,5 @@ comment on type score_type is $COMMENT$ score_type defines type of measurements that are taken into account. -For more details about each scores, please read search-scores-and-weights.md +For more details about each scores, please read 'https://github.com/ai-cfia/ailab-db/blob/main/search-scores-and-weights.md' $COMMENT$; From 40f270e69fd856f2f28bef7d6341cabd18823d60 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?M=C3=A9lanie=20F?= <73828657+melanie-fressard@users.noreply.github.com> Date: Wed, 6 Dec 2023 20:46:07 +0000 Subject: [PATCH 10/10] adding future scores --- search-scores-and-weights.md | 23 +++++++++++++++-------- sql/2023-07-24-comment-score_type.sql | 2 +- 2 files changed, 16 insertions(+), 9 deletions(-) diff --git a/search-scores-and-weights.md b/search-scores-and-weights.md index e7892ca..bf39d93 100644 --- a/search-scores-and-weights.md +++ b/search-scores-and-weights.md @@ -50,12 +50,10 @@ purpose: 1. [**Typicality**](sql/2023-07-12-calculate-incoming-outgoing-counts.sql): This score evaluates how closely the number of site references for a document - aligns with the average within a given thematic context. 
Documents with - typicality scores reflect a level of correspondence with the average number - of references, indicating their alignment within the specified topic. This - ensures that the search results prioritize documents not only based on their - relevance to the user's query but also considering how well they conform to - the typical reference patterns within the targeted theme. + aligns with the average. A document's typicality score reflects how closely + its number of references matches that average. This ensures that the + search results prioritize documents considering how well they conform to the + typical reference patterns within the targeted theme. - Scaling: FROM 0.0 = least referenced document TO 1.0 = most referenced document @@ -79,11 +77,20 @@ purpose: - Scaling: FROM 0.0 = doesn't include crucial information or instructions TO 1.0 = includes crucial information or instructions - - By incorporating these scoring parameters, we fine-tune the document retrieval process to align with user needs. It allows us to prioritize documents that are not only recent, popular, and representative but also closely related to the user's specific search criteria. This multi-faceted approach enhances the efficiency and effectiveness of our document retrieval system, ensuring a more tailored and user-friendly experience. + +## Future + +In addition to our current considerations, we can explore the integration of +thematic context into our scoring system. Thematic context involves a specific +focus on the subject or theme related to the user's query, ensuring that the +context is taken into account during the initial score calculation. To implement +this, we would need to incorporate topic labels for documents, a feature not yet +available in our system. Planning for such additional scores allows us to +enhance the depth and relevance of our responses by considering the specific +themes associated with user queries.
diff --git a/sql/2023-07-24-comment-score_type.sql b/sql/2023-07-24-comment-score_type.sql index a6ab66c..9fa5663 100644 --- a/sql/2023-07-24-comment-score_type.sql +++ b/sql/2023-07-24-comment-score_type.sql @@ -1,5 +1,5 @@ comment on type score_type is $COMMENT$ score_type defines type of measurements that are taken into account. -For more details about each scores, please read 'https://github.com/ai-cfia/ailab-db/blob/main/search-scores-and-weights.md' +For more details about each score, please read https://github.com/ai-cfia/ailab-db/blob/main/search-scores-and-weights.md $COMMENT$;
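
The documentation added in this series names seven scores, each scaled from 0.0 to 1.0, and says that weights influence ranking at the re-ranking stage, but it does not show how the scores are combined. The sketch below illustrates one plausible combination, a weighted average, purely as an assumption. The weight values, the `combined_score` and `rerank` helpers, and the dictionary layout of a document are all hypothetical and are not taken from the repository.

```python
# Hypothetical weights: the series does not state the real values.
# Keys follow the score names documented in search-scores-and-weights.md.
WEIGHTS = {
    "similarity": 0.5,
    "recency": 0.1,
    "traffic": 0.1,
    "current": 0.1,
    "typicality": 0.05,
    "didactic": 0.1,
    "guidance": 0.05,
}


def combined_score(scores, weights=WEIGHTS):
    """Return the weighted average of the per-score values.

    Each value is expected to lie in [0.0, 1.0], matching the documented
    scaling. A score missing from `scores` is treated as 0.0 (an assumption).
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = 0.0
    for name, weight in weights.items():
        value = scores.get(name, 0.0)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"score {name!r} is outside [0.0, 1.0]: {value}")
        weighted += weight * value
    return weighted / total_weight


def rerank(documents, weights=WEIGHTS):
    """Sort documents by combined score, highest first.

    Each document is assumed to be a dict with a "scores" mapping.
    """
    return sorted(
        documents,
        key=lambda doc: combined_score(doc["scores"], weights),
        reverse=True,
    )
```

Dividing by the total weight keeps the result in the same 0.0 to 1.0 range as the individual scores, so changing one weight does not require rescaling the others.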