issue #9, Developing a Search Function Test Utilizing LLM #47

Merged
merged 149 commits into from
Mar 7, 2024
Changes from 1 commit
f3db963
issue #41 - creation of the sql view
melanie-fressard Nov 2, 2023
9362d33
issue #44: fix package configuration
k-allagbe Nov 8, 2023
acb8b5d
issue #44: fix module import
k-allagbe Nov 8, 2023
17c7877
Merge remote-tracking branch 'origin/main' into 41-individual-scoring
melanie-fressard Nov 9, 2023
da2ad52
issue #41 - adding avg score
melanie-fressard Nov 9, 2023
8ba87c8
modification of .env.template to match standards
melanie-fressard Nov 9, 2023
9adc2d8
imports modification
melanie-fressard Nov 9, 2023
31df218
Fixes #9, new script
Nov 14, 2023
4f94a84
issue#24, change louis.db for ailab.db
Nov 14, 2023
c90769a
issue #41 - creation of init in ailab/db/finesse
melanie-fressard Nov 14, 2023
e668662
issue #20 - moving search.py
melanie-fressard Nov 15, 2023
30f6069
issue #41 - script to use the code
melanie-fressard Nov 15, 2023
67eca6e
issue #41 - debbug connexion db
melanie-fressard Nov 15, 2023
c97cd7d
issue #41 - debbug sql function
melanie-fressard Nov 15, 2023
21c6614
issue #41 - formatting
melanie-fressard Nov 16, 2023
403f782
issue #41 - separating creation and select
melanie-fressard Nov 16, 2023
32bc58a
issue #41 - csv file output + correction sql
melanie-fressard Nov 16, 2023
a1d7ab3
issue #41 - suppressing similarity column
melanie-fressard Nov 17, 2023
a3d10d2
issue #41 - reworking output making
melanie-fressard Nov 17, 2023
4924fc7
changes in template
melanie-fressard Nov 20, 2023
68cf448
issue #49 - modification from changing repo name
melanie-fressard Nov 20, 2023
3c38ab7
issue #49 - results file by query name
melanie-fressard Nov 20, 2023
ecfc757
Adding base script
JolanThomassin Nov 21, 2023
a687fdc
adressing issue #48 and final line
melanie-fressard Nov 21, 2023
92f327a
issue #49 - minor correction + script to launch
melanie-fressard Nov 21, 2023
08401f8
Fixes #9, Query and LLM Q&A generation
JolanThomassin Nov 21, 2023
7a195f1
changes of the call pf the search function
melanie-fressard Nov 21, 2023
6888691
issue #49 - search output
melanie-fressard Nov 21, 2023
67d5b63
line eof
melanie-fressard Nov 21, 2023
8f34a6c
Enhances Chunk Selection Quality - Resolves #9
JolanThomassin Nov 21, 2023
93841ef
lintests fails attempt to correct
melanie-fressard Nov 22, 2023
16a5528
lintests fails attempt to correct
melanie-fressard Nov 22, 2023
4363ef1
Merge branch '49-expand-search-examples' of https://github.com/ai-cfi…
melanie-fressard Nov 22, 2023
baaa739
issue #41 - eof line
melanie-fressard Nov 22, 2023
963c049
issue #49 - secrets correction
melanie-fressard Nov 22, 2023
7f436f1
issue #41 - eof without tabs
melanie-fressard Nov 23, 2023
bb483ed
issue #49 - removing search balise
melanie-fressard Nov 23, 2023
1e40ae9
eof
melanie-fressard Nov 23, 2023
009d2de
Fixes #9, new SQL files
JolanThomassin Nov 23, 2023
53758f5
issue #41 - avg instead of sum
melanie-fressard Nov 23, 2023
eb5e1e1
Fixes #9, save, check tokens, better query
JolanThomassin Nov 23, 2023
9bd6b6c
issue #49 - linttest correction
melanie-fressard Nov 24, 2023
27bec9f
test
melanie-fressard Nov 24, 2023
ffb7126
issue #31 - correction of lint test
melanie-fressard Nov 24, 2023
7f4d360
Merge pull request #50 from ai-cfia/49-expand-search-examples
melanie-fressard Nov 24, 2023
9428c57
issue #31 - resolve lint test
melanie-fressard Nov 24, 2023
080a005
issue #31 - lint test merge
melanie-fressard Nov 24, 2023
352b767
issue #41 - solve merge conflict
melanie-fressard Nov 24, 2023
31ee685
Merge branch 'main' into 41-individual-scoring
melanie-fressard Nov 24, 2023
776cbfb
Merge pull request #42 from ai-cfia/41-individual-scoring
melanie-fressard Nov 24, 2023
6446b30
issue #54: workflow call to main
vivalareda Nov 24, 2023
f5ca3e4
Merge pull request #55 from ai-cfia/issue-54-fix-workflow-for-ailab-db
vivalareda Nov 24, 2023
8871006
issue #56: run deploy only on main
vivalareda Nov 25, 2023
b862a4a
Merge branch 'main' into k-allagbe/issue44-package-submodule-configur…
k-allagbe Nov 25, 2023
3d44f7c
Merge pull request #60 from ai-cfia/k-allagbe/issue44-package-submodu…
k-allagbe Nov 27, 2023
f9e0bc7
Bump certifi from 2023.5.7 to 2023.7.22
dependabot[bot] Nov 27, 2023
e7cec22
Bump urllib3 from 2.0.3 to 2.0.7
dependabot[bot] Nov 27, 2023
d8079a5
Bump aiohttp from 3.8.4 to 3.8.6
dependabot[bot] Nov 27, 2023
59cf33b
Merge branch 'main' into 56-run-deploy-workflow-step-only-if-branch-i…
vivalareda Nov 27, 2023
ffe988a
Merge pull request #57 from ai-cfia/56-run-deploy-workflow-step-only-…
vivalareda Nov 27, 2023
021b3cf
Merge branch 'main' into dependabot/pip/certifi-2023.7.22
melanie-fressard Nov 27, 2023
782a6a2
Merge branch 'main' into dependabot/pip/urllib3-2.0.7
melanie-fressard Nov 27, 2023
7ece22e
Merge branch 'main' into dependabot/pip/aiohttp-3.8.6
melanie-fressard Nov 27, 2023
e9164f6
Merge pull request #53 from ai-cfia/dependabot/pip/urllib3-2.0.7
melanie-fressard Nov 27, 2023
6ddcca1
Merge branch 'main' into dependabot/pip/aiohttp-3.8.6
melanie-fressard Nov 27, 2023
f372a04
Merge pull request #51 from ai-cfia/dependabot/pip/aiohttp-3.8.6
melanie-fressard Nov 27, 2023
6c382d9
Merge branch 'main' into dependabot/pip/certifi-2023.7.22
melanie-fressard Nov 27, 2023
f29d8d6
Merge pull request #52 from ai-cfia/dependabot/pip/certifi-2023.7.22
melanie-fressard Nov 27, 2023
eea27c0
Bump aiohttp from 3.8.6 to 3.9.0
dependabot[bot] Nov 28, 2023
6e58fa0
Fixes #9, new SQL script
JolanThomassin Nov 28, 2023
62ac36d
issue #61 - add md file
melanie-fressard Nov 28, 2023
5733ba0
log on schema
melanie-fressard Nov 29, 2023
8b6b929
fix of lint tests
melanie-fressard Nov 29, 2023
1bfa1cf
ruff error
melanie-fressard Nov 29, 2023
8607c58
Merge pull request #63 from ai-cfia/dependabot/pip/aiohttp-3.9.0
melanie-fressard Nov 29, 2023
6525f59
issue #61 - adding jolan's scores
melanie-fressard Nov 29, 2023
9b21633
issue #61 - adding a title and last updated date
melanie-fressard Nov 29, 2023
f83b7b3
Fixes #9, refactored code
JolanThomassin Nov 29, 2023
8fca008
Fixes #9, black formatter
JolanThomassin Nov 29, 2023
27b5df4
issue #61 - added similarity
melanie-fressard Nov 29, 2023
0e0365c
issue #61 - add file where each score is computed
melanie-fressard Dec 1, 2023
3e93698
Merge branch 'main' into 61-explain-scores-and-weights
melanie-fressard Dec 1, 2023
f1a39a4
issue #61 - removing common standards
melanie-fressard Dec 1, 2023
6ae4b00
Merge remote-tracking branch 'refs/remotes/origin/61-explain-scores-a…
melanie-fressard Dec 1, 2023
1f06baa
issue #61 - adding scale for each score
melanie-fressard Dec 1, 2023
656f100
issue #61 - changing description of didactic
melanie-fressard Dec 5, 2023
c8d5b51
issue #61 - link to file
melanie-fressard Dec 5, 2023
40f270e
adding future scores
melanie-fressard Dec 6, 2023
7db06b4
Merge pull request #64 from ai-cfia/61-explain-scores-and-weights
melanie-fressard Dec 11, 2023
5e716ce
Fixes #9, new SQL scripts
JolanThomassin Dec 12, 2023
8d2e67f
Fixes #9, code clarification
JolanThomassin Dec 12, 2023
afffd40
Fixes #9, unit test for search qna function
JolanThomassin Dec 12, 2023
42965b0
Fixes #9, set schema fix
JolanThomassin Dec 13, 2023
2c0e141
Fixes #9, adding seed to get random chunk
JolanThomassin Dec 13, 2023
4f869e7
Fixes #9, script rename
JolanThomassin Dec 13, 2023
12448f0
Fixes #9, delete old script
JolanThomassin Dec 13, 2023
86adfa3
Fixes #9, renaming scripts mistakes
JolanThomassin Dec 13, 2023
9e4aa52
Fixes #9, cursor only open once
JolanThomassin Dec 13, 2023
a918685
Fixes #9, black formatter
JolanThomassin Dec 13, 2023
e9e0533
Fixes #9, character length
Jan 15, 2024
4f8664d
Fixes #9, magic string
Jan 15, 2024
6d85615
Fixes #9, argparse
Jan 15, 2024
384e58e
Fixes #9, new script
Nov 14, 2023
9fa9902
issue#24, change louis.db for ailab.db
Nov 14, 2023
6bc85ed
Fixes #9, Query and LLM Q&A generation
JolanThomassin Nov 21, 2023
38d8e42
Enhances Chunk Selection Quality - Resolves #9
JolanThomassin Nov 21, 2023
c8e68bb
Fixes #9, new SQL files
JolanThomassin Nov 23, 2023
02252f0
Fixes #9, save, check tokens, better query
JolanThomassin Nov 23, 2023
12b3042
Fixes #9, new SQL script
JolanThomassin Nov 28, 2023
5116063
Fixes #9, refactored code
JolanThomassin Nov 29, 2023
e509c4b
Fixes #9, black formatter
JolanThomassin Nov 29, 2023
9b2f83b
Fixes #9, new SQL scripts
JolanThomassin Dec 12, 2023
f3585d3
Fixes #9, code clarification
JolanThomassin Dec 12, 2023
3b89bf6
Fixes #9, unit test for search qna function
JolanThomassin Dec 12, 2023
b172f5f
Fixes #9, set schema fix
JolanThomassin Dec 13, 2023
8fb2add
Fixes #9, adding seed to get random chunk
JolanThomassin Dec 13, 2023
daffde7
Fixes #9, script rename
JolanThomassin Dec 13, 2023
f1d6ec5
Fixes #9, delete old script
JolanThomassin Dec 13, 2023
131830f
Fixes #9, renaming scripts mistakes
JolanThomassin Dec 13, 2023
8df4ba4
Fixes #9, cursor only open once
JolanThomassin Dec 13, 2023
0e9dfe5
Fixes #9, black formatter
JolanThomassin Dec 13, 2023
a625206
Fixes #9, character length
Jan 15, 2024
408acb5
Fixes #9, magic string
Jan 15, 2024
53e7494
Fixes #9, argparse
Jan 15, 2024
d08f373
Merge remote-tracking branch 'origin/issue#9-search-function-test-jt'…
Feb 1, 2024
2565a5c
Fixes #9, fixed ruff error
Feb 8, 2024
3d1266d
Fixes #9, file rename
Feb 8, 2024
c25f8fc
Fixes #9, test first function
Feb 8, 2024
a110fad
Fixes #9, clearer JSON template
JolanThomassin Feb 8, 2024
9215998
Fixes #9, missing line break
JolanThomassin Feb 8, 2024
4c02761
Fixes #9, path changes for test
JolanThomassin Feb 8, 2024
b1a769e
Fixes #9, new ENV var for schema
JolanThomassin Feb 8, 2024
6b3ede6
Fixes #9, test_generate_question
JolanThomassin Feb 8, 2024
ee5374e
Fixes #9, lint ruff error
JolanThomassin Feb 8, 2024
b721451
Fixes #9, add black formatter extension
JolanThomassin Feb 8, 2024
418399b
Fixes #9, add semver to requirements
JolanThomassin Feb 8, 2024
39e159d
Fixes #9, changes semver version
JolanThomassin Feb 8, 2024
b86fbe9
Fixes #9, test for db failure
JolanThomassin Feb 12, 2024
b7e5b56
Fixes #9, replace sys.exit(1)
JolanThomassin Feb 12, 2024
2046aa7
Fixes #9, import removed
JolanThomassin Feb 22, 2024
a40b7d5
Fixes #9, separate save for test
JolanThomassin Feb 22, 2024
7ecb190
Fixes #9, import at the top
JolanThomassin Feb 22, 2024
252d3b3
Fixes #9, import at the top
JolanThomassin Feb 22, 2024
d19bb55
Fixes #9, fixed number of generated question
JolanThomassin Feb 26, 2024
f1b016a
Fixes #9, adding "question_quality" variable
JolanThomassin Feb 26, 2024
ca58322
Fixes #9, random query new method
JolanThomassin Feb 29, 2024
8cd1664
Fixes #9, adding parameter into call
JolanThomassin Feb 29, 2024
fa67e42
Fixes #9, user_prompt more example
JolanThomassin Feb 29, 2024
71a436f
Fixes #9, remove "question_quality" variable
JolanThomassin Mar 4, 2024
26 changes: 23 additions & 3 deletions ailab/db/finesse/test_queries/__init__.py
@@ -1,12 +1,19 @@
 def get_random_chunk(cursor):
     query = """
-        SELECT dc.score AS score, cr.id AS crawl_id, ch.id AS chunk_id, ch.title, cr.url AS crawl_url, ch.text_content, ch.text_content
+        SELECT dc.score AS score, cr.id AS crawl_id, ch.id AS chunk_id, ch.title, cr.url AS crawl_url, ch.text_content
         FROM Chunk ch
         INNER JOIN html_content_to_chunk hctc ON ch.id = hctc.chunk_id
         INNER JOIN html_content hc ON hctc.md5hash = hc.md5hash
         INNER JOIN crawl cr ON hc.md5hash = cr.md5hash
         INNER JOIN documents dc ON ch.id = dc.chunk_id
-        WHERE dc.score > 0.0
+        WHERE dc.score > 0.0
+        AND EXISTS (
+            SELECT 1
+            FROM score sc
+            WHERE sc.entity_id = ch.id
+            AND sc.score_type = 'current'
+            AND sc.score > 0.0
+        )
         ORDER BY RANDOM()
Contributor:

Sorting the entire table just to pick a single row is enormously expensive.

How about:

SELECT *
FROM table_name
OFFSET floor(random() * (SELECT COUNT(*) FROM table_name))
LIMIT 1;

https://www.postgresql.org/docs/current/queries-limit.html

Contributor:

Still not addressed, @JolanThomassin.

Contributor Author:

I changed it. It looks a little faster now, but the quality stays the same.

Contributor Author:

It's definitely way faster.
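
For reference, the reviewer's OFFSET suggestion can be sketched in Python roughly as follows. This is a minimal illustration and not the code that was merged: `build_random_row_query` and `fetch_random_row` are hypothetical helper names, the cursor is assumed to be any DB-API cursor connected to PostgreSQL, and the table name is validated only because it is interpolated into the SQL text.

```python
import re

# Hypothetical guard: accept only plain or schema-qualified identifiers,
# since the table name is interpolated into the SQL string below.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?$")


def build_random_row_query(table_name: str) -> str:
    """Build the OFFSET-based random-row query suggested in the review."""
    if not _IDENTIFIER.match(table_name):
        raise ValueError(f"unsafe table name: {table_name!r}")
    return (
        f"SELECT * FROM {table_name} "
        f"OFFSET floor(random() * (SELECT COUNT(*) FROM {table_name})) "
        "LIMIT 1;"
    )


def fetch_random_row(cursor, table_name: str):
    """Run the query on a DB-API cursor and return one row (or None)."""
    cursor.execute(build_random_row_query(table_name))
    return cursor.fetchone()
```

This avoids sorting every row by a random key. The PostgreSQL documentation linked above notes that rows skipped by OFFSET still have to be computed inside the server, so the cost still grows with the size of the offset.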

         LIMIT 1;
     """
@@ -15,7 +22,20 @@ def get_random_chunk(cursor):

 def chunk_test_quality(cursor):
     query = """
-
+        SELECT
+            ch.id AS chunk_score_id,
+            hc.md5hash AS md5hash_content_to_chunk,
+            hc.content AS html_content
+        FROM
+            louis_006.chunk_score ch
+        LEFT JOIN
+            louis_006.html_content_to_chunk hctc ON ch.id = hctc.chunk_id
+        LEFT JOIN
+            louis_006.html_content hc ON hctc.md5hash = hc.md5hash
+        WHERE
+            ch.score > 0.9
+        LIMIT
+            1;
     """
     cursor.execute(query)
     return cursor.fetchall()
26 changes: 17 additions & 9 deletions bin/search-function-test-utilizing-llm.py
@@ -1,15 +1,16 @@
 import os
k-allagbe marked this conversation as resolved.
 import sys
 import json
+from datetime import date

 import ailab.db as db
 import ailab.db.finesse as finesse
 from ailab.models import openai

 from ailab.db.finesse.test_queries import get_random_chunk
+from ailab.db.finesse.test_queries import chunk_test_quality

-TEST_VERSION = "v001"
-WANTED_GENERATED_QUESTIONS = 10
+TEST_VERSION = date.today()
+WANTED_GENERATED_QUESTIONS = 5
 CHARACTER_LIMIT = 14383


@@ -34,7 +35,15 @@ def main():
 print("System Prompt:", system_prompt + "\n")
 print("User Prompt:", user_prompt + "\n")

-average_tokens_by_chunk = 0
+AVERAGE_TOKENS_BY_CHUNK = 0
+
+### WIP - TESTING NEW QUERY - WIP ###
+with project_db.cursor() as cursor:
+random_chunk = chunk_test_quality(cursor)
+print(random_chunk)
+### WIP - TESTING NEW QUERY - WIP ###
+
+"""
 for i in range(WANTED_GENERATED_QUESTIONS):
 random_chunk = ""

@@ -57,9 +66,7 @@ def main():
 )

 total_length = len(system_prompt) + len(constructed_user_prompt)
-print("Token limit : " + str(CHARACTER_LIMIT))
-print("Prompt character : " + str(total_length) + "\n")
-average_tokens_by_chunk += total_length
+AVERAGE_TOKENS_BY_CHUNK += total_length
 if total_length < CHARACTER_LIMIT:
 response = openai.get_chat_answer(
 system_prompt, constructed_user_prompt, 2000
@@ -86,8 +93,9 @@ def main():
 print("File saved into: " + file_path)
 json.dump(data, json_file, ensure_ascii=False, indent=4)

-average_tokens_by_chunk = average_tokens_by_chunk / WANTED_GENERATED_QUESTIONS
-print("Average Tokens send to the API : " + str(average_tokens_by_chunk))
+AVERAGE_TOKENS_BY_CHUNK = AVERAGE_TOKENS_BY_CHUNK / WANTED_GENERATED_QUESTIONS
+print("Average Tokens send to the API : " + str(AVERAGE_TOKENS_BY_CHUNK))
+"""


 if __name__ == "__main__":
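
The size check in the script above compares a character count (`len(system_prompt) + len(constructed_user_prompt)`) against `CHARACTER_LIMIT`, even though the variable names and messages mention tokens. A minimal standalone sketch of that bookkeeping follows. The helper names are hypothetical and do not appear in the PR; only the limit value is taken from the script.

```python
CHARACTER_LIMIT = 14383  # value used by the script in this PR


def prompt_length(system_prompt: str, user_prompt: str) -> int:
    """Combined prompt size in characters (not model tokens)."""
    return len(system_prompt) + len(user_prompt)


def fits_budget(system_prompt: str, user_prompt: str,
                limit: int = CHARACTER_LIMIT) -> bool:
    """Mirror the script's `total_length < CHARACTER_LIMIT` comparison."""
    return prompt_length(system_prompt, user_prompt) < limit


def average_prompt_length(lengths: list[int]) -> float:
    """Average over the prompts actually measured; 0.0 if there were none."""
    return sum(lengths) / len(lengths) if lengths else 0.0
```

One design difference from the script: the average here divides by the number of measured prompts, whereas the script divides by `WANTED_GENERATED_QUESTIONS`.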
36 changes: 36 additions & 0 deletions sql/2023-11-28-chunk-didactic-score.sql
@@ -0,0 +1,36 @@
+-- Set the search path to the louis_006 schema
+SET search_path TO louis_006;
+
+CREATE TABLE IF NOT EXISTS chunk_score (
+    id UUID,
+    score FLOAT,
+    score_type VARCHAR(50)
+);
+
+TRUNCATE TABLE chunk_score;
+
+INSERT INTO chunk_score (id, score, score_type)
+SELECT
+    ch.id, -- Use the id column from the chunk table
+    ROUND(
+        (
+            LENGTH(hc.content) - length_values.min_val
+        ) * 1.0 / (length_values.max_val - length_values.min_val),
+        1
+    ) AS tr_proportion,
+    'didactic' AS score_type
+FROM
+    louis_006.chunk ch
+    INNER JOIN louis_006.html_content_to_chunk hctc ON ch.id = hctc.chunk_id
+    INNER JOIN louis_006.html_content hc ON hctc.md5hash = hc.md5hash
+    CROSS JOIN (
+        SELECT
+            MIN(LENGTH(content)) AS min_val,
+            MAX(LENGTH(content)) AS max_val
+        FROM
+            louis_006.chunk ch
+            INNER JOIN louis_006.html_content_to_chunk hctc ON ch.id = hctc.chunk_id
+            INNER JOIN louis_006.html_content hc ON hctc.md5hash = hc.md5hash
+    ) AS length_values
+ORDER BY
+    tr_proportion DESC;
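
The INSERT above computes a min-max normalization of the HTML content length, rounded to one decimal. The same arithmetic can be sketched in Python as below. `didactic_scores` is a hypothetical name, and `ROUND_HALF_UP` is used on the assumption that it matches how `ROUND` treats ties for the numeric values produced by the `* 1.0` factor.

```python
from decimal import Decimal, ROUND_HALF_UP


def didactic_scores(lengths: list[int]) -> list[float]:
    """Scale each content length to [0, 1] and round to one decimal."""
    if not lengths:
        return []
    lo, hi = min(lengths), max(lengths)
    if hi == lo:
        # The SQL version would divide by zero in this case.
        raise ValueError("all lengths are equal; the score is undefined")
    span = Decimal(hi - lo)
    step = Decimal("0.1")
    return [
        float((Decimal(n - lo) / span).quantize(step, rounding=ROUND_HALF_UP))
        for n in lengths
    ]


print(didactic_scores([0, 50, 100]))  # [0.0, 0.5, 1.0]
```

Because the score depends only on the length of `html_content.content`, every chunk attached to the same HTML page receives the same value.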
6 changes: 6 additions & 0 deletions sql/2023-11-28-create-histogram.sql
@@ -0,0 +1,6 @@
+SELECT
+    score,
+    count(*) as count
+FROM louis_006.chunk_score
+GROUP BY score
+ORDER BY score;
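
Since the scores are rounded to one decimal, this GROUP BY yields at most eleven buckets (0.0 through 1.0). An equivalent in-memory tally, given only as an illustration with a hypothetical function name, could look like this:

```python
from collections import Counter


def score_histogram(scores: list[float]) -> list[tuple[float, int]]:
    """Count occurrences of each score, ordered by score like the SQL query."""
    return sorted(Counter(scores).items())


print(score_histogram([0.1, 0.0, 0.1, 0.5]))  # [(0.0, 1), (0.1, 2), (0.5, 1)]
```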
3 changes: 3 additions & 0 deletions sql/2023-11-28-print-schema-table.sql
@@ -0,0 +1,3 @@
+SELECT table_name, column_name, data_type
+FROM information_schema.columns
+WHERE table_schema = 'louis_006';