Issue #79, Testing Current Score #84

Draft
wants to merge 2 commits into
base: main
36 changes: 36 additions & 0 deletions ailab/db/finesse/test_queries/__init__.py
@@ -44,3 +44,39 @@ def get_random_chunk(cursor, schema_version, seed=None):

    cursor.execute(query)
    return cursor.fetchall()


def get_random_document_score(cursor, schema_version, seed=None):
    if seed is None:
        seed = math.sin(time.time())

    # Execute the SET commands separately
    cursor.execute(f'SET SEARCH_PATH TO "{schema_version}", public;')
    cursor.execute(f"SET SEED TO {seed};")

    query = """
        WITH random_crawl AS (
            SELECT id
            FROM crawl
            ORDER BY
Contributor

This came up before: ordering the table randomly is expensive, because it means fetching every row and then picking one.

It's cheaper to pick a random row position (an offset from 0 to count(*) - 1) and use that with OFFSET combined with LIMIT:

https://www.postgresql.org/docs/current/queries-limit.html

Contributor Author

Yes, you told me about that. When I copy-pasted the query, I forgot to change ORDER BY to OFFSET. It will be changed soon.
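
For reference, the OFFSET/LIMIT approach discussed in this thread can be sketched roughly as below. This is only an illustration, not the code from this PR: it uses the standard-library `sqlite3` module with an in-memory table so that it runs anywhere, and the helper name `pick_random_row` and the sample URLs are made up. Against PostgreSQL the same two statements (a count, then `LIMIT 1 OFFSET n`) should apply, with the driver's own placeholder syntax. Note that the count is taken from the same table being sampled (`crawl`), and that OFFSET still skips rows one by one, so this is cheaper than a full random sort but not free.

```python
import random
import sqlite3


def pick_random_row(conn, rng=random):
    """Return one random (id, url) row via COUNT(*) + OFFSET, or None if empty.

    Hypothetical helper for illustration only.
    """
    cur = conn.cursor()
    # Count the rows once instead of sorting the whole table by random().
    cur.execute("SELECT COUNT(*) FROM crawl")
    (total,) = cur.fetchone()
    if total == 0:
        return None
    # OFFSET is zero-based, so valid offsets are 0 .. total - 1.
    offset = rng.randrange(total)
    # A stable ORDER BY makes a given offset map to the same row every time.
    cur.execute(
        "SELECT id, url FROM crawl ORDER BY id LIMIT 1 OFFSET ?", (offset,)
    )
    return cur.fetchone()


# Small in-memory table standing in for the real crawl table (assumed columns).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crawl (id INTEGER PRIMARY KEY, url TEXT)")
conn.executemany(
    "INSERT INTO crawl (id, url) VALUES (?, ?)",
    [(i, f"https://example.com/{i}") for i in range(1, 6)],
)
row = pick_random_row(conn, random.Random(42))
```

Passing a seeded `random.Random` instance keeps the pick reproducible in tests, which plays the same role as the `seed` argument in the functions above.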

                floor(random() * (
                    SELECT
                        COUNT(*)
                    FROM
                        Chunk
                ))
            LIMIT
                1
        )
        SELECT
            cr.id AS crawl_id, cr.url AS crawl_url, sc.score, sc.score_type
        FROM
            crawl cr
        INNER JOIN
            score sc ON cr.id = sc.entity_id
        WHERE
            cr.id = (SELECT id FROM random_crawl)
    """

    cursor.execute(query)
    return cursor.fetchall()
Contributor

Missing newline at end of file.

Contributor

(this should be validated by the repo standards check)
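
As a rough idea of what such a check does, here is a minimal sketch in Python. It is not the repository's actual standards check, whose implementation is not shown here; the function name `ends_with_newline` and the temporary file names are hypothetical.

```python
import tempfile
from pathlib import Path


def ends_with_newline(path):
    """Return True if the file is empty or its last byte is a newline."""
    data = Path(path).read_bytes()
    return not data or data.endswith(b"\n")


# Demonstrate on two throwaway files: one with a final newline, one without.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.py"
    bad = Path(tmp) / "bad.py"
    good.write_bytes(b"print('ok')\n")
    bad.write_bytes(b"print('ok')")
    results = {p.name: ends_with_newline(p) for p in (good, bad)}
```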

45 changes: 45 additions & 0 deletions bin/testing-current-score.py
@@ -0,0 +1,45 @@
import ailab.db as db

from ailab.db.finesse.test_queries import get_random_document_score


class NoChunkFoundError(Exception):
    pass


## This is a comment.
def evaluate_random_document(project_db):

    if project_db is None:
        print("Database connection failed.")
        return None

    with project_db.cursor() as cursor:

        random_chunk = get_random_document_score(cursor, "louis_v005")

        if not random_chunk:
            raise NoChunkFoundError("No chunk found in the database.")

        print("\n-------------")
        print("crawl_id:", random_chunk[0]["crawl_id"])
        print("crawl_url:", random_chunk[0]["crawl_url"])
        print("\n")

        print("-------------")
        for chunk in random_chunk:
            print("score_type:", chunk["score_type"])
            print("score:", chunk["score"])
            print("\n")

        return



def main():
    project_db = db.connect_db()
    evaluate_random_document(project_db)


if __name__ == "__main__":
    main()
6 changes: 6 additions & 0 deletions bin/testing-current-score.sh
@@ -0,0 +1,6 @@
#!/bin/bash
DIRNAME=$(dirname "$0")
. "$DIRNAME"/lib.sh


PYTHONPATH=$PROJECT_DIR python "$DIRNAME"/testing-current-score.py