Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Select json data from database #1791

Open
wants to merge 9 commits into
base: dsb/depickle
Choose a base branch
from
Open

Conversation

dsblank
Copy link
Member

@dsblank dsblank commented Nov 16, 2024

This PR adds a generic method for querying the database. It has a pure-python implementation (in case some database backend cannot implement it, or for testing, but not designed for real use) and a DB-API implementation. The pure-python version could be removed from this PR.

Motivation

As we create an optimized filter API, we will need to create methods for querying the data using the power of the underlying database (such as SQL, or MongoDB). Some of these methods will require "business logic" methods that require JOINS or other more complicated queries.

However, a large number of queries are simple and can be implemented by simply querying the JSON data in a table.

Example

There are many examples in the 274 filter rules. Here is one that is used often: _hastag.py. This is used to determine if an object contains a tag.

    def apply(self, db, obj):
        """
        Apply the rule.  Return True for a match.
        """
        if self.tag_handle is None:
            return False
        return self.tag_handle in obj.get_tag_list()

Currently, it is designed to examine every object to see if the tag_handle is in the tag_list.

However, that can be re-written as:

    def prepare(self, db, user):
        self.tag_handle = None
        tag = db.get_tag_from_name(self.list[0])
        if tag is not None:
            self.tag_handle = tag.get_handle()
            results = db.select(self.table, ["$.handle"], ("$.tag_list", "LIKE", f'%"{self.tag_handle}"%'))
            self.map = set([row["handle"] for row in list(results)])
        else:
            self.map = set()

    def apply_to_one(self, db, data):
        return data["handle"] in self.map

The key line is:

db.select(self.table, ["$.handle"], ("$.tag_list", "LIKE", f'%"{self.tag_handle}"%'))

Decision

The choice is:

  1. Write specific methods for each of these simple selects
  2. Add a single method, db.select(), that will prevent having to write dozens of one-line methods

We need to make this decision before touching the 274 rules for the filter refactor/optimize PR.

@dsblank dsblank self-assigned this Nov 16, 2024
"""
selections = selections if selections else ["$"]
if page_size is None:
limit = float("+inf")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
limit = float("+inf")
limit = sys.maxsize

Could you use sys.maxsize instead of converting a string to a float?

"""
Evaluate the where expression.
"""
if op == "=":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be clearer to use match \ case instead of if \ elif?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's Python 3.10+.

:param where: A single where-expression (see below)
:type where: tuple or list
:param sort_by: A list of expressions to sort on
:type where: tuple or list
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
:type where: tuple or list
:type sort_by: tuple or list

:param page: The page number to return (zero-based)
:type page: int
:param page_size: The size of a page in rows; None means ignore
:type page: int or None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
:type page: int or None
:type page_size: int or None

:param page: The page number to return (zero-based)
:type page: int
:param page_size: The size of a page in rows; None means ignore
:type page: int or None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
:type page: int or None
:type page_size: int or None

:param page: The page number to return (zero-based)
:type page: int
:param page_size: The size of a page in rows; None means ignore
:type page: int or None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
:type page: int or None
:type page_size: int or None

@dsblank
Copy link
Member Author

dsblank commented Nov 18, 2024

@Nick-Hall, before we invest too much time in reviewing and addressing issues, we need a couple of questions answered:

  1. Can we add db.select() to the database interface? If we don't, we we're going to have write many one-liners, and custom filter code devs won't have access to any SQL selections needed for the optimization (next PR) until it is added to the db layer.
  2. If we do add db.select() do we need to keep the Pure-Python version. I'm ok either way.

@stevenyoungs
Copy link
Contributor

If I read the code correctly, the current implementation of db.select() is is limited to a single table.
Is there a way to extend this to support multi-table queries e.g. get all the women who were born in 1850
Your PR will improve performance for this query but we still end up bringing more rows into memory than if we ran the entire query within the DB.
Without thinking about it too hard, you'd bring rows for all births in 1850 and all women and intersect the two results in python.

Any query would still need transforming from "gramps SQL" into the SQL dialect used by the DB given gramps support of different DBs and their differing syntax, especially for querying JSON.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants