Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Select json data from database #1791

Closed
wants to merge 9 commits into from
Closed

Conversation

dsblank
Copy link
Member

@dsblank dsblank commented Nov 16, 2024

This PR adds a generic method for querying the database. It has a pure-python implementation (in case some database backend cannot implement it, or for testing, but not designed for real use) and a DB-API implementation. The pure-python version could be removed from this PR.

Motivation

As we create an optimized filter API, we will need to create methods for querying the data using the power of the underlying database (such as SQL, or MongoDB). Some of these methods will require "business logic" methods that require JOINS or other more complicated queries.

However, a large number of queries are simple and can be implemented by simply querying the JSON data in a table.

Example

There are many examples in the 274 filter rules. Here is one that is used often: _hastag.py. This is used to determine if an object contains a tag.

    def apply(self, db, obj):
        """
        Apply the rule.  Return True for a match.
        """
        if self.tag_handle is None:
            return False
        return self.tag_handle in obj.get_tag_list()

Currently, it is designed to examine every object to see if the tag_handle is in the tag_list.

However, that can be re-written as:

    def prepare(self, db, user):
        self.tag_handle = None
        tag = db.get_tag_from_name(self.list[0])
        if tag is not None:
            self.tag_handle = tag.get_handle()
            results = db.select(self.table, ["$.handle"], ("$.tag_list", "LIKE", f'%"{self.tag_handle}"%'))
            self.map = set([row["handle"] for row in list(results)])
        else:
            self.map = set()

    def apply_to_one(self, db, data):
        return data["handle"] in self.map

The key line is:

db.select(self.table, ["$.handle"], ("$.tag_list", "LIKE", f'%"{self.tag_handle}"%'))

Decision

The choice is:

  1. Write specific methods for each of these simple selects
  2. Add a single method, db.select(), that will prevent having to write dozens of one-line methods

We need to make this decision before touching the 274 rules for the filter refactor/optimize PR.

@dsblank dsblank self-assigned this Nov 16, 2024
@dsblank dsblank requested a review from Nick-Hall November 16, 2024 13:31
"""
selections = selections if selections else ["$"]
if page_size is None:
limit = float("+inf")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
limit = float("+inf")
limit = sys.maxsize

Could you use sys.maxsize instead of converting a string to a float?

"""
Evaluate the where expression.
"""
if op == "=":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be clearer to use match \ case instead of if \ elif?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's Python 3.10+.

:param where: A single where-expression (see below)
:type where: tuple or list
:param sort_by: A list of expressions to sort on
:type where: tuple or list
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
:type where: tuple or list
:type sort_by: tuple or list

:param page: The page number to return (zero-based)
:type page: int
:param page_size: The size of a page in rows; None means ignore
:type page: int or None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
:type page: int or None
:type page_size: int or None

:param page: The page number to return (zero-based)
:type page: int
:param page_size: The size of a page in rows; None means ignore
:type page: int or None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
:type page: int or None
:type page_size: int or None

:param page: The page number to return (zero-based)
:type page: int
:param page_size: The size of a page in rows; None means ignore
:type page: int or None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
:type page: int or None
:type page_size: int or None

@dsblank
Copy link
Member Author

dsblank commented Nov 18, 2024

@Nick-Hall, before we invest too much time in reviewing and addressing issues, we need a couple of questions answered:

  1. Can we add db.select() to the database interface? If we don't, we we're going to have write many one-liners, and custom filter code devs won't have access to any SQL selections needed for the optimization (next PR) until it is added to the db layer.
  2. If we do add db.select() do we need to keep the Pure-Python version. I'm ok either way.

@stevenyoungs
Copy link
Contributor

If I read the code correctly, the current implementation of db.select() is is limited to a single table.
Is there a way to extend this to support multi-table queries e.g. get all the women who were born in 1850
Your PR will improve performance for this query but we still end up bringing more rows into memory than if we ran the entire query within the DB.
Without thinking about it too hard, you'd bring rows for all births in 1850 and all women and intersect the two results in python.

Any query would still need transforming from "gramps SQL" into the SQL dialect used by the DB given gramps support of different DBs and their differing syntax, especially for querying JSON.

@dsblank
Copy link
Member Author

dsblank commented Nov 19, 2024

If I read the code correctly, the current implementation of db.select() is is limited to a single table.

Yes, that is correct. I thought that this was a compromise to add just a little bit that can be converted into SQL, but also can be run without it if needed (Pure-Python version). There has been a lot of debate over this issue over the years, and I didn't want to wade into that.

Is there a way to extend this to support multi-table queries e.g. get all the women who were born in 1850

The line I am drawing is that if you need a JOIN, then you should write a "business logic" method.

Your PR will improve performance for this query but we still end up bringing more rows into memory than if we ran the entire query within the DB. Without thinking about it too hard, you'd bring rows for all births in 1850 and all women and intersect the two results in python.

As long as the selected handles are less than the total rows, then it will be faster, but at the expensive of some memory. We can actually make a decision in the rule.prepare() to decide to not make the rule.map if it is too big.

Any query would still need transforming from "gramps SQL" into the SQL dialect used by the DB given gramps support of different DBs and their differing syntax, especially for querying JSON.

Yes, another reason to keep this simple: it will be overloaded in MongoDb, etc.

@Call-Me-Dave
Copy link

Would MessagePack be useful?

It's like JSON.
but fast and small.

MessagePack is an efficient binary serialization format. It lets you exchange data among multiple languages like JSON. But it's faster and smaller. Small integers are encoded into a single byte, and typical short strings require only one extra byte in addition to the strings themselves.

https://msgpack.org/index.html

@dsblank
Copy link
Member Author

dsblank commented Nov 21, 2024

Would MessagePack be useful?

Maybe for some functions, but I don't think anything related to the latest work on Gramps. JSON is useful because it is self-documenting, allows the database to be used outside of Gramps, and can be directly queried by SQL.

@dsblank
Copy link
Member Author

dsblank commented Nov 22, 2024

I'm going to close this for now, as I don't want SQL issues getting in the way of moving forward with the filter fixes.

@dsblank dsblank closed this Nov 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants