Select json data from database #1791

dsblank · 2024-11-16T13:25:33Z

This PR adds a generic method for querying the database. It has a pure-Python implementation (for database backends that cannot implement it, or for testing; it is not designed for real use) and a DB-API implementation. The pure-Python version could be removed from this PR.

Motivation

As we create an optimized filter API, we will need methods that query the data using the power of the underlying database (such as SQL or MongoDB). Some of these will be "business logic" methods that need JOINs or other more complicated queries.

However, a large number of queries are simple and can be implemented by simply querying the JSON data in a table.

Example

There are many examples in the 274 filter rules. Here is one that is used often: _hastag.py. This is used to determine if an object contains a tag.

    def apply(self, db, obj):
        """
        Apply the rule.  Return True for a match.
        """
        if self.tag_handle is None:
            return False
        return self.tag_handle in obj.get_tag_list()

Currently, it is designed to examine every object to see if the tag_handle is in the tag_list.

However, that can be re-written as:

    def prepare(self, db, user):
        # Run one query up front and cache the handles of all matching objects.
        self.tag_handle = None
        tag = db.get_tag_from_name(self.list[0])
        if tag is not None:
            self.tag_handle = tag.get_handle()
            # Match the quoted handle anywhere in the JSON text of tag_list.
            results = db.select(self.table, ["$.handle"], ("$.tag_list", "LIKE", f'%"{self.tag_handle}"%'))
            self.map = {row["handle"] for row in results}
        else:
            # Unknown tag: nothing can match.
            self.map = set()

    def apply_to_one(self, db, data):
        return data["handle"] in self.map

The key line is:

    db.select(self.table, ["$.handle"], ("$.tag_list", "LIKE", f'%"{self.tag_handle}"%'))
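To make the semantics of that call concrete, here is a minimal pure-Python sketch of what a single-table select with a LIKE condition might do. It is not the PR's implementation: `select_rows`, `get_path`, and `like` are hypothetical helpers, rows are assumed to be plain dicts, and only `=` and `LIKE` are handled. Unlike SQLite's default LIKE, this emulation is case-sensitive.

```python
import json
import re


def get_path(data, path):
    """Resolve a simple JSON path such as "$.handle" against a dict."""
    value = data
    for part in path.split(".")[1:]:  # skip the leading "$"
        value = value[part]
    return value


def like(text, pattern):
    """Emulate SQL LIKE: % matches any run of characters, _ matches one."""
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, text, flags=re.DOTALL) is not None


def select_rows(rows, selections, where):
    """Yield one dict per matching row, keyed by the selected field names."""
    lhs, op, rhs = where
    for row in rows:
        value = get_path(row, lhs)
        if op == "LIKE":
            # Non-string values are compared against their JSON text.
            text = value if isinstance(value, str) else json.dumps(value)
            matched = like(text, rhs)
        elif op == "=":
            matched = value == rhs
        else:
            raise ValueError(f"unsupported operator: {op}")
        if matched:
            yield {path.split(".")[-1]: get_path(row, path) for path in selections}


# Illustrative data only; real Gramps objects carry many more fields.
people = [
    {"handle": "p1", "tag_list": ["t1", "t2"]},
    {"handle": "p2", "tag_list": []},
    {"handle": "p3", "tag_list": ["t2"]},
]
tag_handle = "t2"
results = select_rows(people, ["$.handle"], ("$.tag_list", "LIKE", f'%"{tag_handle}"%'))
handles = {row["handle"] for row in results}
print(handles)
```

The quotes inside the pattern matter: matching `"t2"` rather than `t2` keeps a handle from matching as a substring of a longer handle.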

Decision

The choice is:

1. Write specific methods for each of these simple selects, or
2. Add a single method, db.select(), that will prevent having to write dozens of one-line methods

We need to make this decision before touching the 274 rules for the filter refactor/optimize PR.

stevenyoungs · 2024-11-18T13:07:41Z

gramps/gen/db/select_utils.py

+    """
+    selections = selections if selections else ["$"]
+    if page_size is None:
+        limit = float("+inf")


Suggested change

-        limit = float("+inf")
+        limit = sys.maxsize

Could you use sys.maxsize instead of converting a string to a float?
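One practical difference between the two: a float limit cannot be used where Python expects an int, such as a slice bound. Whether that matters depends on how `limit` is used in select_utils.py, which is not shown here. A minimal sketch, assuming a hypothetical `paginate` helper with the zero-based `page` and optional `page_size` semantics described in this PR's docstrings:

```python
import sys


def paginate(rows, page=0, page_size=None):
    """Return one page of rows; page is zero-based, page_size=None means no paging."""
    limit = sys.maxsize if page_size is None else page_size
    offset = 0 if page_size is None else page * page_size
    return rows[offset:offset + limit]


rows = list(range(10))
print(paginate(rows))                       # every row, since page_size is None
print(paginate(rows, page=1, page_size=3))  # the second page of three rows

# A float limit fails as soon as it reaches a slice bound:
try:
    rows[0:float("+inf")]
except TypeError as exc:
    print("TypeError:", exc)
```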

stevenyoungs · 2024-11-18T18:50:23Z

gramps/gen/db/select_utils.py

+    """
+    Evaluate the where expression.
+    """
+    if op == "=":


Would it be clearer to use match/case instead of if/elif?

QuLogic · 2024-11-18

That's Python 3.10+.
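If the minimum supported Python version rules out the match statement, a dispatch table gives similar readability on older interpreters. A minimal sketch, not the PR's code: the `OPERATORS` table and the `evaluate_op` name are hypothetical, and the operator set is assumed.

```python
import operator

# Map each comparison operator string to a two-argument function.
OPERATORS = {
    "=": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "IN": lambda lhs, rhs: lhs in rhs,
}


def evaluate_op(lhs, op, rhs):
    """Evaluate a single comparison; raise on an unknown operator."""
    try:
        func = OPERATORS[op]
    except KeyError:
        raise ValueError(f"unknown operator: {op!r}") from None
    return func(lhs, rhs)


print(evaluate_op(3, "<", 5))
print(evaluate_op("a", "IN", ["a", "b"]))
```

Adding an operator then means adding one table entry rather than another elif branch.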

stevenyoungs · 2024-11-18T18:52:40Z

gramps/plugins/db/dbapi/dbapi.py

+        :param where: A single where-expression (see below)
+        :type where: tuple or list
+        :param sort_by: A list of expressions to sort on
+        :type where: tuple or list


Suggested change

-        :type where: tuple or list
+        :type sort_by: tuple or list

stevenyoungs · 2024-11-18T18:52:56Z

gramps/plugins/db/dbapi/dbapi.py

+        :param page: The page number to return (zero-based)
+        :type page: int
+        :param page_size: The size of a page in rows; None means ignore
+        :type page: int or None


Suggested change

-        :type page: int or None
+        :type page_size: int or None

stevenyoungs · 2024-11-18T19:02:55Z

gramps/gen/db/generic.py

+        :param page: The page number to return (zero-based)
+        :type page: int
+        :param page_size: The size of a page in rows; None means ignore
+        :type page: int or None


Suggested change

-        :type page: int or None
+        :type page_size: int or None

stevenyoungs · 2024-11-18T19:03:15Z

gramps/gen/db/select_utils.py

+    :param page: The page number to return (zero-based)
+    :type page: int
+    :param page_size: The size of a page in rows; None means ignore
+    :type page: int or None


Suggested change

-    :type page: int or None
+    :type page_size: int or None

dsblank · 2024-11-18T20:46:06Z

@Nick-Hall, before we invest too much time in reviewing and addressing issues, we need a couple of questions answered:

1. Can we add db.select() to the database interface? If we don't, we're going to have to write many one-liners, and custom filter code devs won't have access to any SQL selections needed for the optimization (next PR) until they are added to the db layer.
2. If we do add db.select(), do we need to keep the pure-Python version? I'm ok either way.

stevenyoungs · 2024-11-18T22:34:23Z

If I read the code correctly, the current implementation of db.select() is limited to a single table.
Is there a way to extend this to support multi-table queries, e.g. get all the women who were born in 1850?
Your PR will improve performance for this query, but we still end up bringing more rows into memory than if we ran the entire query within the DB.
Without thinking about it too hard, you'd bring rows for all births in 1850 and all women and intersect the two results in Python.

Any query would still need transforming from "Gramps SQL" into the SQL dialect used by the DB, given Gramps' support of different DBs and their differing syntax, especially for querying JSON.

dsblank · 2024-11-19T00:16:20Z

> If I read the code correctly, the current implementation of db.select() is limited to a single table.

Yes, that is correct. I intended this as a compromise: add just enough that it can be converted into SQL, but can also run without SQL if needed (the pure-Python version). There has been a lot of debate over this issue over the years, and I didn't want to wade into that.

> Is there a way to extend this to support multi-table queries, e.g. get all the women who were born in 1850?

The line I am drawing is that if you need a JOIN, then you should write a "business logic" method.

> Your PR will improve performance for this query but we still end up bringing more rows into memory than if we ran the entire query within the DB. Without thinking about it too hard, you'd bring rows for all births in 1850 and all women and intersect the two results in python.

As long as the number of selected handles is smaller than the total number of rows, it will be faster, but at the expense of some memory. In rule.prepare(), we can decide not to build rule.map if it would be too big.
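A minimal sketch of that trade-off, under stated assumptions: `MAX_MAP_SIZE` is a hypothetical threshold, `FakeDb` stands in for a database exposing `select()`, and the rule falls back to the per-object check when the map is skipped. None of these names come from the PR.

```python
MAX_MAP_SIZE = 2  # hypothetical cap, kept tiny for the demonstration


class FakeDb:
    """Stand-in for a database; select() returns pre-canned matching rows."""

    def __init__(self, matching_handles):
        self.matching_handles = matching_handles

    def select(self, table, selections, where):
        return [{"handle": handle} for handle in self.matching_handles]


class HasTagRule:
    def __init__(self, tag_handle):
        self.tag_handle = tag_handle
        self.map = None  # None means "no precomputed map; check each object"

    def prepare(self, db):
        where = ("$.tag_list", "LIKE", f'%"{self.tag_handle}"%')
        handles = {row["handle"] for row in db.select("person", ["$.handle"], where)}
        # Keep the map only if it is small enough to be worth the memory.
        self.map = handles if len(handles) <= MAX_MAP_SIZE else None

    def apply_to_one(self, db, data):
        if self.map is not None:
            return data["handle"] in self.map
        # Fallback: examine the object itself, as the original rule does.
        return self.tag_handle in data["tag_list"]


small = HasTagRule("t1")
small.prepare(FakeDb(["p1", "p2"]))
large = HasTagRule("t1")
large.prepare(FakeDb(["p1", "p2", "p3"]))
print(small.map is not None, large.map is not None)
```

This sketch still materializes the set before discarding it; a real implementation would probably check a row count first. It only illustrates the fallback path.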

> Any query would still need transforming from "gramps SQL" into the SQL dialect used by the DB given gramps support of different DBs and their differing syntax, especially for querying JSON.

Yes, another reason to keep this simple: it will be overloaded in MongoDB, etc.
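To illustrate why a small expression language is easy to re-target per backend, here is a minimal sketch that turns a nested where-tuple into parameterized SQLite SQL. It is not the PR's implementation: `where_to_sql` is a hypothetical name, the `json_data` column name and the field names in the example are assumed, and only a handful of operators are covered. SQLite's `json_extract()` function is real; other backends would emit different syntax.

```python
COMPARISONS = {"=", "!=", "<", "<=", ">", ">=", "LIKE"}
CONNECTIVES = {"and", "or"}


def where_to_sql(where, column="json_data"):
    """Return (sql, params) for a where-tuple, using ? placeholders for values."""
    lhs, op, rhs = where
    if op.lower() in CONNECTIVES:
        # Both sides are themselves where-tuples: recurse and combine.
        left_sql, left_params = where_to_sql(lhs, column)
        right_sql, right_params = where_to_sql(rhs, column)
        return f"({left_sql} {op.upper()} {right_sql})", left_params + right_params
    if op.upper() in COMPARISONS:
        # The JSON path and the value are passed as parameters, never interpolated.
        return f"json_extract({column}, ?) {op.upper()} ?", [lhs, rhs]
    raise ValueError(f"unsupported operator: {op!r}")


# Field names below are illustrative assumptions.
where = (
    ("$.gender", "=", 0),
    "or",
    (("$.private", "=", 1), "and", ("$.gramps_id", "LIKE", "I00%")),
)
sql, params = where_to_sql(where)
print(sql)
print(params)
```

A MongoDB backend could walk the same tuple structure and build a query document instead of a SQL string, which is the appeal of keeping the expression form this small.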

Call-Me-Dave · 2024-11-20T19:57:30Z

Would MessagePack be useful?

It's like JSON, but fast and small.

MessagePack is an efficient binary serialization format. It lets you exchange data among multiple languages like JSON. But it's faster and smaller. Small integers are encoded into a single byte, and typical short strings require only one extra byte in addition to the strings themselves.

https://msgpack.org/index.html

dsblank · 2024-11-21T01:24:10Z

> Would MessagePack be useful?

Maybe for some functions, but I don't think it would help with anything related to the latest work on Gramps. JSON is useful because it is self-documenting, allows the database to be used outside of Gramps, and can be directly queried by SQL.

dsblank · 2024-11-22T19:02:03Z

I'm going to close this for now, as I don't want SQL issues getting in the way of moving forward with the filter fixes.

dsblank added 9 commits November 16, 2024 07:59

Initial version of DbGeneric.select()

a85aab0

Added docstrings and more operators

a136d3b

Implementation in SQL

9a0929c

Added page, page_size; refactor for better SQL variations

602609a

Linting

9b172b7

Fixed a bug in indexable where clause; change page_size default

6f7882e

((lhs, 'and', rhs), 'or' (lhs, 'and', rhs))

8ccb896

Properly replace values with ?

7ea15f5

Linting

0930570

dsblank self-assigned this Nov 16, 2024

dsblank added the enhancement label Nov 16, 2024

dsblank requested a review from Nick-Hall November 16, 2024 13:31

stevenyoungs reviewed Nov 18, 2024

Nick-Hall mentioned this pull request Nov 21, 2024

Switch from pickled blobs to JSON data #1786

Merged

dsblank closed this Nov 22, 2024
