Queries do not scale well when table grows in size #77
Thanks! I haven't yet had the problem where trees became so big that the performance suffered too much. I have a list of interesting links stashed away, e.g. I suspect that converting the …
The queries aren't equivalent at all though?
The docs are lacking in this area, but …
Interesting links, thanks! Coincidentally, I think the first link has a query that is similar to one that our developers wrote (if you exclude the path and depth info):

```sql
with recursive tree (id) as (
    -- Base case
    SELECT id
    from storage_directory
    where id = 104113
    UNION ALL
    -- Recurse
    select sd.id
    from storage_directory sd
    join tree on sd.parent_id = tree.id
)
select count(*) from tree;
```

We're really only interested in getting the descendants / ancestors, and for us it would be completely fine to write these queries ourselves as two different queries, rather than try to make one query that works for both cases. So maybe that's what we're going to do, perhaps with something like django-cte, as a last resort unless we can fix these performance problems.

I suspected the queries weren't equivalent, but we did get back the same rows in both of our tests, where we selected a node with many descendants (a non-parent node) in both queries. Big disclaimer though: I haven't used recursive CTEs before, so this is all based on some rudimentary testing. But my concern still remains: the original query constructs every single possible tree in the table on every query, instead of just the one you're targeting.
Ah, you're right.
Thinking out loud, I wonder if we could make an easy, non-intrusive fix just for this case, or whether we should rather pursue the materialized view optimization for very large trees or read-heavy workloads. I lean towards the latter, but the perfect is also the enemy of the good. I hope we can avoid adding many small band-aid fixes when we could fix the performance issue once and for all.
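Something along these lines is what the materialized view route could look like -- a rough PostgreSQL sketch only, assuming roots have `parent_id IS NULL` and borrowing the table and column names from the examples in this thread rather than from anything the library actually generates:

```sql
-- Illustrative sketch only: precompute depth/path for every node once,
-- then serve descendant lookups from the view instead of a per-request recursive CTE.
CREATE MATERIALIZED VIEW tree_paths AS
WITH RECURSIVE tree (id, parent_id, depth, path) AS (
    SELECT id, parent_id, 0, array[id]
    FROM storage_directory
    WHERE parent_id IS NULL
    UNION ALL
    SELECT sd.id, sd.parent_id, tree.depth + 1, tree.path || sd.id
    FROM storage_directory sd
    JOIN tree ON sd.parent_id = tree.id
)
SELECT id, parent_id, depth, path FROM tree;

-- The unique index allows REFRESH ... CONCURRENTLY; the GIN index
-- serves the "is node X somewhere in the path" lookups.
CREATE UNIQUE INDEX ON tree_paths (id);
CREATE INDEX ON tree_paths USING gin (path);

-- Descendants of node 103771 then become an indexed lookup. The trade-off is
-- refreshing the view whenever the tree changes:
--   REFRESH MATERIALIZED VIEW CONCURRENTLY tree_paths;
SELECT * FROM tree_paths WHERE path @> array[103771] ORDER BY path;
```

The read side would then no longer depend on how many unrelated trees exist; the cost moves into the refresh, which is why this mostly looks attractive for read-heavy workloads.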
Our monitoring is observing that … I forked your repo and was about to make the exact change you described, with the WHERE condition, but I got stuck since I am not familiar with custom SQLCompilers. I liked the fact that this library doesn't require building up any pre-existing structure that has to be maintained, like your previous …
Thanks for profiling this! Then it sounds like small fixes here and there would work well.
We wrote these two queries, which at first glance seem to work. We'd need to integrate them into the platform and have our tests run before confirming it, though.

Descendants:

```sql
with recursive tree (id, depth, path, sd) as (
    -- Base case
    select
        id,
        0 as depth,
        array[id] as path,
        *
    from storage_directory
    where id = 103771
    UNION ALL
    -- Recurse
    select
        sd.id,
        tree.depth+1,
        tree.path || sd.id,
        sd.*
    from storage_directory sd
    join tree on sd.parent_id = tree.id
)
select tree.depth, tree.id, tree.path, tree.name
from tree
order by (tree.depth, tree.id);
```

Explain:
Ancestors:

```sql
with recursive tree (id, steps, sd) as (
    -- Base case
    select
        id,
        0 as steps,
        *
    from storage_directory
    where id = 137811
    UNION ALL
    -- Recurse
    select
        sd.id,
        tree.steps+1,
        sd.*
    from storage_directory sd
    join tree on sd.id = tree.parent_id
)
select tree.id, tree.parent_id, tree.steps, tree.name
from tree
order by steps DESC;
```

Explain:
Both of these are performing really well according to the EXPLAIN output. We will try running them in a database with 100k+ rows. One thing to note is that this drops the fact that descendants … The ordering is also very much simplified by using …
Compare the above with EXPLAIN on the original descendants query:
This does 2 Seq Scans, which have to go through every single row in the table twice, and most likely also have to do disk reads because those rows might not be in memory. Then it has to sort the rank table, which contains every single row in the table. The CTE Scan is probably in-memory, since it is a Seq Scan on the materialized result of the CTE, but that result still contains every single row in the table.
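To confirm the disk-read part we can ask PostgreSQL directly; the `BUFFERS` option reports cache hits versus blocks fetched from outside shared buffers, e.g. on the hand-written descendants CTE from earlier in this thread:

```sql
-- "Buffers: shared hit=..." means the block was already in shared buffers,
-- "read=..." means the executor had to fetch it (from the OS cache or disk).
EXPLAIN (ANALYZE, BUFFERS)
WITH RECURSIVE tree (id) AS (
    SELECT id FROM storage_directory WHERE id = 104113
    UNION ALL
    SELECT sd.id
    FROM storage_directory sd
    JOIN tree ON sd.parent_id = tree.id
)
SELECT count(*) FROM tree;
```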
Sorry for the spam (let me know if all this info is overwhelming you!), but I wanted to let you know that we managed to implement the descendants query via django-cte:

```python
from django.contrib.postgres.fields import ArrayField
from django.db.models import F, Func, IntegerField, Value
from django.db.models.expressions import RawSQL
from django_cte import With


def descendants(self, include_self=False, **kwargs):
    """
    Returns all descendants of the current node
    """
    if not self.pk:
        raise ValueError("Cannot get descendants of an unsaved node")

    def make_cte(cte):
        # non-recursive: start from the current node
        return (
            Directory.objects.filter(id=self.pk)
            .annotate(
                path=RawSQL("array[id]", ()),
                depth=Value(0, output_field=IntegerField()),
            )
            .union(
                # recursive union: get descendants
                cte.join(Directory, parent_id=cte.col.id).annotate(
                    path=Func(
                        cte.col.path,
                        F("id"),
                        function="array_append",
                        output_field=ArrayField(IntegerField()),
                    ),
                    depth=cte.col.depth + Value(1, output_field=IntegerField()),
                ),
                all=True,
            )
        )

    cte = With.recursive(make_cte)
    qs = cte.queryset().with_cte(cte).order_by(cte.col.depth, cte.col.id)
    if not include_self:
        qs = qs.exclude(id=self.pk)
    return qs
```

I'm wondering: could this library be simplified if it leveraged django-cte?
When I started out with this, django-cte wasn't advanced enough and I needed/wanted something simple quickly. I can very well imagine that django-cte could be revisited, either as an alternative or as a base. It still doesn't feel too good since it's maintained somewhat actively, but not too actively -- the test suite only tests Django 4.2, nothing newer.
Yes, this could be a good plan. It really depends on the complexity tax we have to pay in django-tree-queries. As you say, use cases up to a few hundred or thousand nodes do not need the optimization. Larger trees do need it, but I'm a bit selfish here and do not want to pay too much of this tax myself since the solution doesn't matter too much to my use cases. The tree building definitely appears in traces though and it would probably make sense anyway.
Not at all, it's interesting! I appreciate you spending the time explaining all of this and bringing new ideas to the table.
For what it's worth, I have a table with about 400k entries and found that the performance of ancestors() and descendants() has dropped significantly. I really liked the simplicity of this library, but the performance issues are starting to show. @AlexanderArvidsson do you have a working snippet for the ancestors query as well?
I'm sorry to hear that. Can you pinpoint the change which led to this regression? Probably the introduction of the rank table? I thought allowing more types of ordering was a good idea, but maybe it wasn't! There's certainly a way to make this work again, and if not, removing the offending features would certainly be worth a thought.
I'm on version …
Yes, I do. We just put this as a method on the Django model. No caveats so far; it is doing very well for us and we have a lot of rows.

```python
from django.db.models import IntegerField, Value
from django_cte import With


def ancestors(self, of=None, include_self=False):
    """
    Returns all ancestors of the current node, in order of their depth relative to the root
    """
    if not self.pk:
        raise ValueError("Cannot get ancestors of an unsaved node")

    def make_cte(cte):
        # non-recursive: start from the current node (or the node passed via `of`);
        # `pk` is a small helper that accepts either a model instance or a raw primary key
        return (
            Directory.objects.filter(id=pk(of or self.pk))
            .annotate(
                _steps=Value(0, output_field=IntegerField()),
            )
            .union(
                # recursive union: get ancestors
                cte.join(Directory, id=cte.col.parent_id).annotate(
                    _steps=cte.col._steps + Value(1, output_field=IntegerField()),
                ),
                all=True,
            )
        )

    cte = With.recursive(make_cte)
    qs = cte.queryset().with_cte(cte).order_by(cte.col._steps.desc())
    if not include_self:
        qs = qs.exclude(id=self.pk)
    return qs
```

Since I reported this issue, we've unfortunately had to ditch this library. @matthiask Don't let any of this discourage you (you shouldn't apologize, your use case just wasn't the same as ours); I'm grateful for this library and your previous …
First of all, great library, I love the simplicity! But I think there's a big scaling problem here:
From reading how the query works and analyzing it, it seems (correct me if I'm wrong) that it goes through every root in the entire table, builds the trees, and then selects descendants or ancestors from that.
For example, descendants are selected by finding any row that includes the parent in `__tree.tree_path`.

In our case, we tested performance with a table of 100k+ rows with deeply nested rows.
We had one tree with about 150k nested rows, with the rest of the table consisting of very small trees of just a couple of hundred rows each. Because this one tree existed, it affected the performance of even empty trees.
This query took several seconds to complete, even when selecting a parent that had literally no children at all. That's because, regardless of what parent is selected, the query will still calculate every single tree in the entire table.
By making a simple change to the base select condition and removing the ANY check, this query was reduced to a mere 50ms.
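To illustrate what I mean, here is a rough sketch of the changed query -- not the SQL this library actually generates; column names like `tree_depth` and `tree_pk` are placeholders modelled on the `__tree.tree_path` mentioned above:

```sql
WITH RECURSIVE __tree (tree_depth, tree_path, tree_pk) AS (
    -- Original base case builds every tree in the table, e.g.:
    --   SELECT 0, array[id], id FROM storage_directory WHERE parent_id IS NULL
    -- Changed base case: start directly at the selected parent.
    SELECT 0, array[id], id
    FROM storage_directory
    WHERE id = 103771
    UNION ALL
    SELECT __tree.tree_depth + 1, __tree.tree_path || sd.id, sd.id
    FROM storage_directory sd
    JOIN __tree ON sd.parent_id = __tree.tree_pk
)
SELECT sd.*
FROM storage_directory sd
JOIN __tree ON sd.id = __tree.tree_pk
-- The ANY check becomes unnecessary, because the CTE now only contains
-- the selected subtree instead of every tree:
--   WHERE 103771 = ANY(__tree.tree_path)
ORDER BY __tree.tree_path;
```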
I know the above change is obviously not scalable and might very well only apply to this query, and it obviously doesn't work for the ancestors method, but I am still questioning the decision to calculate every single tree in the entire table and then extract the one you need.
I'm wondering, is there a way to change this without drastic changes, so that it scales well with huge tables? For descendants, it just needs to traverse down from the selected parent, and for ancestors it just needs to traverse up.
If tracking depth, path, etc. is important, then maybe descendants could first traverse upwards until it finds the root node, then traverse down from there and select the subtree?
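As a plain-SQL sketch of that idea (illustrative only, reusing the example table from above): first walk up from the selected node to its root, then build just that one tree while keeping the depth/path bookkeeping:

```sql
WITH RECURSIVE ancestors (id, parent_id) AS (
    -- Phase 1: walk up from the selected node to its root
    SELECT id, parent_id FROM storage_directory WHERE id = 103771
    UNION ALL
    SELECT sd.id, sd.parent_id
    FROM storage_directory sd
    JOIN ancestors a ON sd.id = a.parent_id
),
subtree (id, depth, path) AS (
    -- Phase 2: build only the tree that contains the selected node
    SELECT id, 0, array[id]
    FROM storage_directory
    WHERE id = (SELECT id FROM ancestors WHERE parent_id IS NULL)
    UNION ALL
    SELECT sd.id, s.depth + 1, s.path || sd.id
    FROM storage_directory sd
    JOIN subtree s ON sd.parent_id = s.id
)
SELECT *
FROM subtree
WHERE 103771 = ANY(path)  -- keep only the selected node and its descendants
ORDER BY path;
```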
Thanks!