Optimize the performance of IN operator for indexed properties #8698
Comments
➤ PM Bot commented: Jira ticket: RCOCOA-2443 |
I am having difficulty reproducing this issue. We created a PersonClass with a number of properties including a
We then have a function that creates 100,000 PersonClass with the key property populated with "person" + an index from 0 to 99,999. e.g. the Realm file will contain PersonClass objects with key values of person0, person1, person2....person99999. We then use the code posted in your report, to query the
and the output to console is: Load completed of 31621 In the above test, an |
Ok, I'll see if I can hand you a better code snippet that repros this issue |
@Jaycyn Ok, I found what was missing - the
class PersonClass: Object {
    @Persisted(primaryKey: true) var _id: ObjectId
    @Persisted(indexed: true) var key = ""
}
this alone increases the query time from
But, if you add another indexed string property and query with && like so:
class PersonClass: Object {
    @Persisted(primaryKey: true) var _id: ObjectId
    @Persisted(indexed: true) var key1 = ""
    @Persisted(indexed: true) var key2 = ""
}
realm.objects(PersonClass.self).where {
    $0.key1.in(values) && $0.key2 == "<some_value>"
}
... the query takes several minutes to complete! |
The code I provided above has no indexing at all and as you can see completes in approximately 1.5s. While describing the issue is important, being able to duplicate it is equally important - that will help eliminate other issues; environmental, additional code running etc. Can you provide a minimal, duplicatable example? |
@Jaycyn that's the duplicatable example - add indexing to your sample and the performance would be much worse |
Hmm. I changed the key to being indexed
and the result is still an avg of 1.5s |
Ok, I'll just copy-paste the entire sample so we don't miss anything:
// Object definition
class PersonClass: Object {
    @Persisted(primaryKey: true) var _id: ObjectId
    @Persisted(indexed: true) var key1 = ""
    @Persisted(indexed: true) var key2 = ""
}
let num = 49999
// Object generation
let objects = (0...num).map { index in
    let obj = PersonClass()
    obj._id = .generate()
    obj.key1 = "person\(index)"
    obj.key2 = "salt\(index % 3)"
    return obj
}
try! realm.write {
    realm.add(objects, update: .all)
}
var values = [String]()
for _ in 0...num {
    let randomNum = (0...num).randomElement() ?? num
    let v = "person\(randomNum)"
    values.append(v)
}
// Testing the query performance
if #available(iOS 16.0, *) {
    let elapsedTime = ContinuousClock().measure {
        print("performing query")
        let results = realm.objects(PersonClass.self).where {
            $0.key1.in(values) && $0.key2 == "salt0"
        }
        let total = realm.objects(PersonClass.self).count
        print("Load completed of \(results.count) objects (\(total) total)")
    }
    print("\(elapsedTime)")
} |
Well, that's a very odd thing; I was able to duplicate the issue using that code. A few observations and more info: The issue only occurs when the object has two indexed properties and a query is performed using .where on one property using .in, and the query also includes && on the other property querying for a specific string. In other words, the query works perfectly if it's one OR the other.
Even splitting it up and performing two separate queries does not work.
Note after the fact: this does not resolve to "separate queries"; Realm combines them into one, so it's just one NSPredicate |
Update: because Realm is lazy-loading, the query is not actually executed until the print statement, and the issue still reproduces with that statement left in, so this original post was inaccurate and has been edited. --- below is incorrect ---
This is the issue
This code now works correctly
|
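The lazy-evaluation pitfall described in this comment can be illustrated without Realm. The sketch below is only an analogy that uses Swift's standard lazy collections, not Realm's query engine, and `lazyDemo` is a hypothetical helper name. It counts predicate evaluations to show that building a lazy filter does no work, and that the cost is paid only when the results are accessed.

```swift
// Analogy only: a standard-library lazy filter, not a Realm query.
// `lazyDemo` is a hypothetical helper introduced for illustration.
func lazyDemo() -> (beforeAccess: Int, afterAccess: Int, matchCount: Int) {
    var evaluations = 0
    let keys = (0..<1_000).map { "person\($0)" }

    // Building the lazy filter runs the predicate zero times.
    let matches = keys.lazy.filter { key -> Bool in
        evaluations += 1
        return key.hasSuffix("7")
    }
    let beforeAccess = evaluations

    // Materializing the results is what forces the predicate to run.
    let matchCount = Array(matches).count
    return (beforeAccess, evaluations, matchCount)
}

let demo = lazyDemo()
print("evaluations before access: \(demo.beforeAccess)")
print("evaluations after access: \(demo.afterAccess)")
print("matches: \(demo.matchCount)")
```

When timing a lazily evaluated query, the measured block therefore needs to include the first access to the results (such as reading `count`), otherwise only the construction of the query is measured.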
We have an optimized lookup for IN and chained OR EQUALS that constructs a set and does a table scan, and an optimized lookup to find rows which have a specific value in the index. Which one is faster depends on the size of the table, the number of values in the IN, and what portion of the table matches the query, so we have some heuristics there. I'm guessing that adding the second property is switching from the table scan to the index lookup, which in this case is slower. Without the |
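To make the two strategies in this comment concrete, here is a minimal sketch in plain Swift. It is not Realm's implementation: `Row`, `scanWithSet`, `buildIndex`, and `lookupInIndex` are hypothetical names, and the "index" is simply a dictionary from key to row positions. The first function builds a set from the IN values and scans every row once; the second looks up each IN value in the index.

```swift
// Hypothetical row type standing in for a table row.
struct Row {
    let key1: String
    let key2: String
}

// Strategy A: build a Set from the IN values, then scan the whole table once.
// Cost grows with the number of rows, with an O(1) membership check per row.
func scanWithSet(_ rows: [Row], values: [String]) -> [Int] {
    let wanted = Set(values)
    return rows.indices.filter { wanted.contains(rows[$0].key1) }
}

// A toy index (assumption: a dictionary from key1 to matching row positions).
func buildIndex(_ rows: [Row]) -> [String: [Int]] {
    var index: [String: [Int]] = [:]
    for (position, row) in rows.enumerated() {
        index[row.key1, default: []].append(position)
    }
    return index
}

// Strategy B: look up every IN value in the index.
// Cost grows with the number of IN values rather than the number of rows.
func lookupInIndex(_ index: [String: [Int]], values: [String]) -> [Int] {
    var hits = Set<Int>()
    for value in values {
        for position in index[value] ?? [] {
            hits.insert(position)
        }
    }
    return hits.sorted()
}

let rows = (0..<10).map { Row(key1: "person\($0)", key2: "salt\($0 % 3)") }
let values = ["person2", "person5", "person5", "person42"]
print(scanWithSet(rows, values: values))
print(lookupInIndex(buildIndex(rows), values: values))
```

Both functions return the same row positions. Which one does less work depends on the table size and the number of IN values, which is why a heuristic is needed to choose between them.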
I've run the query on these similar classes to find out how indexing is affecting the performance:
class OneKeyNoIndex: Object {
    @Persisted(primaryKey: true) var _id: ObjectId
    @Persisted(indexed: false) var key1 = ""
}
class OneKeyIndexed: Object {
    @Persisted(primaryKey: true) var _id: ObjectId
    @Persisted(indexed: true) var key1 = ""
}
class TwoKeysNoIndex: Object {
    @Persisted(primaryKey: true) var _id: ObjectId
    @Persisted(indexed: false) var key1 = ""
    @Persisted(indexed: false) var key2 = ""
}
class TwoKeysIndexed: Object {
    @Persisted(primaryKey: true) var _id: ObjectId
    @Persisted(indexed: true) var key1 = ""
    @Persisted(indexed: true) var key2 = ""
}
class TwoKeysMixIndex1: Object {
    @Persisted(primaryKey: true) var _id: ObjectId
    @Persisted(indexed: false) var key1 = ""
    @Persisted(indexed: true) var key2 = ""
}
class TwoKeysMixIndex2: Object {
    @Persisted(primaryKey: true) var _id: ObjectId
    @Persisted(indexed: true) var key1 = ""
    @Persisted(indexed: false) var key2 = ""
}
And here are the times:
OneKeyNoIndex 0.238098 seconds
OneKeyIndexed 12.283518041999999 seconds
TwoKeysNoIndex 0.227876334 seconds
TwoKeysIndexed 122.770053084 seconds
TwoKeysMixIndex1 0.279000875 seconds
TwoKeysMixIndex2 66.472070625 seconds
Observations:
There is clearly an issue with indexing: it dramatically slows things down! |
Ah, of course @tgoyne. We've not been using Realm as we migrate to another platform and forgot about how lazy it is! A great feature that really nobody else offers. However, that doesn't appear to account for both queries being very fast on their own, but not together... or does it? So on this
Our testing shows these two queries are roughly equivalent in performance. Running each query over 10 iterations shows just a minor difference: 1.72s average vs 0.03s average
The first being a |
Problem
The issue was first reported over 6 years ago on Stack Overflow, but it is still relevant for the latest Realm versions.
When you use the predicate key IN %@ and supply a collection with a large number of values (over a thousand), the filtering happens SO slowly that filtering outside the query element-by-element outperforms it by 50x, despite being inefficient on its own.
The use case where this is needed:
We run a complex query on a large data set, which is done on a background thread so the UI isn't blocked. It passes a set of the resulting objects' ids to the main thread, where we can quickly query objects by ids without blocking the UI.
Solution
It looks like the operator <key> IN <collection> is translated into <key> == value_1 OR <key> == value_2 OR ..., which works well on small sets of values, but quickly degrades when the collection is large enough.
I hope there is a way to optimize this part of the algorithm, at least when the supplied collection supports O(1) element presence check, like for Set<String>.
Alternatives
There is really no alternative to this API other than hacks like supplying a subset of the collection for values, or filtering manually after the query
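The two workarounds mentioned above can be sketched in plain Swift, independent of Realm. All names here (`Person`, `chunked`, `filterManually`) are hypothetical and introduced only for illustration; in a real app each chunk would be passed to a separate IN query, and the manual filter would run over the objects returned by a broader query.

```swift
// Hypothetical stand-in for a stored object.
struct Person {
    let id: Int
    let key: String
}

// Workaround 1: split the IN values into smaller subsets,
// so each query receives only `size` values at a time.
func chunked<T>(_ items: [T], size: Int) -> [[T]] {
    precondition(size > 0, "chunk size must be positive")
    return stride(from: 0, to: items.count, by: size).map { start in
        Array(items[start..<min(start + size, items.count)])
    }
}

// Workaround 2: filter manually after the query,
// using a Set for an O(1) presence check per element.
func filterManually(_ people: [Person], keys: [String]) -> [Person] {
    let wanted = Set(keys)
    return people.filter { wanted.contains($0.key) }
}

let people = (0..<6).map { Person(id: $0, key: "person\($0)") }
let keys = ["person1", "person4", "person9"]
print(chunked(keys, size: 2))
print(filterManually(people, keys: keys).map(\.id))
```

Both approaches keep the number of values per query small or avoid the IN predicate entirely, at the cost of extra code and, for manual filtering, of loading more objects than needed.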
How important is this improvement for you?
Would be a major improvement
Feature would mainly be used with
Local Database only