[ENH] Domain transformations in batches for less memory use #5218
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #5218      +/-   ##
==========================================
+ Coverage   86.26%   86.28%   +0.01%
==========================================
  Files         301      301
  Lines       61120    61206      +86
==========================================
+ Hits        52725    52809      +84
- Misses       8395     8397       +2
@markotoplak, although you nicely split this into commits, this is one of those PRs for which it's difficult to foresee all the consequences and potentially forgotten scenarios.
I saw what you've done and agree with the idea, but I can't say whether you forgot anything. @lanzagar is much better at this. I therefore suggest that @lanzagar reviews it, or that we merge it immediately after the release so bugs will have some time to appear.
Orange/data/table.py

def _idcache_restore(cachedict, keys):
    # key is tuple(list) not tuple(genexpr) for speed
    shared, weakrefs = cachedict.get(tuple([id(k) for k in keys]), (None, []))
You can omit creating the list: tuple(id(k) for k in keys), or even use tuple(map(id, keys)). Same in the function above.
This is, of course, an extremely important comment.
Yes, at first I had tuple(id(k) for k in keys), but it measured significantly slower than tuple([id(k) for k in keys]). I have no idea why, though.
I started falling into a black hole of Python source, but caught myself and stopped. This is very interesting and perhaps worth reporting if it is reproducible in a simple case.
tuple(map(id, keys))
import time

l = ["a", "b", "c"]
N = 1000000

t = time.time()
for i in range(N):
    tuple([id(a) for a in l])
print("tuple(list)", time.time() - t)

t = time.time()
for i in range(N):
    tuple(map(id, l))
print("tuple(map)", time.time() - t)

t = time.time()
for i in range(N):
    tuple(id(a) for a in l)
print("tuple(genexpr)", time.time() - t)
With Python 3.8 on Linux I get this (times can vary a bit, but the differences are usually similar):
tuple(list) 0.3450343608856201
tuple(map) 0.32802605628967285
tuple(genexpr) 0.4420166015625
So yes, I'll change it to map.
/rebase
Force-pushed from 05df183 to 489d232
Domain conversion in batches, which saves memory for intermediate results. Information about domain conversion is cached and reused within all parts.
This allows better caching of domain transformations in from_table
Table.from_table_rows is now needed within get_columns
Force-pushed from 489d232 to 7d8e1fd
Issue
Orange does domain transformations all the time. They are most obvious in Test Learner or Predict, if you pass test data in a different domain than the learner supports, but they are really everywhere. For example, all preprocessing is based on domain transformations.
Domain transformations can be chained. As currently implemented, each of these transformations is performed on the whole data table and produces an intermediate table of the same length. All these intermediate tables count towards Orange's memory use.
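For illustration, here is what such a chain looks like with standard Orange preprocessors (a sketch; the dataset is just an example):

from Orange.data import Table
from Orange.preprocess import Continuize, Normalize

data = Table("heart_disease")          # example dataset
continuized = Continuize()(data)       # transformation 1: builds a new domain
normalized = Normalize()(continuized)  # transformation 2: chained on top

# Converting data into normalized.domain replays the whole chain and,
# on master, materializes a full-length intermediate table per step.
converted = data.transform(normalized.domain)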
Description of changes
This PR makes domain transformations work in batches. Essentially, the space complexity of a domain transformation, which used to be O(rows), drops to O(1). In the future, we could easily add callbacks to from_table to report progress (in a separate PR). Also, different parts could be handled by different threads.
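To make the idea concrete, here is a minimal sketch (a hypothetical helper, not the actual code in this PR) of applying a chain of transformations one batch of rows at a time:

import numpy as np

PART = 5000  # batch size chosen in this PR

def chained_transform(X, transforms):
    # Hypothetical sketch: each transform maps an (n, k) array to (n, k).
    # Intermediate arrays exist only for the current batch, so the extra
    # memory is O(PART) rather than O(rows).
    out = np.empty_like(X)
    for start in range(0, len(X), PART):
        part = X[start:start + PART]
        for t in transforms:
            part = t(part)
        out[start:start + PART] = part
    return out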
What is a good batch size? I do not know, but according to my tests with mocked feature transformations, a size of 5000 does not slow anything down. Even better: the transformations are now sometimes even faster. I do not know why, but perhaps it is due to fewer memory allocations/deallocations in the new branch. There is still some overhead that I haven't been able to squeeze out (also, with smaller parts we lose some numpy efficiency), so for now, 5000 it is.
I tried to keep the history clean to separate the functional changes from the refactoring. Refactoring from_table is probably the largest change in this PR.
Here is memory use (in MB) over time for the script below. The red line is master; the green line is this branch with PART=5000. The task was normalization and PCA. Please ignore the obviously inefficient normalization and focus on the rightmost part of the graph: there, the final transformation into the PCA domain required an additional 2000 MB on master, while on this branch we do not see any additional memory use.
Why is the total memory use at the end smaller for the new branch? In theory, it should not be, but Python does not release memory back to the OS (it aims to recycle it within its own process), except for large contiguous allocations.
For now, I set the part size to 10 and ran the tests, then set it to 5000 and reran them. Both passed, but we should somehow continuously test with very small part sizes. Suggestions on how to do this elegantly are welcome!
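One possibility (a sketch; it assumes the batch size is a module-level constant, here hypothetically Orange.data.table.PART) is to patch the constant in a dedicated test:

import unittest
from unittest.mock import patch

import Orange.data.table
from Orange.data import Table
from Orange.preprocess import Normalize

class TestSmallBatches(unittest.TestCase):
    def test_transform_with_tiny_parts(self):
        # Hypothetical: shrink the batch size so the batched code path
        # crosses many part boundaries even on a small table.
        data = Table("iris")
        with patch.object(Orange.data.table, "PART", 10):
            normalized = Normalize()(data)
        self.assertEqual(len(normalized), len(data))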
The test script:
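The original script is not reproduced on this page; the following is only a hedged reconstruction of a comparable normalization + PCA workload (dataset name and component count are illustrative):

from Orange.data import Table
from Orange.preprocess import Normalize
from Orange.projection import PCA

data = Table("brown-selected")    # placeholder; the original used a larger table
normalized = Normalize()(data)    # the normalization step from the graph
pca_model = PCA(n_components=10)(normalized)
projected = pca_model(normalized) # the final transformation into the PCA domain
print(projected.X.shape)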