
[ENH] Domain transformations in batches for less memory use #5218

Merged
merged 13 commits into from
Mar 22, 2021

Conversation

markotoplak
Member

Issue

Orange does domain transformations all the time. They are most obvious in Test Learner or Predict, if you pass test data in a different domain than the learner supports, but they are really everywhere. For example, all preprocessing is based on domain transformations.

Domain transformations can be chained. As currently implemented, each transformation in the chain is performed on the whole data table and produces an intermediate table of the same length. All these intermediate tables count towards Orange's memory use.

Description of changes

This PR makes domain transformations work in batches. Essentially, the space complexity of a domain transformation, which used to be O(rows), drops to O(1) (not counting the output table itself). In the future we could easily add callbacks to from_table to report progress (a different PR). Different parts could also be handled by different threads.

What is a good batch size? I do not know, but according to my tests with mocked feature transformations, a size of 5000 does not slow anything down. Even better: the transformations are now sometimes even faster. I do not know why, but perhaps it is due to fewer memory allocations/deallocations in the new branch. There is still some overhead that I haven't been able to squeeze out (also, with smaller parts we lose some numpy efficiency), so for now, 5000 it is.
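To illustrate the idea (a simplified sketch, not Orange's actual from_table implementation), a batched transformation only ever materializes one part-sized intermediate at a time, writing each chunk into a preallocated output array:

```python
import numpy as np

PART = 5000  # batch size; 5000 measured as a good trade-off in this PR


def transform_in_batches(X, transform, part=PART):
    """Apply `transform` (a row-wise mapping that preserves the number of
    rows) to X in chunks of `part` rows. Only one part-sized intermediate
    array exists at any time, instead of a full-length copy per step.

    This is a hypothetical helper sketching the batching idea, not the
    code added to Table.from_table in this PR.
    """
    first = transform(X[:part])  # first chunk also reveals the output width
    out = np.empty((len(X), first.shape[1]), dtype=first.dtype)
    out[:part] = first
    for start in range(part, len(X), part):
        out[start:start + part] = transform(X[start:start + part])
    return out
```

Chaining several such steps keeps the extra memory bounded by the part size rather than by the number of rows.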

I tried to make history clean to separate the functional changes from refactoring. Refactoring from_table is probably the largest change in this PR.

Here is memory use (MB) over time for the script below. The red line is master, the green line is this branch with PART=5000. The task was normalization and PCA. Please ignore the obviously inefficient normalization and focus on the rightmost part of the graph: there, the final transformation into the PCA domain required an additional 2000 MB on master, while on this branch we see no additional memory use.

(Plot: resident memory over time; red = master, green = this branch with PART=5000.)

Why is the total memory use at the end smaller for the new branch? In theory, it should not be, but Python does not release memory back to the OS (it aims to recycle it within its own process), except for large contiguous allocations.

For now I set the part size to 10 and ran the tests, then set it to 5000 and reran them. Both passed, but we should somehow continuously test with very small part sizes. Suggestions on how to do this elegantly are welcome!
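One possible shape for such a test (hypothetical, not part of this PR) is to loop a round-trip check over several part sizes, including pathological small ones, so edge cases like part=1 and an uneven final chunk stay covered in CI:

```python
import unittest

import numpy as np


class TestBatchedTransform(unittest.TestCase):
    # Hypothetical sketch: rerun the same check for several part sizes so
    # batching edge cases (part=1, last chunk shorter than part) are
    # exercised continuously, not just in a one-off manual run.
    def test_small_part_sizes(self):
        X = np.arange(30, dtype=float).reshape(10, 3)
        for part in (1, 3, 7, 5000):
            with self.subTest(part=part):
                out = np.empty_like(X)
                for start in range(0, len(X), part):
                    out[start:start + part] = X[start:start + part] * 2
                np.testing.assert_array_equal(out, X * 2)
```

In Orange the part size would presumably be patched (e.g. with mock) onto the module-level constant instead of being a loop variable.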

The test script:

import os, psutil
import time
from threading import Thread

import numpy as np

from Orange.data import Table, Domain, ContinuousVariable
from Orange.preprocess import Normalize
from Orange.projection import PCA

def setup_dense(rows, cols):
    return Table.from_numpy(  # pylint: disable=W0201
        Domain([ContinuousVariable(str(i)) for i in range(cols)]),
        np.random.RandomState(0).rand(rows, cols))

class MemUse(Thread):
    def __init__(self):
        Thread.__init__(self)
        self.daemon = True
        self.measures = []
        self.last_resmem = 0
        self.start()
        self.stime = time.time()

    def run(self):
        while True:
            process = psutil.Process(os.getpid())
            resmem = process.memory_info().rss / (1024 * 1024)
            self.measures.append((time.time() - self.stime, resmem))
            if int(resmem) != int(self.last_resmem):
                print("t %8.1f     resmem %8d" % (time.time() - self.stime, int(resmem)))
            self.last_resmem = resmem
            time.sleep(0.5)

def wait(msg, sec=2):
    import gc
    gc.collect()
    gc.collect()
    print("WAIT " + msg)
    time.sleep(sec)
    gc.collect()

mem = MemUse()

data = setup_dense(100000, 1000)  # 800MB
wait("loaded data")

data2 = Normalize()(data)
wait("after normalize")

data2 = PCA(n_components=5)(data2)(data2)
wait("after pca")

tdomain = data2.domain
del data2
wait("after del")

data.transform(tdomain)
wait("after transform")

print(mem.measures)
Includes
  • Code changes
  • Tests ?
  • Documentation

@codecov

codecov bot commented Jan 28, 2021

Codecov Report

Merging #5218 (ebf5e47) into master (3a1168f) will increase coverage by 0.01%.
The diff coverage is 85.16%.

@@            Coverage Diff             @@
##           master    #5218      +/-   ##
==========================================
+ Coverage   86.26%   86.28%   +0.01%     
==========================================
  Files         301      301              
  Lines       61120    61206      +86     
==========================================
+ Hits        52725    52809      +84     
- Misses       8395     8397       +2     

@janezd janezd self-assigned this Jan 29, 2021
Contributor

@janezd janezd left a comment


@markotoplak, although you nicely split this into commits, this is one of those PRs for which it's difficult to foresee all consequences and potentially forgotten scenarios.

I saw what you've done and agree with the idea, but I can't say whether you forgot anything. @lanzagar is much better at this, so I suggest that @lanzagar reviews it, or that we merge it immediately after the release so bugs will have some time to appear.


def _idcache_restore(cachedict, keys):
# key is tuple(list) not tuple(genexpr) for speed
shared, weakrefs = cachedict.get(tuple([id(k) for k in keys]), (None, []))
Contributor


You can omit creating the list, tuple(id(k) for k in keys), or even use tuple(map(id, keys)). Same in the function above.

This is, of course, an extremely important comment.

Member Author

@markotoplak markotoplak Mar 19, 2021


Yes, at first I had tuple(id(k) for k in keys), but that measured as significantly slower than tuple([id(k) for k in keys]). I have no idea why though.

Contributor


I started falling into a black hole of Python source, but caught myself and stopped. This is very interesting and perhaps worth reporting if reproducible in a simple case.

Contributor


tuple(map(id, keys))

Member Author



import time

l = ["a", "b", "c"]

N = 1000000

t = time.time()
for i in range(N):
    tuple([id(a) for a in l])
print("tuple(list)", time.time() - t)

t = time.time()
for i in range(N):
    tuple(map(id, l))
print("tuple(map)", time.time() - t)

t = time.time()
for i in range(N):
    tuple(id(a) for a in l)
print("tuple(genexpr)", time.time() - t)

With Python 3.8 on Linux I get this (times can vary a bit, but usually the differences are similar).

tuple(list) 0.3450343608856201
tuple(map) 0.32802605628967285
tuple(genexpr) 0.4420166015625

So yes, I'll change it to map.

@janezd janezd removed their assignment Jan 29, 2021
@markotoplak markotoplak changed the title [ENH][RFC] Domain transformations in batches for less memory use [ENH] Domain transformations in batches for less memory use Feb 1, 2021
@markotoplak markotoplak added the merge after release Potentially unstable and needs to be tested well. label Feb 1, 2021
@irgolic irgolic removed the merge after release Potentially unstable and needs to be tested well. label Mar 12, 2021
@markotoplak
Member Author

/rebase

@lanzagar lanzagar merged commit a8b3c05 into biolab:master Mar 22, 2021
@markotoplak markotoplak deleted the from-table-progress branch November 25, 2021 15:37