
[ENH] Domain transformations in batches for less memory use #5218

Merged
merged 13 commits into from
Mar 22, 2021

Conversation

markotoplak
Member

Issue

Orange does domain transformations all the time. They are most obvious in Test Learner or Predict, if you pass test data in a different domain than the learner supports, but they are really everywhere. For example, all preprocessing is based on domain transformations.

Domain transformations can be chained. As currently implemented, each transformation in the chain is performed on the whole data table and produces an intermediate table of the same length. All these intermediate tables count towards Orange's memory use.

Description of changes

This PR makes domain transformations work in batches. Essentially, the space complexity of a domain transformation, which used to be O(rows), drops to O(1) (not counting the output table itself). In the future we could easily add callbacks to from_table to report progress (a different PR). Different parts could also be handled by different threads.

What is a good batch size? I do not know, but according to my tests with mocked feature transformations, a size of 5000 does not slow anything down. Even better: the transformations are now sometimes even faster. I do not know why, but perhaps it is due to fewer memory allocations/deallocations in the new branch. There is still some overhead that I haven't been able to squeeze out (also, with smaller parts we lose some numpy efficiency), so for now, 5000 it is.
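To illustrate the idea (a simplified sketch, not Orange's actual from_table implementation), a batched transformation only ever materializes one part-sized intermediate at a time, writing each chunk into a preallocated output array:

```python
import numpy as np

PART = 5000  # batch size; 5000 measured as a good trade-off in this PR


def transform_in_batches(X, transform, part=PART):
    """Apply `transform` (a row-wise mapping that preserves the number of
    rows) to X in chunks of `part` rows. Only one part-sized intermediate
    array exists at any time, instead of a full-length copy per step.

    This is a hypothetical helper sketching the batching idea, not the
    code added to Table.from_table in this PR.
    """
    first = transform(X[:part])  # first chunk also reveals the output width
    out = np.empty((len(X), first.shape[1]), dtype=first.dtype)
    out[:part] = first
    for start in range(part, len(X), part):
        out[start:start + part] = transform(X[start:start + part])
    return out
```

Chaining several such steps keeps the extra memory bounded by the part size rather than by the number of rows.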

I tried to make history clean to separate the functional changes from refactoring. Refactoring from_table is probably the largest change in this PR.

Here is memory use (MB) over time for the script below. The red line is master, the green line is this branch with PART=5000. The task was normalization and PCA. Please ignore the obviously inefficient normalization and focus on the rightmost part of the graph: there, the final transformation into the PCA domain required an additional 2000 MB on master, while on this branch we see no additional memory use.

(Plot: resident memory over time; red = master, green = this branch with PART=5000.)

Why is the total memory use at the end smaller for the new branch? In theory, it should not be, but Python does not release memory back to the OS (it aims to recycle it within its own process), except for large contiguous allocations.

For now I set the part size to 10 and ran the tests, then set it to 5000 and reran them. Both passed, but we should somehow continuously test with very small part sizes. Suggestions on how to do this elegantly are welcome!
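One possible shape for such a test (hypothetical, not part of this PR) is to loop a round-trip check over several part sizes, including pathological small ones, so edge cases like part=1 and an uneven final chunk stay covered in CI:

```python
import unittest

import numpy as np


class TestBatchedTransform(unittest.TestCase):
    # Hypothetical sketch: rerun the same check for several part sizes so
    # batching edge cases (part=1, last chunk shorter than part) are
    # exercised continuously, not just in a one-off manual run.
    def test_small_part_sizes(self):
        X = np.arange(30, dtype=float).reshape(10, 3)
        for part in (1, 3, 7, 5000):
            with self.subTest(part=part):
                out = np.empty_like(X)
                for start in range(0, len(X), part):
                    out[start:start + part] = X[start:start + part] * 2
                np.testing.assert_array_equal(out, X * 2)
```

In Orange the part size would presumably be patched (e.g. with mock) onto the module-level constant instead of being a loop variable.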

The test script:

import os, psutil
import time
from threading import Thread

import numpy as np

from Orange.data import Table, Domain, ContinuousVariable
from Orange.preprocess import Normalize
from Orange.projection import PCA

def setup_dense(rows, cols):
    return Table.from_numpy(  # pylint: disable=W0201
        Domain([ContinuousVariable(str(i)) for i in range(cols)]),
        np.random.RandomState(0).rand(rows, cols))

class MemUse(Thread):
    def __init__(self):
        Thread.__init__(self)
        self.daemon = True
        self.measures = []
        self.last_resmem = 0
        self.start()
        self.stime = time.time()

    def run(self):
        while True:
            process = psutil.Process(os.getpid())
            resmem = process.memory_info().rss / (1024 * 1024)
            self.measures.append((time.time() - self.stime, resmem))
            if int(resmem) != int(self.last_resmem):
                print("t %8.1f     resmem %8d" % (time.time() - self.stime, int(resmem)))
            self.last_resmem = resmem
            time.sleep(0.5)

def wait(msg, sec=2):
    import gc
    gc.collect()
    gc.collect()
    print("WAIT " + msg)
    time.sleep(sec)
    gc.collect()

mem = MemUse()

data = setup_dense(100000, 1000)  # 800MB
wait("loaded data")

data2 = Normalize()(data)
wait("after normalize")

data2 = PCA(n_components=5)(data2)(data2)
wait("after pca")

tdomain = data2.domain
del data2
wait("after del")

data.transform(tdomain)
wait("after transform")

print(mem.measures)
Includes
  • Code changes
  • Tests ?
  • Documentation

@codecov

codecov bot commented Jan 28, 2021

Codecov Report

Merging #5218 (ebf5e47) into master (3a1168f) will increase coverage by 0.01%.
The diff coverage is 85.16%.

@@            Coverage Diff             @@
##           master    #5218      +/-   ##
==========================================
+ Coverage   86.26%   86.28%   +0.01%     
==========================================
  Files         301      301              
  Lines       61120    61206      +86     
==========================================
+ Hits        52725    52809      +84     
- Misses       8395     8397       +2     

@janezd janezd self-assigned this Jan 29, 2021
Contributor

@janezd janezd left a comment


@markotoplak, although you nicely split this into commits, this is one of those PRs for which it's difficult to foresee all consequences and potentially forgotten scenarios.

I saw what you've done and agree with the idea, but I can't say whether you forgot anything. @lanzagar is much better at this, so I suggest that @lanzagar reviews it, or that we merge it immediately after the release so bugs will have some time to appear.


def _idcache_restore(cachedict, keys):
# key is tuple(list) not tuple(genexpr) for speed
shared, weakrefs = cachedict.get(tuple([id(k) for k in keys]), (None, []))
Contributor


You can omit creating the list, tuple(id(k) for k in keys), or even use tuple(map(id, keys)). Same in the function above.

This is, of course, an extremely important comment.

Member Author

@markotoplak markotoplak Mar 19, 2021


Yes, at first I had tuple(id(k) for k in keys), but that measured as significantly slower than tuple([id(k) for k in keys]). I have no idea why though.

Contributor


I started falling into a black hole of Python source, but caught myself and stopped. This is very interesting and perhaps worth reporting if reproducible in a simple case.

Contributor


tuple(map(id, keys))

Member Author



import time

l = ["a", "b", "c"]

N = 1000000

t = time.time()
for i in range(N):
    tuple([id(a) for a in l])
print("tuple(list)", time.time() - t)

t = time.time()
for i in range(N):
    tuple(map(id, l))
print("tuple(map)", time.time() - t)

t = time.time()
for i in range(N):
    tuple(id(a) for a in l)
print("tuple(genexpr)", time.time() - t)

With Python 3.8 on Linux I get this (times can vary a bit, but usually the differences are similar).

tuple(list) 0.3450343608856201
tuple(map) 0.32802605628967285
tuple(genexpr) 0.4420166015625

So yes, I'll change it to map.

@janezd janezd removed their assignment Jan 29, 2021
@markotoplak markotoplak changed the title [ENH][RFC] Domain transformations in batches for less memory use [ENH] Domain transformations in batches for less memory use Feb 1, 2021
@markotoplak markotoplak added the merge after release Potentially unstable and needs to be tested well. label Feb 1, 2021
@irgolic irgolic removed the merge after release Potentially unstable and needs to be tested well. label Mar 12, 2021
@markotoplak
Member Author

/rebase

@lanzagar lanzagar merged commit a8b3c05 into biolab:master Mar 22, 2021
@markotoplak markotoplak deleted the from-table-progress branch November 25, 2021 15:37