add support for loops in CF graphs
amakelov committed Jul 5, 2024
1 parent f59aa99 commit a2c3b0f
Showing 40 changed files with 3,482 additions and 3,128 deletions.
8 changes: 4 additions & 4 deletions docs/docs/blog/cf.md
@@ -1,9 +1,9 @@
# Tidy computations
In data-driven fields like machine learning, a lot of effort is spent organizing
computational data so that it can be analyzed and manipulated. This blog
post introduces the `ComputationFrame` (CF) data structure, which provides a
natural and simple grammar of operations to automate this. It is implemented as
[part of](https://amakelov.github.io/mandala/03_cf/)
*computational data* — results of running programs — so that it can be analyzed
and manipulated. This blog post introduces the `ComputationFrame` (CF) data
structure, which provides a natural and simple grammar of operations to automate
this. It is implemented as [part of](https://amakelov.github.io/mandala/03_cf/)
[mandala](https://github.com/amakelov/mandala), a Python library for experiment
tracking and incremental computation.

20 changes: 10 additions & 10 deletions docs/docs/topics/02_retracing.md
@@ -82,7 +82,7 @@ with storage:
Loading data
Training model
Getting accuracy
AtomRef(1.0, hid='d16...', cid='b67...')
AtomRef(0.99, hid='d16...', cid='12a...')


## Retracing your steps with memoization
@@ -102,8 +102,8 @@ with storage:
```

AtomRef(hid='d0f...', cid='908...', in_memory=False) AtomRef(hid='f1a...', cid='69f...', in_memory=False)
AtomRef(hid='caf...', cid='5b8...', in_memory=False)
AtomRef(hid='d16...', cid='b67...', in_memory=False)
AtomRef(hid='caf...', cid='bf2...', in_memory=False)
AtomRef(hid='d16...', cid='12a...', in_memory=False)


This puts all the `Ref`s along the way in your local variables (as if you've
@@ -118,7 +118,7 @@ storage.unwrap(acc)



1.0
0.99
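The retracing pattern these hunks exercise looks roughly like the minimal sketch below. It uses made-up ops (`inc`, `double`) in place of the collapsed `load_data`/`train_model`-style pipeline, and it assumes the `from mandala.imports import Storage, op` import path: re-entering a `with storage:` block replays already-memoized calls as `AtomRef`s without executing the op bodies, and `storage.unwrap` fetches the concrete value behind a ref.

```python
# Minimal sketch of the retracing pattern; `inc` and `double` are hypothetical
# stand-ins for the pipeline ops whose definitions are collapsed in this diff.
from mandala.imports import Storage, op  # assumed import path

storage = Storage()  # assumed: defaults to an in-memory storage

@op
def inc(x):
    print("computing inc")
    return x + 1

@op
def double(x):
    print("computing double")
    return 2 * x

with storage:             # first run: executes and memoizes both calls
    y = inc(20)
    z = double(y)

with storage:             # retrace: identical calls hit the memo, nothing prints
    y = inc(20)           # y is an AtomRef, not a plain int
    z = double(y)         # ops accept refs produced by other ops

print(storage.unwrap(z))  # 42 -- the stored value behind the ref
```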



@@ -140,14 +140,14 @@ with storage:
print(acc)
```

AtomRef(hid='d16...', cid='b67...', in_memory=False)
AtomRef(hid='d16...', cid='12a...', in_memory=False)
Training model
Getting accuracy
AtomRef(1.0, hid='6fd...', cid='b67...')
AtomRef(0.99, hid='6fd...', cid='12a...')
Loading data
Training model
Getting accuracy
AtomRef(0.84, hid='158...', cid='6c4...')
AtomRef(0.86, hid='158...', cid='70e...')
Training model
Getting accuracy
AtomRef(0.91, hid='214...', cid='97b...')
@@ -178,8 +178,8 @@ with storage:
print(n_class, n_estimators, storage.unwrap(acc))
```

2 5 1.0
2 10 1.0
2 5 0.99
2 10 0.99
5 10 0.91
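The preceding hunks illustrate the incremental behaviour: parameter combinations seen in earlier runs come back as memoized refs, and only genuinely new combinations execute the op bodies (hence the extra `Training model` / `Getting accuracy` lines). Continuing the earlier sketch with its hypothetical `inc`/`double` ops, the same effect looks like this:

```python
# Continuing the sketch above: sweep over inputs, one of which (20) was already
# computed. Only the new inputs execute the op bodies and print from inside
# them; unwrap turns the refs back into plain values for display.
with storage:
    for x in (20, 21, 22):
        y = inc(x)
        z = double(y)
        print(x, storage.unwrap(z))
```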


@@ -199,5 +199,5 @@ with storage:
print(storage.unwrap(acc), storage.unwrap(model))
```

0.84 RandomForestClassifier(max_depth=2, n_estimators=5)
0.86 RandomForestClassifier(max_depth=2, n_estimators=5)

45 changes: 29 additions & 16 deletions docs/docs/topics/03_cf.md
@@ -259,14 +259,17 @@ print(cf.df(values='refs').to_markdown())
```

Extracting tuples from the computation graph:
var_0@output_0, var_1@output_1 = train_model(y_train=y_train, n_estimators=n_estimators, X_train=X_train)
Joining on columns: {'y_train', 'X_train', 'n_estimators', 'train_model'}
| | X_train | n_estimators | y_train | train_model | var_1 | var_0 |
var_0@output_0, var_1@output_1 = train_model(X_train=X_train, n_estimators=n_estimators, y_train=y_train)
Found variables {'var_0', 'var_1'} containing final elements
For variable var_1, found dependencies in nodes Index(['X_train', 'n_estimators', 'var_1', 'y_train', 'train_model'], dtype='object')
For variable var_0, found dependencies in nodes Index(['X_train', 'n_estimators', 'var_0', 'y_train', 'train_model'], dtype='object')
Merging history for the variable var_0 on columns: {'y_train', 'X_train', 'train_model', 'n_estimators'}
| | y_train | n_estimators | X_train | train_model | var_1 | var_0 |
|---:|:-----------------------------------------------------|:-----------------------------------------------------|:-----------------------------------------------------|:----------------------------------------------|:-----------------------------------------------------|:-----------------------------------------------------|
| 0 | AtomRef(hid='efa...', cid='a6d...', in_memory=False) | AtomRef(hid='98c...', cid='29d...', in_memory=False) | AtomRef(hid='faf...', cid='83f...', in_memory=False) | Call(train_model, cid='c4f...', hid='5f7...') | AtomRef(hid='760...', cid='46b...', in_memory=False) | AtomRef(hid='b25...', cid='462...', in_memory=False) |
| 1 | AtomRef(hid='efa...', cid='a6d...', in_memory=False) | AtomRef(hid='9fd...', cid='4ac...', in_memory=False) | AtomRef(hid='faf...', cid='83f...', in_memory=False) | Call(train_model, cid='5af...', hid='514...') | AtomRef(hid='784...', cid='238...', in_memory=False) | AtomRef(hid='331...', cid='e64...', in_memory=False) |
| 2 | AtomRef(hid='efa...', cid='a6d...', in_memory=False) | AtomRef(hid='235...', cid='c04...', in_memory=False) | AtomRef(hid='faf...', cid='83f...', in_memory=False) | Call(train_model, cid='204...', hid='c55...') | AtomRef(hid='5b7...', cid='f0a...', in_memory=False) | AtomRef(hid='208...', cid='c75...', in_memory=False) |
| 3 | AtomRef(hid='efa...', cid='a6d...', in_memory=False) | AtomRef(hid='120...', cid='9bc...', in_memory=False) | AtomRef(hid='faf...', cid='83f...', in_memory=False) | Call(train_model, cid='3be...', hid='e60...') | AtomRef(hid='646...', cid='acb...', in_memory=False) | AtomRef(hid='522...', cid='d5a...', in_memory=False) |
| 0 | AtomRef(hid='faf...', cid='83f...', in_memory=False) | AtomRef(hid='9fd...', cid='4ac...', in_memory=False) | AtomRef(hid='efa...', cid='a6d...', in_memory=False) | Call(train_model, cid='5af...', hid='514...') | AtomRef(hid='784...', cid='238...', in_memory=False) | AtomRef(hid='331...', cid='e64...', in_memory=False) |
| 1 | AtomRef(hid='faf...', cid='83f...', in_memory=False) | AtomRef(hid='98c...', cid='29d...', in_memory=False) | AtomRef(hid='efa...', cid='a6d...', in_memory=False) | Call(train_model, cid='c4f...', hid='5f7...') | AtomRef(hid='760...', cid='46b...', in_memory=False) | AtomRef(hid='b25...', cid='462...', in_memory=False) |
| 2 | AtomRef(hid='faf...', cid='83f...', in_memory=False) | AtomRef(hid='235...', cid='c04...', in_memory=False) | AtomRef(hid='efa...', cid='a6d...', in_memory=False) | Call(train_model, cid='204...', hid='c55...') | AtomRef(hid='5b7...', cid='f0a...', in_memory=False) | AtomRef(hid='208...', cid='c75...', in_memory=False) |
| 3 | AtomRef(hid='faf...', cid='83f...', in_memory=False) | AtomRef(hid='120...', cid='9bc...', in_memory=False) | AtomRef(hid='efa...', cid='a6d...', in_memory=False) | Call(train_model, cid='3be...', hid='e60...') | AtomRef(hid='646...', cid='acb...', in_memory=False) | AtomRef(hid='522...', cid='d5a...', in_memory=False) |
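As a usage note, `values='refs'` is what keeps the `AtomRef`/`Call` objects visible in the cells above; calling `cf.df()` with its defaults unwraps the value columns instead (the later example in this file shows that form). A small sketch, assuming the `cf` object built earlier in this file:

```python
# Two renderings of the same ComputationFrame (assumes the `cf` object from
# this section of 03_cf.md).
refs_df = cf.df(values='refs')  # cells hold Ref / Call objects, as shown above
vals_df = cf.df()               # value columns unwrapped (floats, models, ...);
                                # the Call columns still show call objects
print(vals_df.to_markdown())
```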


##
@@ -510,16 +513,26 @@ print(cf.df().drop(columns=['X_train', 'y_train']).to_markdown())

Extracting tuples from the computation graph:
X_train@output_0, y_train@output_2 = generate_dataset(random_seed=random_seed)
var_0@output_0, var_1@output_1 = train_model(y_train=y_train, n_estimators=n_estimators, X_train=X_train)
var_0@output_0, var_1@output_1 = train_model(X_train=X_train, n_estimators=n_estimators, y_train=y_train)
var_2@output_0 = eval_model(model=var_0)
Joining on columns: {'random_seed', 'y_train', 'X_train', 'generate_dataset', 'n_estimators', 'train_model'}
Joining on columns: {'random_seed', 'y_train', 'X_train', 'generate_dataset', 'var_0', 'n_estimators', 'train_model'}
| | n_estimators | random_seed | generate_dataset | train_model | var_1 | var_0 | eval_model | var_2 |
|---:|---------------:|--------------:|:---------------------------------------------------|:----------------------------------------------|--------:|:-----------------------------------------------------|:---------------------------------------------|--------:|
| 0 | 80 | 42 | Call(generate_dataset, cid='19a...', hid='c3f...') | Call(train_model, cid='3be...', hid='e60...') | 0.83 | RandomForestClassifier(max_depth=2, n_estimators=80) | Call(eval_model, cid='137...', hid='d32...') | 0.82 |
| 1 | 40 | 42 | Call(generate_dataset, cid='19a...', hid='c3f...') | Call(train_model, cid='5af...', hid='514...') | 0.82 | RandomForestClassifier(max_depth=2, n_estimators=40) | Call(eval_model, cid='38f...', hid='5d3...') | 0.81 |
| 2 | 20 | 42 | Call(generate_dataset, cid='19a...', hid='c3f...') | Call(train_model, cid='204...', hid='c55...') | 0.8 | RandomForestClassifier(max_depth=2, n_estimators=20) | | nan |
| 3 | 10 | 42 | Call(generate_dataset, cid='19a...', hid='c3f...') | Call(train_model, cid='c4f...', hid='5f7...') | 0.74 | RandomForestClassifier(max_depth=2, n_estimators=10) | | nan |
Found variables {'var_2', 'var_0', 'var_1'} containing final elements
For variable var_1, found dependencies in nodes Index(['X_train', 'n_estimators', 'var_1', 'y_train', 'random_seed',
'train_model', 'generate_dataset'],
dtype='object')
For variable var_0, found dependencies in nodes Index(['X_train', 'n_estimators', 'var_0', 'y_train', 'random_seed',
'train_model', 'generate_dataset'],
dtype='object')
For variable var_2, found dependencies in nodes Index(['var_2', 'X_train', 'n_estimators', 'y_train', 'random_seed', 'var_0',
'train_model', 'generate_dataset', 'eval_model'],
dtype='object')
Merging history for the variable var_0 on columns: {'X_train', 'train_model', 'n_estimators', 'y_train', 'random_seed', 'generate_dataset'}
Merging history for the variable var_2 on columns: {'X_train', 'train_model', 'n_estimators', 'y_train', 'random_seed', 'generate_dataset', 'var_0'}
| | random_seed | generate_dataset | n_estimators | train_model | var_1 | var_0 | eval_model | var_2 |
|---:|--------------:|:---------------------------------------------------|---------------:|:----------------------------------------------|--------:|:-----------------------------------------------------|:---------------------------------------------|--------:|
| 0 | 42 | Call(generate_dataset, cid='19a...', hid='c3f...') | 40 | Call(train_model, cid='5af...', hid='514...') | 0.82 | RandomForestClassifier(max_depth=2, n_estimators=40) | Call(eval_model, cid='38f...', hid='5d3...') | 0.81 |
| 1 | 42 | Call(generate_dataset, cid='19a...', hid='c3f...') | 10 | Call(train_model, cid='c4f...', hid='5f7...') | 0.74 | RandomForestClassifier(max_depth=2, n_estimators=10) | | nan |
| 2 | 42 | Call(generate_dataset, cid='19a...', hid='c3f...') | 20 | Call(train_model, cid='204...', hid='c55...') | 0.8 | RandomForestClassifier(max_depth=2, n_estimators=20) | | nan |
| 3 | 42 | Call(generate_dataset, cid='19a...', hid='c3f...') | 80 | Call(train_model, cid='3be...', hid='e60...') | 0.83 | RandomForestClassifier(max_depth=2, n_estimators=80) | Call(eval_model, cid='137...', hid='d32...') | 0.82 |
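Since the result of `cf.df()` appears to be an ordinary pandas `DataFrame` (it is printed via `.to_markdown()` and trimmed with `.drop` above), the partially-computed rows, where `eval_model` never ran and `var_2` is `nan`, can be filtered with standard pandas calls. A small sketch, assuming the `cf` object from this section:

```python
# Keep only the rows whose computation reached eval_model (non-NaN var_2).
df = cf.df().drop(columns=['X_train', 'y_train'])
evaluated = df.dropna(subset=['var_2'])
print(evaluated[['n_estimators', 'var_1', 'var_2']].to_markdown())
```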


Importantly, we see that some computations only partially follow the full