Address various scheduler timing issues #1069

Merged

Member

@seriousben seriousben commented Nov 26, 2024

Context

Running test_graph_behaviours.py continuously with as many as 9 executors results in various failures that do not appear with only one executor.

Errors seen and addressed:

  1. Executor stuck in an infinite loop when a scheduler loop processes 2 state changes.
  2. Reducer function returning more than the expected single output when a reducer finishes before the next parent output finishes.
  3. Reducer function returning more than the expected single output when the ingest file for it happens right before a scheduler run loop.

What

In this PR, on top of addressing the edge cases found, we are also adding many traces and making sure that a scheduler error on one state change does not block the loop from processing the other state changes.
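
The "an error does not block the loop" behaviour can be sketched as below. This is a minimal Python illustration, not the Rust code in server/src/scheduler.rs: `StateChange`, `handle`, and `run_loop` are hypothetical names, and whether a failed state change is retried or marked as processed is not stated in this PR.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Iterable

logger = logging.getLogger("scheduler_sketch")


@dataclass(frozen=True)
class StateChange:
    # Hypothetical stand-in for a scheduler state change.
    id: int
    kind: str


def run_loop(
    state_changes: Iterable[StateChange],
    handle: Callable[[StateChange], None],
) -> tuple[list[int], list[int]]:
    """Handle every state change; collect failures instead of aborting."""
    processed: list[int] = []
    failed: list[int] = []
    for change in state_changes:
        try:
            handle(change)
        except Exception:
            # Trace the error and move on so later changes still get handled.
            logger.exception("state change %s failed", change.id)
            failed.append(change.id)
        else:
            processed.append(change.id)
    return processed, failed
```

With three state changes where the second one raises, the first and third are still handled and only the second ends up in the failed list.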

Reducer problem 1

image

image

Reducer problem 2

image

image

Known edge case to address in a future PR: the scheduler run loop expects to process state changes for a single compute graph at a time. This is an incorrect assumption and can result in edge cases.

Testing

To test fixes and detect edge cases, I ran the following:

TEST_MAX=500 INDEXIFY_URL=http://localhost:8900 command_stress_test poetry run python -u -m unittest tests/test_graph_behaviours.py 2>&1 | tee test-out.log

command_stress_test is https://github.com/seriousben/serious-nixos-config/blob/main/home-manager/files/command_stress_test.fish
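
For readers who do not use fish, the idea can be sketched in Python. This is not a port of the linked script, whose contents are not reproduced here; it assumes only that the tool reruns a command up to TEST_MAX times and tallies failures. `run_command` and `stress` are hypothetical names.

```python
import subprocess
import time
from typing import Callable


def run_command(cmd: list[str]) -> bool:
    """Run cmd once; True means it exited with status 0."""
    return subprocess.run(cmd).returncode == 0


def stress(run_once: Callable[[], bool], max_runs: int) -> dict:
    """Call run_once max_runs times and tally the failures."""
    if max_runs <= 0:
        raise ValueError("max_runs must be positive")
    start = time.monotonic()
    failures = sum(0 if run_once() else 1 for _ in range(max_runs))
    return {
        "total": max_runs,
        "failures": failures,
        "success_rate": 100.0 * (max_runs - failures) / max_runs,
        "elapsed_s": time.monotonic() - start,
    }
```

A call such as `stress(lambda: run_command(["poetry", "run", "python", "-m", "unittest", "tests/test_graph_behaviours.py"]), 500)` would repeat the test module 500 times. One failure in 500 runs gives a success rate of 100 × 499 / 500 = 99.8%.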

Before these changes:

After 480s (107/500 runs), all executors quickly become stuck in a loop doing ingest_file for already finished tasks.

We can still infer these failures:

     12 FAIL: test_map_reduce_operation_1 (tests.test_graph_behaviours.TestGraphBehaviors.test_map_reduce_operation_1)
     11 FAIL: test_pipeline_1 (tests.test_graph_behaviours.TestGraphBehaviors.test_pipeline_1)
      8 FAIL: test_router_graph_behavior_1 (tests.test_graph_behaviours.TestGraphBehaviors.test_router_graph_behavior_1)

After these changes:

===========================
Success Rate = 99.8%
===========================
Failures     = 1
Total        = 500
Elapsed      = 1566s

The single failure seen in this run is:

FAIL: test_map_reduce_operation_1 (tests.test_graph_behaviours.TestGraphBehaviors.test_map_reduce_operation_1)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/seriousben/Library/Caches/pypoetry/virtualenvs/indexify-dlsxfW2b-py3.11/lib/python3.11/site-packages/parameterized/parameterized.py", line 620, in standalone_func
    return func(*(a + p.args), **p.kwargs, **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/seriousben/src/tensorlakeai/indexify/python-sdk/tests/test_graph_behaviours.py", line 318, in test_map_reduce_operation
    self.assertEqual(output_sum_sq, [Sum(val=5)])
AssertionError: Lists differ: [Sum(val=1)] != [Sum(val=5)]

First differing element 0:
Sum(val=1)
Sum(val=5)

- [Sum(val=1)]
?          ^

+ [Sum(val=5)]
?

Future work will look into this other edge case.

Contribution Checklist

  • If the python-sdk was changed, please run make fmt in python-sdk/.
  • If the server was changed, please run make fmt in server/.
  • Make sure all PR Checks are passing.

@seriousben seriousben requested a review from diptanu November 26, 2024 21:54
Collaborator

diptanu commented Nov 27, 2024

@seriousben Can you run some tests on the following scenarios -

  1. Create a graph with no executor and invoke it (this creates tasks but no allocations). Delete the invocation, then bring up an executor (this creates an allocation). I expect something to break here.
  2. Create a graph with an executor and invoke it with a function that takes 10 seconds to complete. Meanwhile, delete the graph and let the task complete. I expect something to break.
  3. Run a map-reduce graph over a sequence of 100000000 with 15 executors running locally on Docker. The map and reduce will run concurrently on all the machines. I expect a lot to break here.

@@ -111,11 +99,41 @@ impl Scheduler {
},
diagnostic_msgs,
}),
- state_changes_processed: processed_state_changes,
+ state_changes_processed: processed_state_changes.iter().map(|x| x.id).collect(),
Collaborator

What does this do? Are we filtering something here?

Member Author

We are mapping each entry of processed_state_changes to its id. Nothing is filtered.
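
As a rough Python analogue of the Rust expression `processed_state_changes.iter().map(|x| x.id).collect()`: a map produces exactly one id per state change, so the result has the same length as the input. `StateChange` and `ids_of` below are hypothetical stand-ins, not types from the server.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateChange:
    # Hypothetical stand-in; the real struct has more fields.
    id: int
    kind: str


def ids_of(processed_state_changes: list[StateChange]) -> list[int]:
    """Project each state change to its id, preserving order and count."""
    return [change.id for change in processed_state_changes]
```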

@seriousben seriousben changed the title make scheduler process its work without being blocked on errors Address various scheduler timing issues Dec 1, 2024
@seriousben seriousben force-pushed the seriousben/scheduler-process-all-state-changes-on-error branch from 3b738fd to 94b1526 Compare December 1, 2024 20:50

pub fn get_compute_parent(&self, node_name: &str) -> Option<&str> {
// Find parent of the node
self.edges
Collaborator

We could just precompute this in a hash map in the ComputeGraph object. But the logic seems fine.

Member Author

For simplicity, and because it is only needed for an edge case, I would like to postpone precomputing it. Precomputing comes with challenges, such as support for existing graphs, that I would prefer not to tackle in this PR.
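
The trade-off discussed in this thread can be sketched in Python. The shape of `edges` is an assumption (parent node name mapped to the names of the nodes it feeds); the real `ComputeGraph` layout is not shown in this PR. `build_parent_index` is a hypothetical helper illustrating the reviewer's suggestion, not code that was merged.

```python
from typing import Optional

# Assumed shape: parent node name -> names of the nodes it feeds.
Edges = dict[str, list[str]]


def get_compute_parent(edges: Edges, node_name: str) -> Optional[str]:
    """Find the parent by scanning the edge lists on every call."""
    for parent, children in edges.items():
        if node_name in children:
            return parent
    return None


def build_parent_index(edges: Edges) -> dict[str, str]:
    """Precompute child -> parent once, as suggested in review."""
    index: dict[str, str] = {}
    for parent, children in edges.items():
        for child in children:
            # Keep the first parent seen so both lookups agree.
            index.setdefault(child, parent)
    return index
```

The scan costs a pass over the edges per lookup but needs no stored state, so graphs that already exist keep working unchanged. The index makes lookups constant time but has to be built for, or migrated into, every stored graph.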

task_key = task.key(),
"Task already completed but allocation still exists, deleting allocation",
);
txn.delete_cf(
Collaborator

I don't think we should check this in. This feels like a bandaid. Let's investigate some more before we do this.

Member Author

I fixed the root cause as part of this PR. But without this, we risk losing executors that get stuck in a bad state.

I think if this happens in the future it should be an alert and we should debug it.

Member Author

Since the root cause is fixed and this will prevent outages in case a similar problem happens in the future, I would like to keep this and have an alert to get us to investigate and fix other root causes.

Member Author

Summary of discussion: Since the root cause is fixed, we'll go ahead with this change.
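
The safeguard debated in this thread can be sketched in Python as below. The real change deletes a key inside a storage transaction (`txn.delete_cf`); here plain dictionaries stand in for the store, and `drop_stale_allocations` and the key formats are hypothetical.

```python
import logging

logger = logging.getLogger("allocation_sketch")


def drop_stale_allocations(
    allocations: dict[str, str],
    completed_tasks: set[str],
) -> list[str]:
    """Delete allocations whose task is already completed.

    allocations maps a hypothetical allocation key to a task key.
    Returns the deleted allocation keys so a caller can alert on them.
    """
    stale = [key for key, task_key in allocations.items() if task_key in completed_tasks]
    for key in stale:
        logger.warning(
            "Task already completed but allocation still exists, deleting allocation: %s",
            allocations[key],
        )
        del allocations[key]
    return stale
```

Returning the deleted keys keeps the cleanup observable: a non-empty result is the signal that could feed the alert proposed above, so the underlying cause still gets investigated.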

@seriousben seriousben force-pushed the seriousben/scheduler-process-all-state-changes-on-error branch 4 times, most recently from 9f7ad73 to ad18f21 Compare December 2, 2024 00:58
if requires_task_allocation {
let task_placement_result = self.task_allocator.schedule_unplaced_tasks()?;
new_allocations.extend(task_placement_result.task_placements);
diagnostic_msgs.extend(task_placement_result.diagnostic_msgs);
}
Member Author

This is what could cause the same task to be allocated multiple times.
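
One possible reading of this comment, which the thread does not spell out, is that scheduling unplaced tasks once per state change can place the same still-unplaced task more than once within a single loop, whereas a `requires_task_allocation` flag defers scheduling to one call. The Python sketch below illustrates that reading only; the function names, the state change kinds, and the round-robin placement are all assumptions.

```python
Placement = tuple[str, str]

# Hypothetical state change kinds that call for task allocation.
NEEDS_ALLOCATION = {"task_created", "executor_added"}


def schedule_unplaced_tasks(unplaced: list[str], executors: list[str]) -> list[Placement]:
    """Place every unplaced task on an executor, round-robin."""
    if not executors:
        return []
    return [(task, executors[i % len(executors)]) for i, task in enumerate(unplaced)]


def allocate_per_change(changes: list[str], unplaced: list[str], executors: list[str]) -> list[Placement]:
    """Schedule inside the loop: the same unplaced list is read once per change."""
    placements: list[Placement] = []
    for kind in changes:
        if kind in NEEDS_ALLOCATION:
            placements.extend(schedule_unplaced_tasks(unplaced, executors))
    return placements


def allocate_once(changes: list[str], unplaced: list[str], executors: list[str]) -> list[Placement]:
    """Only record that allocation is needed, then schedule a single time."""
    requires_task_allocation = any(kind in NEEDS_ALLOCATION for kind in changes)
    if requires_task_allocation:
        return schedule_unplaced_tasks(unplaced, executors)
    return []
```

With two qualifying state changes and one unplaced task, the per-change version yields two placements for that task and the flag version yields one.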

@seriousben seriousben requested a review from diptanu December 2, 2024 02:48
@seriousben seriousben force-pushed the seriousben/scheduler-process-all-state-changes-on-error branch from 41c38d9 to 3f2500f Compare December 2, 2024 11:36
@seriousben seriousben force-pushed the seriousben/scheduler-process-all-state-changes-on-error branch from 3f2500f to 3cf4f75 Compare December 2, 2024 11:42
@seriousben
Member Author

Merging to get rid of lots of timing issues. I am happy to make quick changes before next release as needed @diptanu.

@seriousben seriousben merged commit deee6c0 into main Dec 2, 2024
5 checks passed
@seriousben seriousben deleted the seriousben/scheduler-process-all-state-changes-on-error branch December 2, 2024 13:43