Cannot handle huge command.txt files #125
Comments
Congrats! You broke Linux :D
I think I have a better understanding of the issue now. When the commands file is huge, we currently need to read and rewrite the ~2.5 GB file each time we get the next command to run. This is because we consider the next command to run as the first one in the file. We should take the last line of the file instead, so we only have to truncate the file rather than rewrite it.
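For illustration, a minimal sketch of what "take the last line" could look like (the helper name and behaviour are hypothetical, not SmartDispatch's current code): read only the tail of the file, return its last line, and truncate that line away, so picking a command costs roughly one line's worth of I/O instead of a full multi-GB rewrite.

```python
import os

def pop_last_command(path, chunk=4096):
    """Hypothetical helper: pop the last line of a commands file in place.

    Instead of rewriting the whole (multi-GB) file each time a command is
    picked, read only the tail, truncate the file just before its last line,
    and return that line: O(line length) per pop instead of O(file size).
    Assumes single-writer access and command lines shorter than `chunk`.
    """
    with open(path, "rb+") as f:
        f.seek(0, os.SEEK_END)
        end = f.tell()
        if end == 0:
            return None                       # no commands left
        read_start = max(0, end - chunk)
        f.seek(read_start)
        tail = f.read().rstrip(b"\n")         # tail of the file, trailing newlines dropped
        if not tail:
            f.truncate(read_start)            # only blank lines in the tail window
            return None
        cut = tail.rfind(b"\n")               # newline just before the last line
        last = tail[cut + 1:]
        new_end = read_start + cut + 1 if cut != -1 else 0
        f.truncate(new_end)                   # drop the popped line from the file
        return last.decode()
```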
@MarcCote What you say is true, but these are two distinct issues.

@MarcCote I have a vague memory that we had a proper reason to do it that way in the past. There is also the fact that it was NEVER intended to be used in this way. Finally, I think that having the unfinished_commands.txt and/or having a database might mitigate all this.
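On the database idea, a minimal sketch of what a command queue could look like with Python's built-in sqlite3 (the table, column names, and functions are illustrative, not an existing SmartDispatch feature):

```python
import sqlite3

def open_queue(db_path):
    """Hypothetical SQLite-backed command queue (illustrative schema)."""
    con = sqlite3.connect(db_path, timeout=30, isolation_level=None)  # autocommit mode
    con.execute("""CREATE TABLE IF NOT EXISTS commands (
                       id     INTEGER PRIMARY KEY,
                       cmd    TEXT NOT NULL,
                       status TEXT NOT NULL DEFAULT 'pending')""")
    return con

def claim_next(con):
    """Atomically claim one pending command; return None when the queue is empty."""
    con.execute("BEGIN IMMEDIATE")          # take the write lock before reading
    try:
        row = con.execute(
            "SELECT id, cmd FROM commands WHERE status = 'pending' LIMIT 1").fetchone()
        if row is None:
            con.execute("ROLLBACK")
            return None
        con.execute("UPDATE commands SET status = 'running' WHERE id = ?", (row[0],))
        con.execute("COMMIT")
        return row[1]
    except Exception:
        con.execute("ROLLBACK")
        raise
```

Marking a claimed command as `running` (rather than deleting it) would also give the unfinished-commands tracking for free.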
Maybe those are two distinct issues. Please @gauvinalexandre let us know :). @mgermain |
@MarcCote Well, in the case of the 70 brains, SD is in a way being used as a substitute for MPI to implement parallelism inside their own program. Maybe another way to say what I said in my previous post is that SD was never designed to launch an actual 50 million jobs at once :P If we can find a way to do it nicely, I'm not against it though.
Hey guys! Yes, I agree these are two different issues, but they are related in some way. My issue was that my tasks had a very short process time, while the command.txt (big, but not gigabytes) was relatively long to rewrite. So the workers were continuously asking for more tasks. Linux then thought they were stuck in a multiprocess concurrent lock because of its default OS waiting-timeout parameter, so it killed the workers to prevent it. In other words, the workers were dying of boredom (thanks to @mgermain for figuring this out).

In Max's case, the file is very long to write, so I guess the same will happen in the end: workers killed by boredom while waiting for millions of tasks to be rewritten millions of times. So on one side there is overhead from reading the command.txt file too frequently, and on the other there is overhead from writing it too much. I think the two are related because the ratio of task process time to the number of tasks should stay reasonable; otherwise we'll have problems.

The final message here is that we cannot just throw anything at it yet, until geniuses like @mgermain and @MarcCote find a solution. Part of it is a matter of tweaking smart-dispatch. Thanks again guys!
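To put a rough number on that overhead: if picking a command means rewriting the whole remaining file, the total I/O grows quadratically with the number of commands. A back-of-the-envelope illustration (the 50-byte average line length is an assumption; the 50 million commands and ~2.5 GB figures come from this thread):

```python
# Back-of-the-envelope cost of "rewrite the whole file per command picked".
n_commands = 50_000_000                      # figure mentioned in the thread
line_bytes = 50                              # assumed average command length
file_bytes = n_commands * line_bytes         # ~2.5 GB, matching the thread

# Each pick rewrites roughly the remaining file: n + (n-1) + ... + 1 lines.
total_rewritten = line_bytes * n_commands * (n_commands + 1) // 2

print(f"file size: {file_bytes / 1e9:.1f} GB")
print(f"total bytes rewritten over the run: {total_rewritten / 1e15:.1f} petabytes")
```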
From what I understand, speeding up the "picking a new command to execute" step would solve both issues. I suggest "taking the last line of the commands file", as proposed above, so the file can be truncated instead of rewritten.
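Since several workers pop from the same file, they would still need to serialize that step. A sketch of a worker loop that holds an advisory lock only while picking a command (the lock-file name is illustrative, and `pop_command` stands for any "next command or None" callable, e.g. the hypothetical pop_last_command sketched in an earlier comment):

```python
import fcntl
import subprocess

def worker_loop(commands_path, pop_command):
    """Hypothetical worker loop: serialize the pop with an advisory lock,
    then run the command outside the lock so other workers are not blocked."""
    lock_path = commands_path + ".lock"          # illustrative lock-file name
    while True:
        with open(lock_path, "w") as lock:
            fcntl.flock(lock, fcntl.LOCK_EX)     # exclusive lock only while picking
            cmd = pop_command(commands_path)
            fcntl.flock(lock, fcntl.LOCK_UN)
        if cmd is None:
            break                                # queue is empty; this worker is done
        subprocess.run(cmd, shell=True)          # the actual job runs unlocked
```

Because the lock is released before the command runs, short tasks would no longer leave other workers stalled behind a long file rewrite.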
Attached, an example file from a worker that had a hard time...
Further details to come tomorrow...
@mgermain
770146_mp2_m_worker_105_e.txt